cess, several features are extracted to represent and later re-identify the detected objects. The whole process and all the interactions are described in detail in Sections 2.1 and 2.2.
2.1 Cloud System
The Cloud System performs the object detection task. Due to the complexity of the involved algorithms, it is hosted on a server equipped with a high-end GPU, an NVIDIA GeForce GTX 1080 Ti. The process is described as follows. Firstly, the Local System sends a Detection Request, which is an HTTP request containing the last captured frame, to the Cloud System.
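A minimal sketch of how such a Detection Request could be issued from the Local System with Unity's UnityWebRequest API is given below; the endpoint URL, the class name, and the JPEG payload format are illustrative assumptions, as the paper only specifies that the last captured frame is sent over HTTP.

using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class DetectionRequestClient : MonoBehaviour
{
    // Hypothetical endpoint of the Cloud System; the real address is not given in the paper.
    public string serverUrl = "http://cloud-system.example/detect";

    // Sends the last captured frame as a JPEG payload and waits for the DOs list.
    public IEnumerator SendDetectionRequest(Texture2D lastFrame)
    {
        byte[] jpeg = lastFrame.EncodeToJPG();

        using (var request = new UnityWebRequest(serverUrl, UnityWebRequest.kHttpVerbPOST))
        {
            request.uploadHandler = new UploadHandlerRaw(jpeg);
            request.downloadHandler = new DownloadHandlerBuffer();
            request.SetRequestHeader("Content-Type", "image/jpeg");

            yield return request.SendWebRequest();

            if (string.IsNullOrEmpty(request.error))
                Debug.Log("DOs list received: " + request.downloadHandler.text);
            else
                Debug.LogError("Detection Request failed: " + request.error);
        }
    }
}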
Then, the picture is processed by the Object Detection module, which implements the algorithms presented in Section 1.2. For this work we leveraged pretrained models freely available in TensorFlow (Tensorflow, ), namely Faster R-CNN and R-FCN with ResNet-101, and SSD with Inception ResNet V2. These models are trained on the public COCO dataset (COCO, ).
When the Object Detection module processes a new
image, a list of Detected Objects (DOs) is returned.
Each DO is represented by:
• Bounding Box (BB): the rectangular area of the image in which an object is identified. It consists of a pair of coordinates $c_1(x_1, y_1)$, $c_2(x_2, y_2)$, representing the upper-left and bottom-right corners of the region;
• Class: the most likely class to which the identified object is supposed to belong (e.g. mouse, keyboard, ...);
• Score: the confidence with which the algorithm associates the object with the Class. It is defined as $Score = \{x \mid x \in \mathbb{R},\ x \in [0, 1]\}$.
To avoid false detections, we empirically defined a minimum threshold for the Score: we consider as valid detections only DOs having $Score \geq 0.8$.
Finally, the DOs list is returned to the Local System.
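By way of illustration, the DO returned for each detection and the Score-based filtering could be represented on the client side as in the following C# sketch; the type and member names (DetectedObject, FilterValid, ...) are hypothetical and not taken from the original implementation.

using System.Collections.Generic;
using System.Linq;
using UnityEngine;

// Hypothetical representation of a Detected Object (DO) as returned by the Cloud System.
public class DetectedObject
{
    public Vector2 C1;     // upper-left corner (x1, y1) of the bounding box, in frame pixels
    public Vector2 C2;     // bottom-right corner (x2, y2) of the bounding box, in frame pixels
    public string Class;   // most likely class label, e.g. "mouse", "keyboard"
    public float Score;    // detection confidence in [0, 1]
}

public static class DetectionFilter
{
    // Minimum Score empirically chosen in the paper.
    const float MinScore = 0.8f;

    // Keeps only the DOs whose Score reaches the threshold.
    public static List<DetectedObject> FilterValid(IEnumerable<DetectedObject> detections)
    {
        return detections.Where(d => d.Score >= MinScore).ToList();
    }
}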
2.2 Local System
The Local System runs directly on the HoloLens’
hardware. It is developed in C# by using Unity 3D
(Unity3D, ) and OpenCV (OpenCV, ). Unity 3D allows mapping the information coming from the sensors to a virtual environment in which 1 meter corresponds to 1 virtual unit. This framework makes it possible to build a representation of the real scene, reconstructing the surfaces and tracking the HoloLens' relative position. Furthermore, every virtual object added to the Unity scene is displayed as a hologram in the real space. The OpenCV framework, instead, is used to run image-processing algorithms on the frames.
The system operates in real time, directly on the video stream. According to several variables related to the observed scene and the observer's position, it evaluates whether to issue a new detection request or to keep tracking previously detected objects. Furthermore, it performs the 2D-3D and 3D-2D space conversions needed to map the 2D frame domain into the 3D space domain (and vice-versa). This mapping is required because HoloLens provides raw data from different sources without merging them: the video stream is acquired from a 2D RGB camera, while the third spatial dimension is computed from a central depth camera and four gray-scale cameras placed on the front sides of the headset.
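For the 3D-2D direction, a minimal sketch based on Unity's camera API is shown below; it assumes that a Unity Camera mirrors the pose and projection of the RGB camera when the frame was captured, which is a simplification of the conversion described above.

using UnityEngine;

public static class SpaceConversion
{
    // 3D-2D direction: projects a world-space point back onto the current frame.
    // 'frameCamera' is assumed to mirror the pose and projection of the RGB camera
    // (an assumption of this sketch, not a detail taken from the original system).
    public static Vector2 WorldToFrame(Camera frameCamera, Vector3 worldPoint)
    {
        Vector3 screen = frameCamera.WorldToScreenPoint(worldPoint);
        return new Vector2(screen.x, screen.y);  // the z component (depth) is dropped
    }
}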
In Fig.1, the Local System's flow is shown. From an initial condition, in which the observer is steady, the Local System takes a frame from the Frame capture module (Fig.1, A). At startup, when previous knowledge of the surrounding world cannot be exploited, the Object detection request to server module (Fig.1, G) is triggered to request a new detection from the Cloud System, which then starts processing the received frame to extract and provide the required information. Meanwhile, the Local System computes, for each acquired frame, a gray-scale histogram discretized into 64 bins (this information will be used in a second step; a minimal sketch of its computation is given below).
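The paper does not detail how the histogram is obtained; the following C# sketch shows one straightforward way to compute it from a Unity pixel buffer, whereas the original system relies on OpenCV for its image processing.

using UnityEngine;

public static class FrameHistogram
{
    // Number of bins used in the paper.
    const int Bins = 64;

    // Computes a 64-bin gray-scale histogram from a frame's pixel buffer
    // (e.g. obtained via Texture2D.GetPixels32()).
    public static int[] Compute(Color32[] pixels)
    {
        var histogram = new int[Bins];
        foreach (var p in pixels)
        {
            // Standard luminance conversion from RGB to gray (0..255).
            int gray = (int)(0.299f * p.r + 0.587f * p.g + 0.114f * p.b);
            // Map the 256 gray levels onto 64 bins (4 levels per bin).
            histogram[gray * Bins / 256]++;
        }
        return histogram;
    }
}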
When the Cloud System finishes its computation, it returns a list of DOs. The received data (Receiving detected objects list module, Fig.1, B) is used to build the first representation of the observed objects. First of all, the 2D detections in the camera's space must be mapped into the 3D space. This
action is performed by the 2D-3D Mapping module
(Fig.1, C) as follows:
1. Defined:
   • $P(x_p, y_p)$, the central pixel of a detected region: it is a position in the frame's space;
   • $C(x_c, y_c, z_c)$, the camera position in the world's space, taken at the same time the frame is acquired;
2. $P(x_p, y_p)$ is transformed from the frame space to the world space, obtaining $P'(x_p, y_p, z_p)$;
3. A ray (Fig.2), starting from $C(x_c, y_c, z_c)$ and passing through $P'(x_p, y_p, z_p)$, is traced: the intersection between the ray and the environmental mesh identifies a point $I(x_i, y_i, z_i)$, which is an approximation of the object's location in the world space. This step is easily performed by using the Unity 3D ray-casting and Collider functionalities;
4. Points 1 to 3 are repeated for each detected region (a minimal sketch of this mapping is given after this list).
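As a concrete illustration of steps 1 to 3, the following C# sketch performs the mapping with Unity's ray-casting and Collider facilities; the class and method names are ours, and it assumes that the RGB camera pose is mirrored by a Unity Camera and that the reconstructed surfaces expose colliders, as in the standard spatial-mapping setup.

using UnityEngine;

public static class DetectionMapper
{
    // Maps the central pixel P(x_p, y_p) of a detected region to a 3D point
    // I(x_i, y_i, z_i) on the environmental mesh, following steps 1 to 3 above.
    // 'frameCamera' is assumed to match the pose and projection of the RGB camera
    // at the time the frame was captured (a simplifying assumption of this sketch).
    public static bool TryMapToWorld(Camera frameCamera, Vector2 pixelCenter, out Vector3 worldPoint)
    {
        // Steps 1-2: lift the frame-space pixel into a world-space ray from C through P'.
        Ray ray = frameCamera.ScreenPointToRay(pixelCenter);

        // Step 3: intersect the ray with the environmental mesh (exposed through colliders).
        RaycastHit hit;
        if (Physics.Raycast(ray, out hit))
        {
            worldPoint = hit.point;  // I(x_i, y_i, z_i): approximate object location
            return true;
        }

        worldPoint = Vector3.zero;
        return false;
    }
}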
At each point $I(x_i, y_i, z_i)$ in the world space, the Labels in Mixed reality module (Fig.1, D) displays a