cess, several features are extracted to represent and later re-identify the detected objects. The whole process and all the interactions are described in detail in Sections 2.1 and 2.2.
2.1 Cloud System
The Cloud System performs the object detection task. Due to the complexity of the involved algorithms, it is hosted on a server equipped with a high-end GPU, an NVIDIA GeForce GTX 1080 Ti. The process is described as follows. Firstly, the Local System sends a Detection Request, which is an HTTP request containing the last captured frame, to the Cloud System.
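A minimal sketch of how such a Detection Request could be issued from the Local System with Unity's UnityWebRequest API is given below; the endpoint URL, the class name, and the JPEG payload format are illustrative assumptions, as the paper only specifies that the last captured frame is sent over HTTP.

using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class DetectionRequestClient : MonoBehaviour
{
    // Hypothetical endpoint of the Cloud System; the real address is not given in the paper.
    public string serverUrl = "http://cloud-system.example/detect";

    // Sends the last captured frame as a JPEG payload and waits for the DOs list.
    public IEnumerator SendDetectionRequest(Texture2D lastFrame)
    {
        byte[] jpeg = lastFrame.EncodeToJPG();

        using (var request = new UnityWebRequest(serverUrl, UnityWebRequest.kHttpVerbPOST))
        {
            request.uploadHandler = new UploadHandlerRaw(jpeg);
            request.downloadHandler = new DownloadHandlerBuffer();
            request.SetRequestHeader("Content-Type", "image/jpeg");

            yield return request.SendWebRequest();

            if (string.IsNullOrEmpty(request.error))
                Debug.Log("DOs list received: " + request.downloadHandler.text);
            else
                Debug.LogError("Detection Request failed: " + request.error);
        }
    }
}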
Then, the picture is processed by the Object Detection module, which implements the algorithms presented in Section 1.2. For this work we leveraged pretrained models freely available in TensorFlow (Tensorflow, ), namely Faster R-CNN and R-FCN with ResNet-101, and SSD with Inception ResNet V2. These models are trained on the public COCO dataset (COCO, ).
When the Object Detection module processes a new
image, a list of Detected Objects (DOs) is returned.
Each DO is represented by:
• Bounding Box (BB): the rectangular area of the image in which an object is identified. It consists of a pair of coordinates $c_1(x_1, y_1)$, $c_2(x_2, y_2)$, representing the upper-left and bottom-right corners of the region;
• Class: the most likely class to which the identified object is supposed to belong (e.g. mouse, keyboard, ...);
• Score: the confidence with which the algorithm associates the object with the Class. It is defined as $Score = \{x \mid x \in \mathbb{R},\ x \in [0, 1]\}$.
To avoid false detections, we empirically defined a minimum threshold for the Score: we consider as valid detections only DOs having $Score \geq 0.8$.
Finally, the DOs list is returned to the Local System.
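By way of illustration, the DO returned for each detection and the Score-based filtering could be represented on the client side as in the following C# sketch; the type and member names (DetectedObject, FilterValid, ...) are hypothetical and not taken from the original implementation.

using System.Collections.Generic;
using System.Linq;
using UnityEngine;

// Hypothetical representation of a Detected Object (DO) as returned by the Cloud System.
public class DetectedObject
{
    public Vector2 C1;     // upper-left corner (x1, y1) of the bounding box, in frame pixels
    public Vector2 C2;     // bottom-right corner (x2, y2) of the bounding box, in frame pixels
    public string Class;   // most likely class label, e.g. "mouse", "keyboard"
    public float Score;    // detection confidence in [0, 1]
}

public static class DetectionFilter
{
    // Minimum Score empirically chosen in the paper.
    const float MinScore = 0.8f;

    // Keeps only the DOs whose Score reaches the threshold.
    public static List<DetectedObject> FilterValid(IEnumerable<DetectedObject> detections)
    {
        return detections.Where(d => d.Score >= MinScore).ToList();
    }
}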
2.2 Local System
The Local System runs directly on the HoloLens’
hardware. It is developed in C# by using Unity 3D
(Unity3D, ) and OpenCV (OpenCV, ). Unity 3D allows mapping the information coming from the sensors to a virtual environment in which 1 meter corresponds to 1 virtual unit. This framework makes it possible to build a representation of the real scene, reconstructing the surfaces and tracking the HoloLens' relative position. Furthermore, every virtual object added to the Unity scene is displayed as a hologram in the real space. The OpenCV framework, instead, is used to run image-processing algorithms on the frames.
The system operates in real time, directly on the video stream. According to several variables related to the observed scene and the observer's position, it evaluates whether to issue a new detection request or to keep tracking previously detected objects. Furthermore, it performs the 2D-3D and 3D-2D space conversions needed to map the 2D frame domain into the 3D space domain (and vice-versa). This mapping is required because HoloLens provides raw data from different sources without merging them: the video stream is acquired from a 2D RGB camera, while the third spatial dimension is computed from a central depth camera and four gray-scale cameras placed on the front sides of the headset.
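For the 3D-2D direction, a minimal sketch based on Unity's camera API is shown below; it assumes that a Unity Camera mirrors the pose and projection of the RGB camera when the frame was captured, which is a simplification of the conversion described above.

using UnityEngine;

public static class SpaceConversion
{
    // 3D-2D direction: projects a world-space point back onto the current frame.
    // 'frameCamera' is assumed to mirror the pose and projection of the RGB camera
    // (an assumption of this sketch, not a detail taken from the original system).
    public static Vector2 WorldToFrame(Camera frameCamera, Vector3 worldPoint)
    {
        Vector3 screen = frameCamera.WorldToScreenPoint(worldPoint);
        return new Vector2(screen.x, screen.y);  // the z component (depth) is dropped
    }
}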
In Fig.1, the Local System's flow is shown. From an initial condition, in which the observer is steady, the Local System takes a frame from the Frame capture module (Fig.1, A). At startup, when previous knowledge of the surrounding world cannot be exploited, the Object detection request to server module (Fig.1, G) is triggered to request a new detection from the Cloud System, which then starts processing the received frame to extract and provide the required information. Meanwhile, the Local System computes, for each acquired frame, a gray-scale histogram discretized into 64 bins (this information will be used in a second step; a minimal sketch of its computation is given below).
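The paper does not detail how the histogram is obtained; the following C# sketch shows one straightforward way to compute it from a Unity pixel buffer, whereas the original system relies on OpenCV for its image processing.

using UnityEngine;

public static class FrameHistogram
{
    // Number of bins used in the paper.
    const int Bins = 64;

    // Computes a 64-bin gray-scale histogram from a frame's pixel buffer
    // (e.g. obtained via Texture2D.GetPixels32()).
    public static int[] Compute(Color32[] pixels)
    {
        var histogram = new int[Bins];
        foreach (var p in pixels)
        {
            // Standard luminance conversion from RGB to gray (0..255).
            int gray = (int)(0.299f * p.r + 0.587f * p.g + 0.114f * p.b);
            // Map the 256 gray levels onto 64 bins (4 levels per bin).
            histogram[gray * Bins / 256]++;
        }
        return histogram;
    }
}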
When the Cloud System finishes its computation, it returns a list of DOs. The received data (Receiving detected objects list module, Fig.1, B) is used to build the first representation of the observed objects. First of all, the 2D detections in the camera's space must be mapped into the 3D space. This
action is performed by the 2D-3D Mapping module
(Fig.1, C) as follows:
1. Defined:
   • $P(x_p, y_p)$, the central pixel of a detected region: it is a position in the frame's space;
   • $C(x_c, y_c, z_c)$, the camera position in the world's space, taken at the same time the frame is acquired;
2. $P(x_p, y_p)$ is transformed from the frame space to the world space, obtaining $P'(x_p, y_p, z_p)$;
3. A ray (Fig.2), starting from $C(x_c, y_c, z_c)$ and passing through $P'(x_p, y_p, z_p)$, is traced: the intersection between the ray and the environmental mesh identifies a point $I(x_i, y_i, z_i)$, which is an approximation of the object's location in the world space. This step is easily performed by using the Unity 3D ray-casting and Collider functionalities;
4. Points 1 to 3 are repeated for each detected region (a minimal sketch of this mapping is given after this list).
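As a concrete illustration of steps 1 to 3, the following C# sketch performs the mapping with Unity's ray-casting and Collider facilities; the class and method names are ours, and it assumes that the RGB camera pose is mirrored by a Unity Camera and that the reconstructed surfaces expose colliders, as in the standard spatial-mapping setup.

using UnityEngine;

public static class DetectionMapper
{
    // Maps the central pixel P(x_p, y_p) of a detected region to a 3D point
    // I(x_i, y_i, z_i) on the environmental mesh, following steps 1 to 3 above.
    // 'frameCamera' is assumed to match the pose and projection of the RGB camera
    // at the time the frame was captured (a simplifying assumption of this sketch).
    public static bool TryMapToWorld(Camera frameCamera, Vector2 pixelCenter, out Vector3 worldPoint)
    {
        // Steps 1-2: lift the frame-space pixel into a world-space ray from C through P'.
        Ray ray = frameCamera.ScreenPointToRay(pixelCenter);

        // Step 3: intersect the ray with the environmental mesh (exposed through colliders).
        RaycastHit hit;
        if (Physics.Raycast(ray, out hit))
        {
            worldPoint = hit.point;  // I(x_i, y_i, z_i): approximate object location
            return true;
        }

        worldPoint = Vector3.zero;
        return false;
    }
}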
At each point $I(x_i, y_i, z_i)$ in the world space, the Labels in Mixed reality module (Fig.1, D) displays a