2 BACKGROUND
Our implementation builds on the KinectFusion
(KinFu) algorithm, which was published in (Izadi
et al., 2011; Newcombe et al., 2011) and has become
a state-of-the-art method for 3D reconstruction in
terms of robustness, real-time capability, and map density. The
next subsection outlines the original KinFu publica-
tion. Afterwards, an enhanced version is introduced.
Finally, the detection of new moving objects is de-
scribed in detail.
2.1 KinectFusion
The KinectFusion algorithm generates a 3D recon-
struction of the environment on a GPU in real-time
by integrating all available depth images from a depth
sensor into a discretized Truncated Signed Distance
Function (TSDF) representation (Curless and Levoy,
1996). The measurements are collected in a voxel grid
in which each voxel stores a truncated distance to the
closest surface together with a weight that is proportional
to the certainty of the stored value. To integrate
the depth data into the voxel grid, every incoming
depth image is transformed into a vertex and
normal map pyramid. A second vertex and normal map
pyramid is obtained by ray casting
the voxel grid based on the last known camera pose.
Starting from the camera view, the grid is sampled along
rays that step through the volume searching for zero crossings of the TSDF
values. Both pyramids are registered by an ICP procedure,
and the resulting transformation determines the
current camera pose. For runtime reasons, the
matching step of the ICP is accomplished by projective
data association and the alignment is rated by a
point-to-plane error metric (see (Chen and Medioni,
1991)). Subsequently, the voxel grid is updated
by iterating over all voxels and projecting each
voxel into the image plane of the camera. The new
TSDF values are calculated using a weighted running
average. The weights are also truncated, which allows
the reconstruction of scenes containing a limited amount of dynamic
objects. Finally, the maps created by the ray
caster are used to generate a rendered image of the
implicit environment model.
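To make the weighted running average concrete, the following minimal C++ sketch shows how a single voxel could be updated from one depth measurement, assuming the signed distance has already been derived from the depth image; the names (Voxel, updateVoxel, maxWeight) and the unit weight per measurement are our own illustrative choices and are not taken from the original KinFu implementation.

#include <algorithm>

// A voxel of the discretized TSDF representation.
struct Voxel {
    float tsdf;    // truncated signed distance to the closest surface
    float weight;  // certainty of the stored value (capped at maxWeight)
};

// sdf:        signed distance of the voxel to the measured surface
// truncation: truncation band around the surface
// maxWeight:  weight cap that keeps the model adaptable to scene changes
void updateVoxel(Voxel &v, float sdf, float truncation, float maxWeight)
{
    if (sdf < -truncation)
        return;                                     // voxel far behind the surface: skip
    float tsdf = std::min(1.0f, sdf / truncation);  // truncate to [-1, 1]
    float w = 1.0f;                                 // weight of the new measurement
    v.tsdf   = (v.tsdf * v.weight + tsdf * w) / (v.weight + w);
    v.weight = std::min(v.weight + w, maxWeight);   // truncated (capped) weight
}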
2.2 KinectFusion with Moving Objects Tracking
Based on the open-source implementation of KinFu
released in the Point Cloud Library (PCL; (Rusu and
Cousins, 2011)), Korn and Pauli (Korn and Pauli, 2015)
extended KinFu with the ability to
reconstruct the static background and several moving
rigid objects simultaneously and in real-time. Independent
models are constructed for the background
and each object. Each model is stored in its own voxel
grid. During the registration process, the best matching
model among all models is determined for each pixel
of the depth image. This yields a second output
of the ICP besides the alignment of all existing
models: the correspondence map. This map contains,
for each pixel of the depth image, the assignment to a
model or information about why the matching failed. First
of all, the correspondence map is needed to construct
the equation systems which minimize the alignment
errors. Afterwards, it is used to detect new moving
objects. It is highly unlikely that the entire shape and
extent of an object can be observed during its initial
detection. While further frames are processed, the
initial model is updated and extended. Because of this,
the initially allocated voxel grid may turn out to be too
small; in this case, the voxel grid grows dynamically.
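A compact way to think of the correspondence map is as one record per depth pixel that stores either the matched model or the reason why the matching failed. The following C++ sketch is an illustrative data layout under that assumption; the type and field names are ours and do not reflect the actual implementation of (Korn and Pauli, 2015).

#include <cstdint>
#include <vector>

// Outcome of the per-pixel data association during ICP.
enum class MatchResult : std::uint8_t {
    Matched,  // pixel assigned to a model (see modelId)
    NoDepth,  // invalid or missing depth measurement
    Outlier   // residual too large for every existing model
};

struct Correspondence {
    MatchResult result;
    std::uint8_t modelId;  // matched model, e.g. 0 = static background
};

// One entry per pixel of the depth image, filled during registration and
// reused both for building the ICP equation systems and for the detection
// of new moving objects.
using CorrespondenceMap = std::vector<Correspondence>;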
2.3 Detection of New Moving Objects
The original detection approach from (Korn and
Pauli, 2015) is illustrated in Fig. 1 by the com-
puted correspondence maps. The figure shows selected
frames from a dataset recorded with a handheld
Microsoft Kinect for Xbox 360. In the center of the
scene, a mid-sized robot with a differential drive and
a caster wheel is moving. It transports a laptop, a
battery, and an open box on top of an aluminum frame. At
first most pixels can be matched with the static back-
ground model. Then more and more registration out-
liers occur (dark and light red). In addition, potential
outliers (yellow) were introduced in (Korn and Pauli,
2015). These are vertices with a point-to-plane dis-
tance that is small enough (< 3.5 cm) to be treated as
an inlier in the alignment process. On the other hand, the
distance is noticeably large (> 2 cm), so such matches
cannot be considered entirely certain. Because
of this, potential outliers are processed like outliers
during the detection phase.
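Expressed as code, the classification of a single ICP correspondence by its point-to-plane distance could look as follows; the thresholds (2 cm and 3.5 cm) are the ones named above, while the function and type names are hypothetical.

#include <cmath>

enum class MatchClass { Inlier, PotentialOutlier, Outlier };

MatchClass classifyMatch(float pointToPlaneDist)
{
    const float kOutlierDist   = 0.035f; // > 3.5 cm: rejected from the alignment
    const float kPotentialDist = 0.020f; // > 2 cm: used as inlier, but not fully trusted

    float d = std::fabs(pointToPlaneDist);
    if (d >= kOutlierDist)
        return MatchClass::Outlier;
    if (d >= kPotentialDist)
        return MatchClass::PotentialOutlier; // treated like an outlier during detection
    return MatchClass::Inlier;
}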
The basic idea of the detection is that small accumulations
of outliers can occur for manifold reasons,
whereas large clusters are mainly caused by movement
in the scene. The detection in (Korn and Pauli,
2015) is performed in a window-based manner for each pixel. The
51 × 51 pixel neighborhood of each
marked outlier is investigated. If 90% of the neighboring
pixels are marked as outliers or potential outliers,
then the pixel in the center of the window is marked as
a new moving object. In the next step, each (potential)
outlier in a much smaller 19 × 19 neighborhood of a
pixel marked as a moving object is also marked as a new moving object.
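The two detection steps can be summarized by the following C++ sketch, which operates on a boolean mask marking (potential) outliers per pixel; the window sizes (51 × 51 and 19 × 19) and the 90% threshold follow the description above, whereas the brute-force loops and all names are our own simplification.

#include <vector>

using Mask = std::vector<char>; // row-major, width * height entries

// Fraction of (potential) outliers inside a (2*radius+1)^2 window around (x, y).
static float outlierRatio(const Mask &outlier, int w, int h, int x, int y, int radius)
{
    int hits = 0, total = 0;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= w || ny >= h)
                continue;
            ++total;
            if (outlier[ny * w + nx])
                ++hits;
        }
    return total > 0 ? static_cast<float>(hits) / total : 0.0f;
}

Mask detectMovingObjectPixels(const Mask &outlier, int w, int h)
{
    Mask moving(outlier.size(), 0);
    // Step 1: 51 x 51 window, at least 90% (potential) outliers.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if (outlier[y * w + x] && outlierRatio(outlier, w, h, x, y, 25) >= 0.9f)
                moving[y * w + x] = 1;
    // Step 2: every (potential) outlier within a 19 x 19 neighborhood of a
    // detected pixel is also marked as part of the new moving object.
    Mask grown = moving;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if (moving[y * w + x])
                for (int dy = -9; dy <= 9; ++dy)
                    for (int dx = -9; dx <= 9; ++dx) {
                        int nx = x + dx, ny = y + dy;
                        if (nx >= 0 && ny >= 0 && nx < w && ny < h && outlier[ny * w + nx])
                            grown[ny * w + nx] = 1;
                    }
    return grown;
}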