KinFu MOT: KinectFusion with Moving Objects Tracking
Michael Korn and Josef Pauli
Intelligent Systems Group, University of Duisburg-Essen, 47057 Duisburg, Germany
Keywords:
3D, GPU, Tracking, SLAM, Iterative Closest Point (ICP), Point Cloud Registration, Depth Cameras.
Abstract:
Using a depth camera, the KinectFusion algorithm permits tracking the camera poses and building a dense 3D
reconstruction of the environment simultaneously in real-time. We present an extension to this algorithm that
additionally allows the concurrent tracking and reconstruction of several moving objects within the perceived
environment. This is achieved through an expansion of the GPU processing pipeline by several new function-
alities. Our system detects moving objects from the registration results and it creates a separate storing volume
for such objects. Each object and the background are tracked and reconstructed individually. Since the size of
an object is uncertain at the moment of detection, the storing volume grows dynamically. Moreover, a sliding
reduction method stabilizes the tracking of objects with ambiguous registrations. We provide experimental re-
sults showing the effects of our modified matching strategy. Furthermore, we demonstrate the system’s ability
to deal with three different challenging situations containing a moving robot.
1 INTRODUCTION
In recent years visual SLAM and 3D reconstruction
methods with handheld sensors have reached a signif-
icant level of maturity. The requirements of real-time capability and robustness seem to be fulfilled so comprehensively that they are no longer the weak point for several applications. Instead, further needs get more
attention like the spatial extension of the perceiv-
able volume with loop closure functionality (Endres
et al., 2012) and dense mapping (Klein and Murray,
2007). With the emergence of highly available RGB-
D cameras, in particular the Microsoft Kinect, the
meaning of a dense map has developed further to-
gether with the new potentials. In (Izadi et al., 2011;
Newcombe et al., 2011) the KinectFusion algorithm
(KinFu) was introduced and it has become a state-of-
the-art method in terms of robustness, real-time capability and density. Since KinFu accumulates all information obtained from the depth frames in a small (3³ m³) stationary voxel grid, the spatial scope is too small for
several applications. These spatial limitations are al-
ready being tackled, too. With (Heredia and Favie,
2012) an open source KinFu derivative with a mov-
ing voxel grid is available. A similar approach is de-
scribed in (Whelan et al., 2012). This paper also men-
tions ongoing work on SLAM pose-graph optimiza-
tion and hence loop closing.
A large and rarely treated issue is the presence of moving objects. On the one hand, such objects
can interfere with the camera movement reconstruc-
tion and force the algorithm to fail. On the other hand
a moving object can lead to a gain of information.
Such an object can be clearly segmented and recon-
structed from different perspectives, independent of
the camera movement. Furthermore, the object move-
ment can be tracked which yields advantages for ex-
ample in the field of autonomous robot navigation or
articulated objects. This paper presents a way to make
these benefits available by detecting the movement of
rigid objects and creating a dense reconstruction of
each object in a separate voxel grid.
Section 2 summarizes some required background
knowledge concerning KinFu, section 3 describes our
modifications to the matching step of KinFu and our
pipeline extensions, section 4 details our experiments
and results. In the final section we discuss our con-
clusions and future directions.
2 BACKGROUND
Our implementation builds on the open source publi-
cation of KinFu released by the Point-Cloud-Library
(PCL) (Rusu and Cousins, 2011) which is based
strictly on the original descriptions from (Izadi et al.,
2011; Newcombe et al., 2011). In the next two sub-
sections we point at the limitations of KinFu.
2.1 KinectFusion
The KinFu algorithm generates a 3D reconstruc-
tion of the environment on a GPU in real-time by
integrating all available depth maps from a depth
sensor into a discretized Truncated Signed Distance
Functions (TSDF) representation (Curless and Levoy,
1996). The measurements are collected in a voxel
grid in which each voxel stores a truncated distance
to the closest surface including a related weight that
is proportional to the certainty of the stored value. To
integrate the depth data into the voxel grid every in-
coming depth map is transformed into a vertex map
and normal map pyramid. Another deduced vertex
map and normal map pyramid is obtained by ray cast-
ing the voxel grid based on the last known camera
pose. According to the camera view, the grid is sam-
pled by rays searching in steps for zero crossings of
the TSDF values. Both pyramids are registered by a
point-to-plane ICP procedure and the resulting trans-
formation determines the current camera pose. Sub-
sequently, the voxel grid data is updated by iterating
over all voxels and the projection of each voxel into
the image plane of the camera. The new TSDF val-
ues are calculated using a weighted running average.
Usually the weights are also truncated allowing the re-
construction of scenes with some degree of dynamic.
Finally, the maps created by the raycaster are used to
generate a rendered image of the implicit environment
model.
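For illustration, the weighted running average for a single voxel can be written as the following minimal CPU-side sketch. The Voxel struct and the weight truncation constant are illustrative assumptions, not the packed voxel format of the PCL implementation.

// Minimal sketch of the per-voxel TSDF update (weighted running average).
struct Voxel {
    float tsdf;   // truncated signed distance, in [-1, 1]
    float weight; // accumulated certainty
};

// new_tsdf: truncated signed distance of the current measurement,
// new_weight: its certainty, max_weight: weight truncation (assumed value)
// that keeps the model adaptable to some degree of dynamics.
inline void updateVoxel(Voxel &v, float new_tsdf, float new_weight,
                        float max_weight = 128.0f) {
    const float w_sum = v.weight + new_weight;
    v.tsdf   = (v.tsdf * v.weight + new_tsdf * new_weight) / w_sum;
    v.weight = (w_sum < max_weight) ? w_sum : max_weight;
}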
Furthermore, the detection of outliers is within the scope of (Izadi et al., 2011) but is not part of the PCL implementation. The possibility is described to first reconstruct the background and then to use the ICP outliers to detect movement. They primarily intended to allow
interaction with the background model, but the 3D re-
construction of the foreground object was also sug-
gested. However, the prospects are restricted since the
object cannot be reconstructed simultaneously with
the background.
2.2 KinectFusion Limitations
KinFu is primarily limited to the reconstruction of
a small static environment which fits into the 512 ×
512 × 512 voxel grid. Moving objects in the space of
the perceived environment are not totally off-limits,
but they cause outliers in the registration process and
are able to interfere with the estimated camera poses.
In the best case, as long as such objects are moved
quickly, they will not become part of the environment
model and they stay disregarded entirely. The other
issue, the small size of the voxel grid, is caused by the claim of reaching real-time capability and the fact that the runtimes of the raycast process and the depth map integration process rely strongly on the voxel grid size.

Figure 1: Overview of the processing pipeline which is introduced in section 3. Related subsections are indicated in red. Inputs as parallelograms, data as rounded rectangles and processing steps as rectangles.

It is desirable to reduce the impact of the
voxel grid size on the runtimes to increase the num-
ber of voxels. This would allow covering a larger sec-
tion of the environment or to enhance the resolution
of the voxel grid. Furthermore, the targeted additional
reconstruction of several moving objects leads to the
need of being able to use a larger number of voxels to
store the extra information.
3 CONTRIBUTION
In this section we first present an alternative match-
ing method for the ICP. It is designed to deal with a
larger number of voxels, since every object gets its
own voxel grid in addition to the voxel grid of the
background. This method is also able to deal with
more dynamics in the scene due to the fact, that we do
not sample the voxel grid based on old pose informa-
tion. This is especially advantageous when the object
and the camera move in opposed directions. In the re-
maining subsections, we introduce several extensions
to the original KinFu GPU processing pipeline which permit reconstructing the background and several moving rigid objects in parallel and in real-time. A
simplified overview of the resulting pipeline is given
in Figure 1.
KinFuMOT:KinectFusionwithMovingObjectsTracking
649
3.1 Vertex Map to Model Matching
KinFu creates deduced vertex and normal maps from the voxel grid data and registers these maps with the maps obtained from the Kinect. The sampling of the voxel grid causes unwanted information reduction. It may happen that the depth measurements contain no
adequate correspondence in the deduced data, due to
missing points caused by occlusion, a slight offset of
the measured area, imperfect data in the voxel grid
or inaccurate sampling. Furthermore, ray casting the
whole voxel grid is not always necessary and the ef-
fort can be reduced. A sample application is robot
navigation in which no rendered image of the envi-
ronment is needed. Even when a rendered image is
desired, the viewpoint of the virtual camera can dif-
fer from the viewpoint of the Kinect and consequently
two raycast processes are required.
We introduce a method to avoid the registration
with maps which are deduced from the voxel grid. In-
stead, the maps obtained from the Kinect are matched
directly against the model in the voxel grid. Our mod-
ification is based on the idea that the Kinect vertex
map indicates possible correspondences between the
latest data and the model. We conform as far as pos-
sible to the notation from (Izadi et al., 2011) in Algo-
rithm 1 in order to make comparisons simple. Sub-
stantially, Algorithm 1 is a combination of a locally
limited raycaster and a projective ICP algorithm. In
parallel, each GPU thread walks in steps within a
section of a single ray through the voxel grid. The
threads search for an implicit surface within the voxel
grid matching the related point from the vertex map
obtained from the latest kinect frame. On the one
hand, the step size step = trunc dist · 0.6 must be
small enough so that no zero crossing is missed. On
the other hand, too small of steps or even a cell by
cell handling downgrade the results significantly be-
cause of the noisy data. Obviously, larger steps have
a smoothing effect and reduce the runtime, too. The
rays are determined by the latest vertex map V and the
best known Kinect pose T . After a new Kinect frame
arrives, T is initialized with the physical global cam-
era pose obtained from the previous frame and refined
iteratively. Depending on the current Pose T and the
vertex map V the rays run from the estimated cam-
era position to the measured points. The search for
correspondences must only take place along the rays
in the neighborhood of the measured points restricted
by the distance threshold for the ICP plus a tolerance
depending on the step size. The exact surface inter-
section point is computed using a linear interpolation
based on the trilinearly extracted TSDF values of the points g and g_prev.
Algorithm 1: Vertex Map to Model Matching.
Input: vertex map V, normal map N, voxel grid grid, latest transformation (camera pose) T = [R|t] with rotation R and translation t, distance threshold for ICP distThres, step width step
Output: point pairs incl. normal for each matched pixel u
1: for each valid pixel u ∈ N in parallel do
2:   v_u^g ← T · V(u)
3:   ray_dir ← normalize(v_u^g − t)
4:   ray_start ← v_u^g − ray_dir · (distThres + step)
5:   ray_endLen ← 2 · (distThres + step)
6:   ray_len ← max(0, lenToEnterVolume(ray_start, ray_dir))
7:   if ray_len ≥ ray_endLen − step then
8:     exit
9:   g ← ray_start + ray_dir · ray_len
10:  while ray_len < ray_endLen do
11:    g_prev ← g
12:    g ← ray_start + ray_dir · (ray_len + step)
13:    if g left volume bounds then
14:      exit
15:    if zero crossing from grid(g) to grid(g_prev) then
16:      l ← extract trilinear interpolated ray length
17:      l ← min(max(l, ray_len), ray_len + step)
18:      if l < step or l > (2 · distThres + step) then
19:        exit
20:      v_u^model ← ray_start + ray_dir · l
21:      n_u^g ← R · N(u)
22:      return correspondence ⟨v_u^model, v_u^g, n_u^g⟩
23:    ray_len ← ray_len + step
As might be expected from line 17, the trilinear interpolation of the TSDF data is susceptible to faults and the computed points are sometimes located several meters away from the correct position. We have followed the original PCL implementation regarding the interpolation, but since the interpolation is time-consuming and the benefit is questionable, this approach should be subject to further evaluation. After a final check whether the distance threshold is not exceeded, the found correspondences, consisting of the vertex v_u^g, the related normal n_u^g and the computed intersection point v_u^model between the ray and the implicit surface, take part in the optimization process. The approaches to build the linear equation system and to solve it are based on the original KinFu paper and remain unchanged in this subsection.
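For illustration, the contribution of a single correspondence to the linearized point-to-plane system can be sketched as follows. The sketch assumes the usual small-angle linearization with x = (ω, t) and uses Eigen types; it is only a sketch, not the actual PCL kernel code.

#include <Eigen/Dense>

// One correspondence: model point p (v_u^model), data point q (v_u^g), normal n (n_u^g).
// Linearized point-to-plane residual: ((q + ω × q + t) − p) · n ≈ 0 with x = (ω, t).
inline void addCorrespondence(const Eigen::Vector3f &p, const Eigen::Vector3f &q,
                              const Eigen::Vector3f &n,
                              Eigen::Matrix<float, 6, 6> &AtA,
                              Eigen::Matrix<float, 6, 1> &Atb) {
    Eigen::Matrix<float, 6, 1> a;
    a.head<3>() = q.cross(n); // coefficients of the rotation increment
    a.tail<3>() = n;          // coefficients of the translation increment
    const float b = n.dot(p - q);
    AtA += a * a.transpose(); // accumulate the normal equations
    Atb += a * b;
}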
The following part points at some conceptual dif-
ferences between our ICP process and the original
procedure. First of all we abstain from using a thresh-
old between the normals extracted from the voxel grid
and the normals calculated from the depth map. In
our experiments we were not able to find a general
benefit in such a threshold. On the contrary, we ob-
served that in the PCL implementation too many cor-
respondences are rejected because the normals differ
by more than the threshold of 20°. In some experiments
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
650
with our matching approach and a normal threshold,
about 15% of mainly useful correspondences were re-
jected. Aside from the problems with noise and in-
terpolation, mentioned above, the Kinect depth maps
are disturbed by a vertical banding. In these areas
the normal threshold rejects most correspondences
even on homogeneous surfaces. The normal thresh-
old would have to be softened so much that it should
be removed right away. Furthermore, we use the nor-
mals calculated from the depth map and not the nor-
mals extracted from the voxel grid for the point-to-
plane error metric. These modifications show small
improvements on some datasets and small degrada-
tions on other datasets. Overall the results remain the
same. However, the computation of the model normals is relatively expensive and can be avoided. Finally, we always use the full resolution of the Kinect and we
removed the multi-scale ICP alignment proposed in
(Newcombe et al., 2011) since we could not see any
advantage that justifies that additional effort in our ex-
periments.
3.2 Object Initialization
A moving object can be recognized by outliers dur-
ing the ICP registration. Unfortunately, in addition
to noise, the projective ICP algorithm leads to a lot
of outliers which must not be mistaken as movement.
In particular, this ICP needs a large distance thresh-
old, the default is 10cm. On the one hand, detecting
movement only after a large displacement is not sat-
isfactory and it may work only for fast objects since
small displacements are gradually integrated into the
background voxel grid. On the other hand, reducing
the distance threshold has a significant adverse effect
on the results. To consider both, we use a threshold of
10cm, but in the last iteration we decrease it to 3.5cm.
We estimate further potential outliers according to the point-to-plane metric (v_u^model − v_u^g) · n_u^g > 2cm. These correspondences are used in the registration process, but also marked as outliers for further processing.
Subsequently, the neighborhood with a size of 51 × 51 pixels of each marked outlier o_u is investigated. If 90% of the neighbor pixels are marked as outliers, o_u is marked as a moving object. In the next step, each outlier in the 51 × 51 neighborhood of a pixel marked as moving object is marked as moving object, too.
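A CPU-side sketch of this neighborhood test is given below; the mask layout and function name are illustrative assumptions, while the window size and the 90% ratio are taken from the description above.

#include <algorithm>
#include <cstdint>
#include <vector>

// A marked outlier at (px, py) becomes a moving-object candidate if at least
// 90% of the pixels in its 51 x 51 neighborhood are marked as outliers, too.
bool isMovingObjectCandidate(const std::vector<uint8_t> &outlierMask,
                             int width, int height, int px, int py,
                             int window = 51, float ratio = 0.9f) {
    const int r = window / 2;
    int outliers = 0, total = 0;
    for (int y = std::max(0, py - r); y <= std::min(height - 1, py + r); ++y)
        for (int x = std::max(0, px - r); x <= std::min(width - 1, px + r); ++x) {
            ++total;
            outliers += outlierMask[y * width + x] ? 1 : 0;
        }
    return outliers >= static_cast<int>(ratio * total);
}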
If such pixels were detected, an initialization of
a new moving object is taken into account. An ini-
tialization is only carried out if (a) a moving object
was suspected in at least 5 of the sequential previous
frames and (b) it has been more than 30 frames (1 sec-
ond) since the last object initialization. Additionally,
(c) the initialization is postponed by 30 extra frames after each object growing (see subsection 3.6), since growing indicates that an object has not yet been recognized as a whole.
In the case of an initialization, the bounding box for
the marked vertices is determined to obtain the posi-
tion and size of a new voxel grid. This voxel grid is
enlarged by 10% and 10cm in each of the six direc-
tions to allow growing of the object model. Finally,
the volumetric integration of the marked pixels is per-
formed.
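The enlargement of the initial bounding box can be written as a short sketch; the Box struct and the use of Eigen are illustrative assumptions, the margins are the values stated above.

#include <Eigen/Core>

// The bounding box of the marked vertices is enlarged by 10% of its extent
// plus 10cm in each of the six directions to allow growing of the object model.
struct Box { Eigen::Vector3f min, max; };

Box enlargeObjectBox(const Box &bb) {
    const Eigen::Vector3f margin = 0.1f * (bb.max - bb.min)
                                 + Eigen::Vector3f::Constant(0.1f); // 10% + 10cm
    return Box{bb.min - margin, bb.max + margin};
}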
3.3 Matching and Pose Estimation
The pose estimation is distinguished from the original by the vertex map to model matching (Algorithm 1) of every voxel grid voxel_grid_i with i ∈ {bg} ∪ objectIDs. Within seven iterations, first the correspondences between the vertex map V and the background voxel grid voxel_grid_bg are found and subsequently the correspondences between V and all object voxel grids. Since every vertex could be assigned to several voxel grids, the best matching for each V(u) is searched. For this purpose, Algorithm 1 is extended by a correspondence map C within which the best assignment C(u) for each V(u) is stored. In the case of several matchings, the correspondence with the shortest point-to-plane distance (v_u^model − v_u^g) · n_u^g is chosen. As this value is computed for the optimization process anyway, the additional effort for this approach is negligible.
The background model and each detected object have their own registration. Therefore, separate normal equation systems are created for the background and each detected object, which results in different transformation estimates x_i:

(A_i^T A_i) x_i = A_i^T b_i,  with i ∈ {bg} ∪ objectIDs   (1)

In contrast to the original approach, it is not possible to create normal equations for each block of the vertex map (the GPU implementation works in blocks of size 32 × 8 pixels) in the first step and to sum these equations in a subsequent step. Before creating the normal equations, V must be matched with every model and the correspondences have to be fixed. Accordingly, we cannot store the equation systems of small blocks in a buffer, but each row [A_i | b_i] is buffered. Since the assignment C(u) of any pixel position u is distinct, only one common buffer [A | b] for all models is sufficient. After this pure matching and preparation step, Equation 1 is set up for voxel_grid_bg and each other voxel_grid_i. The 6 × 7 matrix [A_i^T A_i | A_i^T b_i] consists of 27 independent elements because of symmetry. The performance of this approach is similar to the original since any of these elements of any object can be computed separately by its own GPU block in parallel with tree-based reduction.
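The 27 independent elements can be accumulated per buffered row as in the following sketch; this is an illustrative CPU version, whereas on the GPU each of the 27 values of each object is reduced by its own block.

// A^T A is a symmetric 6 x 6 matrix (21 unique entries) and A^T b adds 6 more,
// giving the 27 independent elements of [A_i^T A_i | A_i^T b_i].
void accumulateRow(const float a[6], float b, float acc[27]) {
    int k = 0;
    for (int i = 0; i < 6; ++i)
        for (int j = i; j < 6; ++j)
            acc[k++] += a[i] * a[j]; // upper triangle of A^T A
    for (int i = 0; i < 6; ++i)
        acc[k++] += a[i] * b;        // A^T b
}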
3.4 Sliding Reduction
A substantial drawback of the point-to-plane metric is
the tendency to diverge, if the transformation between
the model and the data is not fully constrained. A reg-
istration of a planar surface for example has three un-
constrained degrees of freedom: translation in two di-
rections along the plane and rotation around the plane
normal. Neither of these change the point-to-plane er-
ror significantly and the resulting transformation may
slide in each ICP iteration due to noise. This issue
can already be observed in the original KinFu algo-
rithm, but since tracked objects are smaller and less detailed than the background, especially right af-
ter the initialization, its importance increases. Addi-
tionally, the transformation between the current V and the background results from the camera movement, whereas the transformation between V and an object results from the camera movement and the object
movement. Due to the assumption of marginal move-
ment the ICP starts the processing of a new frame
with the last known transformations. Since the cam-
era movement remains unconsidered during the first
iteration, it flows into the initialization of the object
transformation. This influence may have an impact on
the final estimated transformation in such a way that a
part of the camera movement is ascribed to the object
movement. We deal with the sliding in two steps: first
we compute a rough estimation of the camera move-
ment for the initialization of the transformations. Fur-
thermore, for an underconstrained registration we apply the updated transformation estimation from the ICP only during the last two iterations.
Before the actual ICP we determine the transfor-
mation between the current V and the previous ver-
tex map V̂ based on the previous assignment of correspondences Ĉ. There are no multiple iterations and, as shown in Algorithm 2, this approach is simple compared to the original matching step. Since the first supposed transformation is the identity matrix, we can avoid any coordinate transformation and also the perspective projection into the image coordinate system.

Algorithm 2: Vertex Map V to V̂ Matching.
1: for each pixel u ∈ V in parallel do
2:   if Ĉ(u) = backgroundID and V(u) valid and N̂(u) valid and ‖V̂(u) − V(u)‖ ≤ distThres then
3:     return correspondence ⟨V̂(u), V(u), N̂(u)⟩
In the further process we use the original method for
creating the normal equations and computing the first
estimation of the incremental transformation. It is
used to correct the initial guess of the camera move-
ment. Since the first optimization step has by far the
largest impact, this improves the first object matching
greatly.
Even with a perfect initialization, sliding cannot be prevented. (Gelfand et al., 2003) describes amongst others how the stability of a registration can be estimated reliably. A condition number λ1/λ6 can be computed from the eigenvalues λ1 ≥ ... ≥ λ6 of A_i^T A_i.
A large condition number indicates at least one unconstrained degree of freedom and an unstable registration due to possible sliding. In our experiments with stable situations the condition number is around 100, and it goes far beyond 1000 when we observe sliding. Given this margin a reliable threshold can easily be found and we choose 1000. If this threshold is ex-
ceeded, a situation occurs where the result of the ICP
is questionable and should not be used. Obviously,
using wrong data should be avoided; however, the ob-
ject is probably still in movement and without regis-
tration the tracking will be lost after several frames.
As a compromise, just the last two ICP iterations de-
cide the object movement. Since the camera move-
ment is quite well estimated from this point and the
iterations are few, the sliding is within limits. More-
over, just two iterations are absolutely sufficient be-
cause the data do not contain enough information for
a registration which needs to be approximated over
many iterations. During the first five iterations the
object is still part of the matching process to reduce
false matchings of other objects or the background.
In addition, the first six iterations are always started
with respect to the most recent camera movement es-
timation.
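The stability test itself is small; a sketch using Eigen's eigenvalue solver is shown below. The use of Eigen and the double precision are assumptions, while the threshold of 1000 is the value chosen above.

#include <Eigen/Eigenvalues>

// Flag a registration as underconstrained if the condition number
// lambda_1 / lambda_6 of the symmetric matrix A^T A exceeds the threshold.
bool isUnderconstrained(const Eigen::Matrix<double, 6, 6> &AtA,
                        double threshold = 1000.0) {
    Eigen::SelfAdjointEigenSolver<Eigen::Matrix<double, 6, 6>> solver(AtA);
    const auto &ev = solver.eigenvalues(); // sorted in increasing order
    return ev(5) > threshold * ev(0);      // lambda_max / lambda_min > threshold
}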
3.5 Volumetric Integration
The information from each new frame is integrated into every voxel grid voxel_grid_i with i ∈ {bg} ∪ objectIDs. This can be initiated in parallel since the voxel grids are independent. After the pose estimation step, the camera poses related to the coordinate systems of each voxel_grid_i are known. Furthermore, every pixel of the recent depth map is assigned to one particular voxel_grid_i based on the correspondences
from C. In addition, a distance map D is prepared
where D(u) is the measured distance between the
camera and the perceived surface at pixel u based on
the raw depth map. Our integration method uses ma-
jor parts from the PCL and the simplified concept is
summarized in Algorithm 3. The changes compared
with (Izadi et al., 2011) are located between lines 8 and 14.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
652
Algorithm 3: Volumetric Integration.
1: for each g = (x, y, 0) ∈ voxel_grid_i in parallel do
2:   for each z with g = (g_x, g_y, z) ∈ voxel_grid_i do
3:     v_g ← convert g from grid to global 3D position
4:     v ← T_i^{-1} · v_g
5:     u ← project v into image plane
6:     if v_z > 0 and u ∈ D and D(u) valid then
7:       sdf ← D(u) − ‖t_i − v_g‖
8:       if sdf ≥ −trunc_dist then
9:         if C(u) = i or (C(u) invalid and i = backgroundID) then
10:          tsdf ← min(1, sdf / trunc_dist)
11:        else
12:          if sdf < 0 then
13:            continue
14:          tsdf ← 1
15:        read average tsdf and weight in voxel_grid_i at g
16:        compute and store new average tsdf and weight
The full TSDF update proposed in (Izadi et al.,
2011) is only used in the case that the particular pixel
is assigned by C to the current voxel grid or the cor-
respondence of this pixel is unknown and the current
voxel grid contains the background model. Otherwise the tsdf value for the current integration is set to 1 for all voxels between the camera and the measured surface, and the stored values behind the surface remain unchanged.
Although this procedure works in general and it
is able to complete gradually a moving object which
was initialized partially at the beginning, the rate of
convergence is unsatisfactory. Usually, shortly after
an initialization a cluster of pixels can be matched
with the object. This cluster is surrounded with pixels
which belong to the object but were matched with the
background model since this information was inte-
grated even before the movement was detected. This
is a result of the projective ICP that matches a vertex
with the nearest surface along the particular ray and
not just with the nearest surface of the surrounding
area. While the object area grows due to imperfect
alignments of the depth maps and the voxel grid, the
correspondences with the background model decline
due to exceedances of the distance threshold. We use
these unmatched pixels to accelerate the growing of
the object.
Subsequent to the ICP process the correspon-
dence map C is modified based on the vertex map
V with Algorithm 4. We apply this method three
times to reach more pixels, and in our implementation we use a buffered C to ensure determinism of our whole approach. We investigate the neighborhood of any outlier and any potential outlier according to a large point-to-plane distance (see subsection 3.2) and search the nearest of the 8-connected neighbors with a determined correspondence.
Algorithm 4: Assign Undetermined Vertices.
1: for each pixel u ∈ V \ 1-pixel-border in parallel do
2:   if V(u) valid and (C(u) invalid or u is potential outlier) then
3:     v ← V(u)
4:     buf_dist ← ∞
5:     for each υ ∈ 8-connected neighbors of u do
6:       if C(υ) valid then
7:         dist ← ‖v − V(υ)‖
8:         if dist < buf_dist then
9:           buf_dist ← dist
10:          buf_id ← C(υ)
11:    if buf_dist < v_z · distance_scale then
12:      C(u) ← buf_id
Finally, we adopt the assignment if the distance between the original vertex and its neighbor is lower than a threshold. We use distance_scale = 3 / focal_length, where focal_length is the focal length expressed in pixel units. This scale multiplied with v_z yields three times the distance between the intersections of neighboring rays with the plane parallel to the x and y axes at z = v_z.
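As a small worked example, the threshold check can be sketched as follows; the focal length value is only an assumption for a Kinect-like camera.

// With distance_scale = 3 / focal_length the threshold v_z * distance_scale is
// three times the spacing of neighboring rays on a plane at depth v_z,
// e.g. about 1cm at v_z = 2m for a focal length of 585 pixels (assumed value).
inline bool acceptNeighbor(float neighborDist, float v_z,
                           float focal_length = 585.0f) {
    const float distance_scale = 3.0f / focal_length;
    return neighborDist < v_z * distance_scale;
}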
3.6 Object Voxel Grid Growing
Due to an imperfect initialization of a moving object
and the ability to grow during the volumetric integra-
tion, the volume capacity of an object voxel grid is of-
ten reached. The capacity of a voxel grid is reached if at least one voxel of any of the six sides of the voxel grid was updated after the initialization and it contains a TSDF value < 1. This means the distance between the object model and the voxel grid border is < trunc_dist. Right after every volumetric integration of an object, the borders of the voxel grid are searched for such voxels. Each grid side containing such voxels is expanded by 10% of the grid size in the
related direction. For this purpose a further voxel grid
with the new size is created. The content of the old
voxel grid is copied into the new one and the extended
areas are initialized. Subsequently, the old grid is re-
placed by the new grid and the old grid is released.
Since an object voxel grid is usually small relative to
the background model and all this is done on the GPU
and just the device memory is involved, the time effort
is not particularly great.
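A CPU-side sketch of growing one grid side, reusing the Voxel struct from the sketch in subsection 2.1, illustrates the copy with the proper offset; the real implementation performs this entirely in GPU device memory.

#include <algorithm>
#include <vector>

// Grow an object voxel grid by 10% along the x axis; new cells start empty
// (tsdf = 1, weight = 0) and the old content is copied with an offset if the
// grid grows towards the -x side.
std::vector<Voxel> growGridX(const std::vector<Voxel> &old_grid,
                             int nx, int ny, int nz, bool grow_at_min_side) {
    const int add = std::max(1, nx / 10);       // 10% of the grid size
    const int new_nx = nx + add;
    std::vector<Voxel> grid(new_nx * ny * nz, Voxel{1.0f, 0.0f});
    const int off = grow_at_min_side ? add : 0;
    for (int z = 0; z < nz; ++z)
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x)
                grid[(z * ny + y) * new_nx + x + off] =
                    old_grid[(z * ny + y) * nx + x];
    return grid;
}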
3.7 GPU Implementation
The PCL provides a KinFu version for academic re-
search. There is just one CPU thread and the host
code waits regularly for kernel calls. Because of this,
runtimes can be measured well but the whole algo-
rithm is much slower. With our modifications it is im-
possible to reach acceptable framerates under these
KinFuMOT:KinectFusionwithMovingObjectsTracking
653
conditions. However, the main problem is not the
runtime of our KinFu algorithm, but the duration for
painting five windows (rendered background, one ren-
dered object, depth image, correspondences illustra-
tion and an interactive examination window) instead
of just three. We use three threads: a preparation thread for grabbing the depth images and creating the vertex and normal maps, a thread for controlling and painting the visualization, and an execution thread for the actual GPU processing pipeline. Fur-
thermore, the building of the normal equation systems
and their download to host memory are parallelized with CUDA streams. In the second phase, the detection of outliers and moving objects and the volumetric integration, including growing, of all objects are parallelized. Our concept has two bottlenecks: first, the matching, due to finding distinct correspondences without the risk of race conditions; secondly, the waiting for the ICP results, since many following parts rely on them. This degree of parallelism is sufficient for now, but to keep the code clear the potential has not yet been fully exhausted.
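The thread layout can be sketched with a simple producer/consumer channel; all names and the fixed frame count are illustrative assumptions, the sketch only mirrors the structure described above.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class Channel {                 // minimal single-producer/single-consumer queue
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front()); q_.pop();
        return v;
    }
private:
    std::queue<T> q_; std::mutex m_; std::condition_variable cv_;
};

struct Frame  { /* depth map plus vertex and normal map pyramid */ };
struct Result { /* estimated poses and rendered views */ };

int main() {
    Channel<Frame> frames; Channel<Result> results;
    std::thread prep([&] { for (int i = 0; i < 100; ++i) frames.push(Frame{}); });
    std::thread exec([&] { for (int i = 0; i < 100; ++i) { frames.pop(); results.push(Result{}); } });
    std::thread viz ([&] { for (int i = 0; i < 100; ++i) results.pop(); });
    prep.join(); exec.join(); viz.join();
}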
4 EVALUATION AND RESULTS
We measured the performance and SLAM accuracy of
our matching strategy, comparing it with the original
KinFu implementation based on three datasets with
known ground truth.
Furthermore, we provide qualitative evaluations
of the capabilities of our whole processing pipeline.
For this purpose we chose three scenes recorded with
a handheld Microsoft Kinect containing a moving
robot in a corner of our laboratory. In the corner there are two small desks; on the left desk there is a workstation with display, peripheral equipment and a telephone, and underneath a drawer unit. In some datasets a wall
with vertical blinds and a heating unit was perceived
on the left side. On the right side a large box and a part
of a blackboard can be seen. We used a medium-sized robot with a differential drive and a caster. On an aluminum frame it transports a slightly opened laptop, a battery and, on the top, an open box. In our experiments, the robot itself was large enough to be reconstructed, so we added the carton to investigate difficult
situations. In addition the robot drove slowly since
fast objects are a lesser evil. Fast objects cause early
outliers in the registration process and consequently
they do not largely disturb the tracking of the back-
ground. The initialization as a new object is simple,
too. Moreover, our datasets are short and the robot
moves from the very beginning. This leads to low
weights in the background voxel grid when the robot
Table 1: Comparison of the original matching strategy with
vertex map to model matching proposed in subsection 3.1.
All framerates consider exclusively the pure KinFu algo-
rithm, they include the runtimes of the raycaster but not the
effort for visualization and grabbing.
dataset      RMSE/ATE [cm]        fps
             original   ours      original   ours
fr1/xyz      2.10       2.02      58.4       72.3
fr2/xyz      2.23       2.32      54.7       66.2
fr2/desk     9.54       7.88      51.1       58.4
is already moving and hence the background model
is adapted to the moving object. This constitutes the
challenge to extract the object before the background
model is cluttered with misleading information and
maybe a registration failure occurs. In summary, this
evaluation focuses on situations which are difficult for
the initialization, tracking and growing process.
The evaluation was performed with an Intel
i7-2600@3.40GHz and a NVIDIA GeForce GTX970.
A video of the evaluation results is available at
http://www.is.uni-due.de/kinfu_mot/.
4.1 Vertex Map to Model Matching
We evaluated the vertex map to model matching with
three datasets from (Sturm et al., 2012). All datasets
were obtained from a Microsoft Kinect. The datasets
from (Sturm et al., 2012) include highly accurate and
time-synchronized ground truth camera poses from
a motion capture system, which works with at least
100 Hz and up to 300 Hz. We used the datasets
fr1/xyz with 798, fr2/xyz with 3666 and fr2/desk with
2964 depth frames. All three contain an office scene.
In fr2/desk the Kinect is moved around two tables so
that the loop is closed. To make our and the origi-
nal approach comparable, we used the original PCL implementation as a basis and modified it as little as necessary. For this subsection we just replaced the matching part and removed the multi-scale feature without further changes. In particular, from a global perspective the algorithm works synchronously and thus the runtimes are comparable, too. Neither did we adapt the parameters: 10cm is used as the ICP distance threshold and the truncation distance is 3cm, although a shorter truncation distance leads to better results. Table 1
denotes the root mean square (RMS) absolute trajec-
tory errors (ATE) and the framerates. The ATE di-
rectly measures the difference between points of the
true and the estimated trajectory at each frame and it
is a good measure to rate SLAM methods. We ap-
plied our approach with seven ICP iterations and we
did not switch off the visualization and accordingly
the raycaster. Because our approach does not require
ray casting, the framerates could be increased further.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
654
The raycaster takes for fr2/desk an average of 2.86ms
with a standard deviation of 1.05ms. Nonetheless, our
method shows better framerates in general due to the
reduction of the iteration number. The resulting er-
rors obviously depend on the dataset (two improve-
ments and one degradation) but they are similar over-
all. In summary, we have presented a good alternative
matching method with better runtime and we expect
particular advantages for our KinFu extensions.
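For reference, the RMS ATE reported in Table 1 can be computed as in the following sketch; it assumes the estimated trajectory is already time-associated and aligned with the ground truth, as done by the benchmark tools of (Sturm et al., 2012).

#include <Eigen/Core>
#include <algorithm>
#include <cmath>
#include <vector>

// Root mean square of the per-frame distances between ground-truth and
// estimated camera positions (absolute trajectory error).
double rmseATE(const std::vector<Eigen::Vector3d> &gt,
               const std::vector<Eigen::Vector3d> &est) {
    const size_t n = std::min(gt.size(), est.size());
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += (gt[i] - est[i]).squaredNorm();
    return n ? std::sqrt(sum / n) : 0.0;
}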
4.2 Sliding Reduction
In this subsection a dataset was used in which the robot drives towards the camera at the beginning. The robot then turns to the left side (seen from the camera). To make it harder, the camera was pivoted to the left and right permanently. The scene is subject to sliding because at the beginning just an almost planar surface is responsible for most of the outliers. The moving object is initialized after 97 frames in the shape of this planar surface. It is found in the area of the box and it grows downwards to the robot. The transforma-
tions are underconstrained for about 100 frames until
the robot starts to rotate. As shown in Figure 2 the
robot can be reconstructed correctly from two sides.
After about 700 frames we get two separate models
with high quality. In another experiment we deacti-
vated the estimation of the registration stability based
on the eigenvalues. Obviously, the assumed position
of the object tilts to the right side in Figure 3, after
jumping back and forth. Even a 180° rotation about the surface normal is performed (not shown in the images).
The first step of the sliding reduction, the registration
with the previous depth map, was still used. Without
any sliding reduction approach the tracking diverges
after a few frames because of the continuous moving
of the camera.
4.3 Background Model Preserving
The registrations of the background model in Fig-
ure 4 are not sufficiently constrained. One direction is constrained by the wall and another direction is constrained by the floor. However, the whole background voxel grid can be moved to the left or right along the wall and the floor without affecting the ICP error sig-
nificantly. Only the box on the ground (top image)
and later the drawer unit hold the background voxel
grid at its position. On the other side the robot moves
and turns to the left direction. The robot is a large ob-
ject with a lot of correspondences which are assigned
to the background at the beginning. The robot takes
the background voxel grid along with itself; the voxel grid moves to the left. This can be seen after 182
Figure 2: This dataset benefits from sliding reduction.
Shown are the results after 131 frames (top) and 711 frames
(bottom). The figure represents the depth image, the ren-
dered background, the rendered object and the visualization
of the correspondence map. This shows the matched points
with the background (light green), the matched points with
the object (dark green), potential outlier (yellow), ICP out-
lier (dark red), potential new object (light red), missing
points in depth image (black) and missing correspondences
due to volume boundary (blue).
Figure 3: Experiment without the second step of the sliding
reduction. The object tilts to the right side.
frames (top) due to the outliers on the right side of the box, the table and the table leg. The original KinFu cannot handle this dataset: the voxel grid slides fur-
ther and further until the other wall on the left side is
reached. Our algorithm initializes the new object af-
ter 162 frames. It needs just a few frames to grow and
KinFuMOT:KinectFusionwithMovingObjectsTracking
655
Figure 4: In this dataset the transformation of the back-
ground is underconstrained. Shown are the results after
182 frames (top) and 425 frames (bottom). A huge part
of the robot is still in the background model at this moment
(bottom) since the view to the empty space was blocked by
the robot and the voxels there remain without update. The
background model becomes clear after the view is free, 300
frames later.
after a major part of the correspondences is assigned
correctly the position of the background voxel grid
remains stable. Actually, it returns a little bit back to
the original position. A complete return is impossible
because the background model is already filled with
shifted data. If the robot did not move from the beginning, the weights of the background model could rise and the model would become more stable. Af-
ter the object initialization, the background voxel grid
could return to the old position.
4.4 Parallel Moving Object
An object which moves almost parallel to the image
plane can be a special challenge. Some parts produce
outliers while other parts can be matched to the background for a long time. In addition the background
Figure 5: This dataset shows a parallel moving object. The
box can be matched with the background for a long time.
Long enough to move within the background model (note the different positions of the box within the two rendered background images). Shown are the results after 343 frames (top) and
613 frames (bottom).
voxel grid gets the chance to assimilate data from the moving object. This also leads to a late initialization.
In Figure 5 the object is initialized after 209 frames.
This time the growing starts at the robot and not at the
box. By the way, this example shows the capability to
reconstruct small objects. The size of the first voxel
grid was 43 × 53 × 60 voxels (25.2 × 31.1 × 35.2cm)
and the size shown in the top of Figure 5 was reached
133 frames and 15 grid growing steps after the ini-
tialization. Due to the distance between the robot frame and the box it takes very long before the model
reaches the whole box. As shown in the bottom image
a new potential object is already detected. The initial-
ization delays which are triggered by grid growing are
very useful in this situation.
4.5 Current Investigations
The theoretical framerate regarding our datasets, con-
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
656
sidering just the pure GPU processing pipeline (com-
parable to subsection 4.1), is about 40fps with one re-
constructed object beside the background. The fram-
erate including five windows and grabbing during the
reconstruction of one object is around 25fps (in com-
parison with 12.6fps of the original KinFu from PCL).
The framerate is limited by the painting of the win-
dows. The complete algorithm with one object and
deactivated visualization runs with more than 30fps.
If three objects are initialized the reduction of the
framerate is negligible. This is due to the fact that the
bottleneck is the communication and synchronization
between CPU and GPU. Accordingly, the computer
hardware is insufficiently utilized. Since we use GPU
streams, more objects just mean a better utilization.
We did some experiments with tracking of humans
as an example of non-rigid objects, too. We get good re-
sults if we allow just one moving object. This obser-
vation conforms to (Izadi et al., 2011). Without object
number limit several parts of the body get a separate
voxel grid. The separation is reasonable (individual
movement of the extremities) but some parts become relatively small and may get occluded during the movement. Since our algorithm does not consider relationships between the tracked objects, this is not allowed at the moment. However, as long as occlusion is avoided and the body parts do not get too small, the results are stable.
5 CONCLUSIONS
We extended KinectFusion by the ability to track
and reconstruct several moving objects simultane-
ously. We propose an alternative matching strategy
and some further modifications to the GPU processing
pipeline. The capabilities of our system are demon-
strated with three examples. It was shown that the
stability of object tracking is enhanced due to sliding
reduction. Furthermore, the robustness of the deter-
mination of the camera poses in scenes with moving
objects is improved. Finally, we give an example in which an initially small object can be tracked during its movement parallel to the image plane. Despite the new functionalities our system is still real-time capable.
For future work, we plan to focus on the initializa-
tion. First, we would like to move data from the back-
ground voxel grid to the new object voxel grid dur-
ing the initialization. Second, the detection of mov-
ing object pixels during the initialization should rely on a complete pixel-based segmentation of the correspondence map. This would enhance the completeness of the object initialization and allow relaxing the constraints for triggering a new initialization. Beyond
this, we intend to consider the relations between the
objects with techniques from the field of articulated objects. This can further improve the registration results, especially of small independently moved object
parts – and enable us to implement a voxel grid merge
mechanism for objects which were wrongly initial-
ized twice.
REFERENCES
Curless, B. and Levoy, M. (1996). A volumetric method for
building complex models from range images. In Pro-
ceedings of the 23rd Annual Conference on Computer
Graphics and Interactive Techniques, SIGGRAPH
’96, pages 303–312, New York, NY, USA. ACM.
Endres, F., Hess, J., Engelhard, N., Sturm, J., Cremers, D.,
and Burgard, W. (2012). An evaluation of the RGB-
D SLAM system. In Proc. of the IEEE Int. Conf. on
Robotics and Automation (ICRA), St. Paul, MA, USA.
Gelfand, N., Ikemoto, L., Rusinkiewicz, S., and Levoy, M.
(2003). Geometrically stable sampling for the ICP al-
gorithm. In Fourth International Conference on 3D
Digital Imaging and Modeling (3DIM).
Heredia, F. and Favie, R. (2012). Kinfu Large Scale in PCL.
http://www.pointclouds.org/blog/srcs/fheredia/.
Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe,
R., Kohli, P., Shotton, J., Hodges, S., Freeman, D.,
Davison, A., and Fitzgibbon, A. (2011). Kinectfusion:
Real-time 3d reconstruction and interaction using a
moving depth camera. ACM Symposium on User In-
terface Software and Technology.
Klein, G. and Murray, D. (2007). Parallel tracking and
mapping for small AR workspaces. In Proc. Sixth
IEEE and ACM International Symposium on Mixed
and Augmented Reality (ISMAR’07), Nara, Japan.
Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D.,
Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges,
S., and Fitzgibbon, A. (2011). KinectFusion: Real-
Time Dense Surface Mapping and Tracking. In IEEE
ISMAR. IEEE.
Rusu, R. B. and Cousins, S. (2011). 3D is here: Point
Cloud Library (PCL). In International Conference on
Robotics and Automation, Shanghai, China.
Sturm, J., Engelhard, N., Endres, F., Burgard, W., and Cre-
mers, D. (2012). A benchmark for the evaluation of
rgb-d slam systems. In Proc. of the International Con-
ference on Intelligent Robot Systems (IROS).
Whelan, T., Kaess, M., Fallon, M., Johannsson, H.,
Leonard, J., and McDonald, J. (2012). Kintinuous:
Spatially extended KinectFusion. In RSS Workshop on
RGB-D: Advanced Reasoning with Depth Cameras,
Sydney, Australia.
KinFuMOT:KinectFusionwithMovingObjectsTracking
657