sidering just the pure GPU processing pipeline (comparable to subsection 4.1), is about 40 fps with one reconstructed object in addition to the background. Including five windows and grabbing, the framerate during the reconstruction of one object is around 25 fps (compared to the 12.6 fps of the original KinFu from PCL); here the framerate is limited by the rendering of the windows. The complete algorithm with one object and deactivated visualization runs at more than 30 fps.
If three objects are initialized, the reduction of the framerate is negligible. This is because the bottleneck is the communication and synchronization between CPU and GPU, so the computer hardware is insufficiently utilized. Since we use GPU streams, additional objects simply lead to a better utilization.
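The following CUDA sketch illustrates this scheduling idea: one integration kernel per object is issued on its own stream, with a single synchronization point per frame. All identifiers (VoxelGrid, integrate_kernel, NUM_OBJECTS) are illustrative assumptions, not taken from our implementation.

// Minimal sketch: per-object TSDF integration on separate CUDA streams.
// All names are illustrative; the kernel body is omitted.
#include <cuda_runtime.h>

#define NUM_OBJECTS 3
#define GRID_DIM 256

struct VoxelGrid { float* tsdf; int dim; };

// Hypothetical per-object integration kernel (one thread per voxel).
__global__ void integrate_kernel(VoxelGrid grid, const float* depth,
                                 int width, int height) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= grid.dim * grid.dim * grid.dim) return;
    // ... fuse the current depth map into voxel v (omitted) ...
}

int main() {
    cudaStream_t streams[NUM_OBJECTS];
    VoxelGrid grids[NUM_OBJECTS];
    float* depth = nullptr;  // device depth map; upload omitted
    for (int i = 0; i < NUM_OBJECTS; ++i) {
        cudaStreamCreate(&streams[i]);
        grids[i].dim = GRID_DIM;
        cudaMalloc(&grids[i].tsdf,
                   sizeof(float) * GRID_DIM * GRID_DIM * GRID_DIM);
    }
    int voxels = GRID_DIM * GRID_DIM * GRID_DIM;
    int threads = 256, blocks = (voxels + threads - 1) / threads;
    // One launch per object; kernels on different streams may overlap,
    // so additional objects mostly fill idle GPU capacity.
    for (int i = 0; i < NUM_OBJECTS; ++i)
        integrate_kernel<<<blocks, threads, 0, streams[i]>>>(
            grids[i], depth, 640, 480);
    // A single synchronization point per frame keeps the CPU-GPU
    // round trips (the actual bottleneck) to a minimum.
    cudaDeviceSynchronize();
    for (int i = 0; i < NUM_OBJECTS; ++i) {
        cudaFree(grids[i].tsdf);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}

Because the per-object kernels are independent, issuing them on separate streams lets the GPU scheduler overlap them, which is consistent with the negligible framerate reduction we observe for additional objects.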
We also did experiments with tracking humans as an example of non-rigid objects. We get good results if we allow just one moving object; this observation conforms to (Izadi et al., 2011). Without a limit on the number of objects, several parts of the body get a separate voxel grid each. The separation is reasonable (the extremities move individually), but some parts become relatively small and may get occluded during the movement. Since our algorithm does not consider relationships between the tracked objects, this is not allowed at the moment. However, as long as occlusion is avoided and the body parts do not get too small, the results are stable.
5 CONCLUSIONS
We extended KinectFusion by the ability to track and reconstruct several moving objects simultaneously. We proposed an alternative matching strategy and some further modifications to the GPU processing pipeline. The capabilities of our system were demonstrated with three examples. We showed that sliding reduction enhances the stability of object tracking. Furthermore, the robustness of the camera pose determination in scenes with moving objects is improved. Finally, we gave an example in which an object that is small at the beginning can be tracked during its movement parallel to the image plane. Despite the new functionalities, our system is still real-time capable.
For future work, we plan to focus on the initialization. First, we would like to move data from the background voxel grid to the new object voxel grid during the initialization. Second, the detection of moving object pixels during the initialization should rely on a complete pixel-based segmentation of the correspondence map. This would enhance the completeness of the object initialization and allow us to relax the constraints for triggering a new initialization. Beyond this, we intend to consider the relations between the objects with techniques from the field of articulated objects. This can further improve the registration results – especially of small, independently moving object parts – and enable us to implement a voxel grid merging mechanism for objects which were wrongly initialized twice.
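As a rough illustration of this planned pixel-based segmentation, the following hedged CUDA sketch classifies each pixel by a residual derived from the correspondence map; the kernel name classify_moving_pixels, the residual input, and the threshold criterion are assumptions for illustration, not our implemented method.

#include <cuda_runtime.h>

// Sketch: mark pixels whose point-to-correspondence residual does not
// fit the estimated background motion as moving-object candidates.
// Connected regions of the mask could then trigger an initialization.
__global__ void classify_moving_pixels(const float* residual, // per-pixel residual
                                       unsigned char* mask,   // 1 = candidate
                                       int n_pixels, float threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pixels)
        mask[i] = (residual[i] > threshold) ? 1 : 0;
}

// Example launch for a 640x480 correspondence map with a 2 cm threshold:
//   classify_moving_pixels<<<(640 * 480 + 255) / 256, 256>>>(
//       residual, mask, 640 * 480, 0.02f);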
REFERENCES
Curless, B. and Levoy, M. (1996). A volumetric method for
building complex models from range images. In Pro-
ceedings of the 23rd Annual Conference on Computer
Graphics and Interactive Techniques, SIGGRAPH
’96, pages 303–312, New York, NY, USA. ACM.
Endres, F., Hess, J., Engelhard, N., Sturm, J., Cremers, D., and Burgard, W. (2012). An evaluation of the RGB-D SLAM system. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), St. Paul, MN, USA.
Gelfand, N., Ikemoto, L., Rusinkiewicz, S., and Levoy, M.
(2003). Geometrically stable sampling for the ICP al-
gorithm. In Fourth International Conference on 3D
Digital Imaging and Modeling (3DIM).
Heredia, F. and Favier, R. (2012). Kinfu Large Scale in PCL. http://www.pointclouds.org/blog/srcs/fheredia/.
Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., and Fitzgibbon, A. (2011). KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In ACM Symposium on User Interface Software and Technology (UIST).
Klein, G. and Murray, D. (2007). Parallel tracking and
mapping for small AR workspaces. In Proc. Sixth
IEEE and ACM International Symposium on Mixed
and Augmented Reality (ISMAR’07), Nara, Japan.
Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D.,
Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges,
S., and Fitzgibbon, A. (2011). KinectFusion: Real-
Time Dense Surface Mapping and Tracking. In IEEE
ISMAR. IEEE.
Rusu, R. B. and Cousins, S. (2011). 3D is here: Point
Cloud Library (PCL). In International Conference on
Robotics and Automation, Shanghai, China.
Sturm, J., Engelhard, N., Endres, F., Burgard, W., and Cremers, D. (2012). A benchmark for the evaluation of RGB-D SLAM systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS).
Whelan, T., Kaess, M., Fallon, M., Johannsson, H.,
Leonard, J., and McDonald, J. (2012). Kintinuous:
Spatially extended KinectFusion. In RSS Workshop on
RGB-D: Advanced Reasoning with Depth Cameras,
Sydney, Australia.