FAST REAL-TIME SEGMENTATION AND TRACKING OF
MULTIPLE SUBJECTS BY TIME-OF-FLIGHT CAMERA
A New Approach for Real-time Multimedia Applications with 3D Camera Sensor
Piercarlo Dondi and Luca Lombardi
Department of Computer Engineering and System Science, University of Pavia, Via Ferrata 1, Pavia, Italy
Keywords:
Time-of-Flight camera, Segmentation, Tracking, Mixed reality.
Abstract:
Time-of-Flight cameras are a new kind of sensor that uses near-infrared light to provide distance measures of an environment. In this paper we present a very fast method for real-time segmentation and tracking that exploits the peculiar characteristics of these devices. The foreground segmentation is achieved by dynamic thresholding and region growing: an appropriate correction, based on flexible intensity thresholding and mathematical morphology, partially compensates for one of the most common problems of TOF cameras, the noise generated by sunlight. By using a Kalman filter to track the retrieved objects, the system is able to correctly handle occlusions and to follow multiple objects placed at different distances. The proposed system is our basic step towards complex multimedia applications, such as augmented reality. An example of mixed reality that integrates color information supplied by a webcam is shown in the experimental results.
1 INTRODUCTION
The introduction of Time-of-Flight (TOF) cameras (Oggier et al., 2004) made real-time depth measurement possible with a single compact sensor, in contrast to previous multi-camera systems such as stereo rigs. Research in recent years has shown great interest in this kind of device in many fields related to computer vision and computer graphics, such as 3D modeling, scene reconstruction, user interaction, and segmentation and tracking of moving people (Kolb et al., 2010). Our research focuses primarily on the last two topics. Our main purpose is the development of a fast and accurate system for segmentation and tracking of multiple subjects that can be used as a basic step for multimedia applications, such as augmented reality or human-machine interaction.
To achieve this goal we considered the strengths and weaknesses of this new kind of device. We designed a real-time foreground segmentation algorithm that exploits the peculiar data provided by TOF cameras and also compensates for one of their most recurrent problems, the noise introduced by sunlight.
Through the subsequent tracking phase, based on a Kalman filter, the system is able to follow multiple subjects even in the presence of occlusions and guarantees the concurrent retrieval of subjects placed at different distances.
The paper is organized as follows: section 2 provides an overview of TOF cameras; section 3 presents the state of the art of TOF-based segmentation algorithms and describes our solution, including the proposed method for compensating sunlight interference; section 4 analyzes our tracking method and its integration with segmentation; in section 5 we show the experimental results and in section 6 our conclusions.
2 TIME-OF-FLIGHT CAMERA
Time-of-Flight cameras are active imaging sensors that use laser light in the near-infrared spectrum to provide distance measures of the scene. There are two main solutions at the base of these devices: pulsed light and modulated light. In the first case a coherent wavefront hits the target and high-frequency photon gating measures the return time of flight; in the second, the emitted light is modulated and the time of flight is measured by phase delay detection.
TOF cameras offer several advantages with respect to laser scanners or stereo cameras: they do not use any moving mechanical components, they work in real time
(18-20 fps), they are insensitive to shadows, and they can measure 3D distance in any kind of scenario. Artificial illumination sources do not interact with the sensor, but the presence of sunlight introduces significant noise. A TOF camera has a nominal range of about 10 m: noise caused by scattering, multi-path effects and ambient light can reduce this value, so the useful range is generally between 2 and 5 meters (Oprisescu et al., 2007).
In our project we use the SR3000 produced by MESA Imaging (Oggier et al., 2004), a modulated-light camera that we operate at 20 MHz. Its active sources emit in the near infrared around 850 nm. The SR3000 supplies two maps per frame with a resolution of 176×144 pixels: the first contains distance information and the second measures the intensity of the light reflected by objects. The sensor is completely indifferent to visible light, so the intensity values depend only on near-infrared light. As a consequence, close objects appear brighter because they reflect more light, while faraway objects appear darker.
3 SEGMENTATION
There are several different approaches to TOF-based segmentation. An interesting solution, based only on depth data, is proposed in (Parvizi and Wu, 2008). The extraction and association of objects are performed by analyzing the probability density function of the depth and the distribution of its histogram.
However, the most common method is certainly background subtraction, a classical computer vision technique widely used for video surveillance and tracking applications. This method is most often pixel-based and only rarely region-based; newly acquired frames are compared to the model and pixels that differ significantly are selected as foreground. In some works the background model is estimated using only the data provided by the TOF camera (Hansen et al., 2008) or (Bevilacqua et al., 2006), while in other cases it is built using a combination of multiple cameras (Guomundsson et al., 2008). In particular, the integration of color and depth data for background modeling is largely used in multimedia applications, as in (Crabb et al., 2008) for background substitution, or in (Bartczak et al., 2008) to create a 3D environment for mixed reality.
However, background subtraction suffers from known problems, such as ghosts appearing when background objects change, or the absorption of immobile persons. Background model generation can also be computationally expensive, especially if a high-resolution 3D model is needed.
Other alternative approaches take advantage of the combination of traditional and TOF cameras. In (Santrac et al., 2006) the depth information is used to select the best input area for a color-based segmentation algorithm (SIOX), while in (Bleiweiss and Werman, 2009) a fusion of color and depth data is employed to create a new segmentation and tracking method. This solution, based on the mean shift algorithm, is intended to compensate for the respective weaknesses of the two different kinds of sensors.
The presented segmentation method is designed so as not to need any preprocessing operations or a priori knowledge of the environment or of the objects. It can be subdivided into two main phases: a first thresholding of the distance map based on the corresponding values of the intensity image, followed by a region growing that starts from seeds planted on peaks of the intensity map.
3.1 Thresholding and Region Growing
Considering the characteristics of the TOF camera summarized in section 2, we decided to use the intensity map as a guide to restrict the area of investigation in the range map and to find good candidates for seeds.
We estimate an appropriate intensity threshold (λ_seed) by applying Otsu's method. This parameter is used to define the set of seeds S (equation 1):

{I_x > λ_seed, ‖P_x − P_s‖ > γ, γ > 1} ⇒ {P_x ∈ S}   (1)

P_x is a point of the distance map, I_x is its corresponding intensity value and P_s is the last seed found. The constraint on the distance between seeds guarantees a better distribution and significantly reduces their number, decreasing the time needed for the subsequent growing step.
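As a minimal sketch, the seeding step could be realized as follows in Python with OpenCV and NumPy. The function name, the 8-bit rescaling of the intensity map before Otsu's method, and the greedy spacing check are our assumptions; only the Otsu threshold and the minimum seed distance γ come from the text above.

```python
import numpy as np
import cv2

def select_seeds(intensity, gamma=8.0):
    """Sketch of the seeding of equation 1: pick pixels brighter than the
    Otsu threshold, visiting them in descending intensity order and keeping
    only those farther than `gamma` from every seed accepted so far."""
    # Otsu's method needs an 8-bit image; rescale the raw intensity map.
    scaled = cv2.normalize(intensity, None, 0, 255, cv2.NORM_MINMAX)
    scaled = scaled.astype(np.uint8)
    lambda_seed, _ = cv2.threshold(scaled, 0, 255,
                                   cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    ys, xs = np.where(scaled > lambda_seed)
    order = np.argsort(scaled[ys, xs])[::-1]   # brightest candidates first
    seeds = []
    for i in order:
        p = np.array([ys[i], xs[i]], dtype=float)
        if all(np.linalg.norm(p - s) > gamma for s in seeds):
            seeds.append(p)
    return seeds, lambda_seed
```

Note that in this sketch λ_seed is expressed in the rescaled 0-255 domain; any threshold derived from it (such as λ in equation 4) would live in the same domain.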
The similarity measure S between a cluster pixel x and a neighboring pixel y is defined in equation 2:

S(x, y) = |µ_x − D_y|   (2)

D_y is the distance value of pixel y and µ_x is a local parameter related to the mean distance value around x (see equation 5). The lower S is, the more similar the pixels are. When a seed is planted, µ_x is initialized to D_x. Considering a 4-connected neighborhood, a pixel x belonging to a cluster C absorbs a neighbor y according to the following conditions:

{x ∈ C, S(x, y) < θ, I_y > λ} ⇒ {y ∈ C}   (3)

λ is an intensity threshold proportional to λ_seed, dynamically calculated for every frame using equation 4:

λ = k·λ_seed, k ∈ [0.25, 0.33]   (4)
θ is a constant parameter estimated experimentally. Our tests with multiple data sequences establish that θ must be a power of 2 greater than 512 to maintain good cluster separation. An optimal default choice is 1024.
When a neighbor y of a cluster pixel x is absorbed, we compute the average distance value µ_y in an incremental manner as follows:

µ_y = (µ_x·α + D_y) / (α + 1)   (5)

The parameter α is a learning factor of the local mean of D. If pixel y has exactly α neighbors in the cluster, and if the mean of D in this neighborhood is exactly µ_x, then µ_y becomes the mean of D when y is added to the cluster.
Each region grows excluding the already analyzed pixels from successive steps. The process is iterated for all seeds in order of descending intensity. Regions that are too small, with a size below a fixed value, are discarded.
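To make the growing rule concrete, here is a minimal Python sketch of one cluster's growth, assuming a queue-based traversal and NumPy arrays for the distance map D, the intensity map I and a label map; the data layout and function signature are ours, while the tests follow equations 2-5.

```python
from collections import deque
import numpy as np

def grow_region(distance, intensity, seed, labels, cluster_id,
                theta=1024.0, lam=0.0, alpha=4.0):
    """Grow one cluster from `seed` over a 4-connected neighborhood.
    `labels` maps each pixel to a cluster id (0 = unassigned)."""
    h, w = distance.shape
    mu = np.zeros_like(distance, dtype=float)  # local mean of D (equation 5)
    sy, sx = seed
    mu[sy, sx] = distance[sy, sx]              # mu is initialized to D_x
    labels[sy, sx] = cluster_id
    queue = deque([(sy, sx)])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if not (0 <= ny < h and 0 <= nx < w) or labels[ny, nx] != 0:
                continue
            # Absorb y if S(x, y) = |mu_x - D_y| < theta (equations 2-3)
            # and its intensity passes the dynamic threshold lambda (eq. 4).
            if (abs(mu[y, x] - distance[ny, nx]) < theta
                    and intensity[ny, nx] > lam):
                # Incremental update of the local mean (equation 5).
                mu[ny, nx] = (mu[y, x] * alpha + distance[ny, nx]) / (alpha + 1)
                labels[ny, nx] = cluster_id
                queue.append((ny, nx))
    return labels
```

Small clusters would then be discarded by counting the pixels per label, as described above.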
Our approach is faster than methods that use global region statistics, such as centroid region growing, where the order in which boundary pixels are tested is significant. µ_y depends only on the history of pixel absorptions until y is first reached by a growing front, not on later steps. Thus, as soon as a pixel y is reached by the cluster boundary, it can be tested for absorption.
The locality of our approach tolerates greater variations of map values inside a cluster because it produces transitive closures of the similarity S. The head and shoulders of the same person, lying at slightly different distances from the camera, are more likely to be segmented as a single cluster rather than as two different clusters (Bianchi et al., 2009).
The advantages of the proposed region growing can be summarized as good boundary quality (intrinsic noise rejection), independence from background models, and independence from shape models.
3.2 Improvements for Noisy Conditions
The proposed approach is very fast and ensures a good compromise between computational time and precision of the results. A previous analysis (Bianchi et al., 2009) described its behavior in terms of correctness (the percentage of correctly extracted foreground data) and completeness (the percentage of the reference data that is explained by the extracted data) as a function of the parameter λ.
In optimal conditions, with no noise generated by sunlight, the tests scored between 94% and 97% in correctness and between 92% and 96% in completeness (see the red dashed line and the blue continuous line in figure 1). In a more general situation, such as a room with a window, the system shows a predictable reduction of completeness (82%-85%), while still maintaining very good correctness values (93%-97%) (see the green dotted line in figure 1). This means that in noisy conditions the algorithm correctly retrieves the main part of the object's shape but excludes some details, for example the hair (see the top row of figure 4). The impact of this issue can be reduced by making the threshold λ less restrictive.

Figure 1: Correctness vs. completeness for different sequences.
First we introduce a new set of points A, defined by equation 6:

{(I_y > λ) ∨ [(I_y < λ) ∧ (I_8n > δ·λ)], δ ∈ [0, 1]} ⇒ {I_y ∈ A}   (6)

where I_8n is the intensity of the neighbours of pixel y, considering 8-connectivity.
This set is obviously larger than the previous one, but it still contains some imprecision (see figure 2(b)). We therefore designed another set of points, called M, by applying to A a series of morphological operations (see figure 2(c)). The union of A and M produces a more precise set, L, that fixes the inaccuracies of A.
Figure 2(d) shows all the points of the distance map whose corresponding intensity belongs to L. The comparison with figure 2(a), which presents the standard thresholding method, shows a notable increase of detail with the new solution.
The appropriate sequence of morphological operations was established experimentally with tests under different light conditions and subjects. Generally the best results were obtained by applying, in order, two dilations, five erosions and a final dilation.

L = A ∪ M   (7)
We can now redefine equation 3 as follows, in accordance with the proposed corrections:

{x ∈ C, S(x, y) < θ, I_y ∈ L} ⇒ {y ∈ C}   (8)

The other parts of the region growing need no further modification and can be implemented as described previously.
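A possible realization of the sets A, M and L is sketched below with OpenCV morphology. The grayscale dilation used for the neighbor test reads equation 6 as "at least one 8-neighbor exceeds δλ"; if the intended reading were "all neighbors", an erosion (local minimum) would take its place. The function name and the 3×3 kernel are our assumptions.

```python
import numpy as np
import cv2

def corrected_mask(intensity, lam, delta=0.5):
    """Build the corrected intensity mask L = A ∪ M (equations 6-7)."""
    kernel = np.ones((3, 3), np.uint8)
    # I_8n: grayscale dilation gives, per pixel, the maximum intensity over
    # its 8-connected neighborhood.
    neigh_max = cv2.dilate(intensity, kernel)
    # Set A (equation 6): a pixel survives if it is above lambda, or if it
    # is below lambda but its neighborhood exceeds delta * lambda.
    A = ((intensity > lam) |
         ((intensity < lam) & (neigh_max > delta * lam))).astype(np.uint8)
    # Set M: the experimentally chosen sequence of morphological operations
    # (two dilations, five erosions, one final dilation).
    M = cv2.dilate(A, kernel, iterations=2)
    M = cv2.erode(M, kernel, iterations=5)
    M = cv2.dilate(M, kernel, iterations=1)
    # Set L (equation 7): the union repairs the imprecisions of A.
    return (A | M).astype(bool)
```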
Figure 2: A distance map filtered with different types of thresholding: (a) points with I_y > λ; (b) only set A with δ = 0.5; (c) only set M; (d) set L.
4 TRACKING
Many related works have examined the potential of Time-of-Flight cameras for tracking. The proposed approaches are very different: they range from an Expectation Maximization algorithm (Hansen et al., 2008) to a method based on depth distribution (Parvizi and Wu, 2008). The use of multiple cameras was investigated in (Guomundsson et al., 2008), but its applicability appears limited to small environments. Another interesting solution involves the integration of color and depth data to obtain a more precise outcome (Bleiweiss and Werman, 2009).
We experimented with a traditional Kalman filter to track the clusters. This kind of solution has already been analyzed in (Bevilacqua et al., 2006) with good results, but in that case the camera was placed so as to provide a top-down view of the scene. This simplifies the processing but significantly reduces the visible area; moreover, all the details of the people are lost. We therefore always use a frontal view of the scene, to obtain a more versatile implementation.
The Kalman state has six dimensions referring to the centroid coordinates, i.e. (x, y, z, v_x, v_y, v_z), all expressed in image coordinates, as the SR3000 provides output data already organized in 3D Cartesian coordinates. After segmenting an image by region growing, we compare the detected clusters with those being tracked. The association between measured clusters and Kalman clusters is evaluated by minimum squared Euclidean distance between their centroids.
We compute a Gaussian representation of each Kalman cluster at time t − 1 and use its updated centroid position at time t to delineate the image region where the cluster should appear in frame t.
In case of cluster occlusion, the Kalman filter tries to estimate the most probable path of the disappeared cluster using its last detected movements, increasing the search area in order to compensate for the estimation error. If the cluster reappears shortly (at most within 30-40 frames) in a position close to the predicted one, it is reassigned to its previous Kalman tracker; otherwise the tracker is reinitialized and reassigned to a new cluster.
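For concreteness, a constant-velocity tracker with the 6D state described above could be set up as follows. The OpenCV KalmanFilter API choice, the noise covariances and the greedy nearest-centroid association are our assumptions, not the paper's exact values.

```python
import numpy as np
import cv2

def make_tracker(centroid, dt=1.0):
    """One Kalman tracker with state (x, y, z, vx, vy, vz)."""
    kf = cv2.KalmanFilter(6, 3)                 # 6 state dims, 3 measured
    kf.transitionMatrix = np.eye(6, dtype=np.float32)
    for i in range(3):                          # x += vx * dt, and so on
        kf.transitionMatrix[i, i + 3] = dt
    kf.measurementMatrix = np.eye(3, 6, dtype=np.float32)  # measure position
    kf.processNoiseCov = np.eye(6, dtype=np.float32) * 1e-2    # assumed
    kf.measurementNoiseCov = np.eye(3, dtype=np.float32) * 1e-1  # assumed
    kf.statePost = np.array([*centroid, 0, 0, 0], np.float32).reshape(6, 1)
    return kf

def associate(trackers, centroids):
    """Greedily pair each tracker with the nearest measured centroid by
    squared Euclidean distance, then correct the filter with it."""
    pairs, used = [], set()
    if not centroids:
        return pairs
    for i, kf in enumerate(trackers):
        pred = kf.predict()[:3].ravel()
        d2 = [np.sum((pred - np.asarray(c)) ** 2) if j not in used else np.inf
              for j, c in enumerate(centroids)]
        j = int(np.argmin(d2))
        if np.isfinite(d2[j]):
            used.add(j)
            kf.correct(np.asarray(centroids[j], np.float32).reshape(3, 1))
            pairs.append((i, j))
    return pairs
```

Trackers left unpaired for more than the 30-40 frame window above would be reinitialized, as described in the text.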
Figure 3: Segmentation without and with Kalman seeding.
The data provided by our method can be used not only for tracking but also to increase the accuracy of the segmentation, as shown in figure 3. In this situation we have two foreground objects to follow: if one of them gets too close to the sensor, its intensity values grow so much that all the seeds become concentrated on it, excluding the second object from region growing. This issue can be fixed by using the information generated by the prediction step of the Kalman filter as additional input for the seeding phase: at time t we plant seeds in all pixels around the centroid predicted for time t + 1, as in the sketch below.
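The Kalman-guided seeding just described might look like the following sketch, where the predicted centroids (assumed here to be available in pixel coordinates) contribute a small square window of extra seeds for the next frame; the window radius is a hypothetical parameter.

```python
def kalman_seeds(predicted_centroids, radius=2):
    """Extra seeds for frame t, planted around the centroids that the
    Kalman filters predict for time t + 1."""
    seeds = []
    for cy, cx in predicted_centroids:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                seeds.append((int(cy) + dy, int(cx) + dx))
    return seeds
```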
Adopting this procedure, it is possible to extract and track middleground and foreground objects at the same time.
5 EXPERIMENTAL RESULTS
5.1 Foreground Segmentation
In order to evaluate the robustness of our segmentation algorithm against noise, we considered several video sequences acquired in our lab under artificial and natural illumination. The extracted foreground objects are visualized as 3D models whose vertex positions are supplied by the distance map and whose color is obtained by normalizing the intensity map. Figure 4 shows the results obtained with different subjects using the old method (section 3.1) and the new one (section 3.2). After the improvements the extracted data contain many more significant details, such as the hair or parts of the arms and legs (see the bottom row of figure 4). This significant increase of completeness involves a small reduction of correctness, generated by new false positives at the border of the extracted objects. However, these inaccuracies are acceptable considering the advantages supplied by a more complete model.
5.2 Performance Evaluation
We ran our segmentation and tracking method on different models of computers in order to determine its performance on low-, mid- and high-end CPUs.
Figure 4: Segmentation results without (top) and with (bottom) the proposed noise correction.
Figure 5: Tracking sequence.
Table 1: Speed on different computers. Computer 1: Intel Core 2 Quad Q9300 2.60 GHz with Nvidia GeForce GTX 260; Computer 2: Intel Core 2 Duo T6600 2.40 GHz with Nvidia GeForce 9800 GTX; Computer 3: Intel Pentium IV 3.00 GHz with Nvidia GeForce FX 5600; Notebook 1: Intel Core 2 Duo L9400 1.86 GHz with integrated graphics; Notebook 2: Intel Pentium M 750 1.86 GHz with ATI Radeon X700.

Computer     | Test 1 | Test 2
Computer 1   | 18 fps | 44 fps
Computer 2   | 18 fps | 42 fps
Computer 3   | 18 fps | 18 fps
Notebook 1   | 18 fps | 28 fps
Notebook 2   | 18 fps | 24 fps
We considered two types of situation: online execution, with input directly provided by the camera, and offline execution, where the program analyzes previously recorded video sequences. The first experiment tests whether the SR3000 can be used at its standard frame rate (18-20 fps, as mentioned in section 2), while the second gives us the maximum obtainable speed. This last figure is particularly relevant considering the capabilities of more recent TOF cameras, such as the SR4000, which can reach 54 fps.
Table 1 summarizes the obtained results. The method turns out not to be computationally expensive: the first test is passed in all cases and the second also shows very good performance. Even with a ten-year-old processor (see Computer 3 in table 1) we can reach real-time execution. It is interesting to note that the GPU load is very small for these operations, involving only the final visualization. It is therefore possible to transfer the heaviest phases of the segmentation algorithm to graphics hardware to take advantage of its parallel computational capabilities. This is a promising solution for increasing performance, especially on a low-end CPU.
5.3 Mixed Reality
Mixed reality is the merging of real and virtual worlds to produce a new environment where physical and digital objects coexist and interact in real time. Time-of-Flight cameras are very useful in this kind of scenario (Bartczak et al., 2008), so it is a good choice for testing all the features of our system.
Figure 6: Mixed reality steps: (a) foreground segmentation; (b) addition of the virtual object; (c) addition of the 3D background provided by the TOF camera.
In an initial phase we considered only the data provided by the SR3000; then we integrated the distance data with the color information supplied by a traditional webcam. This last test was made only to check the feasibility of color integration in our method, so we used a very low-resolution webcam. In future improvements we will consider a high-resolution camera to provide a more appealing outcome.
To solve the camera calibration issue we studied the most recent approaches, such as the orthophoto generation described in (Reulke, 2006) or the projective texture maps analyzed in (Lindner and Kolb, 2007). After some tests with both of them, we chose a solution quite similar to the first one.
In both cases (grayscale and color) we follow the same procedure, summarized in figure 6: first we extract the foreground subject, then we add the virtual object (the column) and finally we insert the background, which is simply the 3D distance map generated by the TOF camera. Some example results with color are shown in figure 7.
Figure 7: Mixed Reality with color.
The system is able to manage the interaction of multiple clusters, and the tracking also appears to be
robust to occlusions. This is illustrated in figure 5, which displays different frames of the same sequence. The colored spheres above the two subjects mark the Kalman trackers; note how the correspondence between each cluster and its tracker is always maintained. The tracking data can also be useful for increasing the realism of the simulation, for example through a real-time adjustment of the position of the virtual objects. A similar solution was first described in (Bartczak et al., 2008), but it suffers from some limitations: it needs an offline computation of the background model and it can discriminate only one subject at a time.
In accordance with the description in section 3, our approach appears to be more general. We do not need any preprocessing operation except the calibration, and all the segmentation steps are executed in real time without any information about the background and under heterogeneous types of illumination.
6 CONCLUSIONS
We have presented a new approach to the segmentation and tracking of multiple subjects that exploits the intrinsic characteristics of the intensity and distance signals generated by modulated-light TOF cameras. Our method is able to reduce the effect of the noise introduced by sunlight interference through a flexible intensity thresholding and mathematical morphology.
The experimental results show that the proposed approach can be used in multimedia applications, such as mixed reality.
The performance tests prove that the system is computationally efficient and can reach real-time execution even on low-end computers. A future parallel implementation of the most computationally intensive parts of the system, such as the mathematical morphology, can further increase performance.
Other improvements include the data fusion of color and TOF cameras for a more robust segmentation.
REFERENCES
Bartczak, B., Schiller, I., Beder, C., and Koch, R. (2008).
Integration of a time-of-flight camera into a mixed
reality system for handling dynamic scenes, moving
viewpoints and occlusions in real-time. In 3DPVT08,
Fourth International Symposium on 3D Data Process-
ing, Visualization and Transmission.
Bevilacqua, A., Stefano, L. D., and Azzari, P. (2006). Peo-
ple tracking using a time-of-flight depth sensor. In
AVSS 06, Video and Signal Based Surveillance. IEEE
Computer Society.
Bianchi, L., Dondi, P., Gatti, R., Lombardi, L., and Lombardi, P. (2009). Evaluation of a foreground segmentation algorithm for 3d camera sensor. In ICIAP 2009, 15th International Conference on Image Analysis and Processing. Springer.
Bleiweiss, A. and Werman, M. (2009). Fusing time-of-flight depth and color for real-time segmentation and tracking. In Dyn3D09, Proceedings of DAGM2009 Workshop on Dynamic 3D Imaging. Springer.
Crabb, R., Tracey, C., Puranik, A., and Davis, J. (2008).
Real-time foreground segmentation via range and
color imaging. In CVPRW2008, Computer Vision and
Pattern Recognition Workshops. IEEE Computer So-
ciety.
Guomundsson, S., Larsen, R., Aanaes, H., Pardas, M., and
Casas, J. R. (2008). Tof imaging in smart room en-
vironments towards improved people tracking. In
CVPRW 2008, Computer Vision and Pattern Recog-
nition Workshops. IEEE Computer Society.
Hansen, D. W., Hansen, M. S., Kirschmeyer, M., Larsen, R., and Silvestre, D. (2008). Cluster tracking with time-of-flight cameras. In CVPRW 2008, Computer Vision and Pattern Recognition Workshops. IEEE Computer Society.
Kolb, A., Barth, E., Koch, R., and Larsen, R. (2010). Time-
of-flight cameras in computer graphics. In Computer
Graphics Forum volume 29, issue 1. Wiley.
Lindner, M. and Kolb, A. (2007). Data-fusion of pmd-based
distance-information and high-resolution rgb-images.
In ISSCS 2007, International Symposium on Signals,
Circuits and Systems.
Oggier, T., Lehmann, M., Kaufmann, R., Schweizer, M.,
Richter, M., Metzler, P., Lang, G., Lustenberger, F.,
and Blanc, N. (2004). An all-solid-state optical range
camera for 3d real-time imaging with sub-centimeter
depth resolution (swissranger). In SPIE 2004, Society
of Photo-Optical Instrumentation Engineers Confer-
ence Series.
Oprisescu, S., Falie, D., Ciuc, M., and Buzuloiu, V. (2007).
Measurements with tof cameras and their necessary
corrections. In ISSCS 2007, International Symposium
on Signals, Circuits and Systems.
Parvizi, E. and Wu, Q. J. (2008). Multiple object tracking based on adaptive depth segmentation. In Canadian Conference on Computer and Robot Vision, pages 273-277. IEEE Computer Society.
Reulke, R. (2006). Combination of distance data with high resolution images. In IEVM06, Image Engineering and Vision Metrology.
Santrac, N., Friedland, G., and Rojas, R. (2006). High
resolution segmentation with a time-of-flight 3d-
camera using the example of a lecture scene.
Technical report, http://www.inf.fu-berlin.de/inst/ag-
ki/eng/index.html.