REAL-TIME 3D MULTIPLE HUMAN TRACKING WITH
ROBUSTNESS ENHANCEMENT THROUGH MACHINE LEARNING
Suraj Nair, Emmanuel Dean-Leon and Alois Knoll
Robotics and Embedded Systems, Technische Universität München, München, Germany
Keywords:
Visual Tracking.
Abstract:
This paper presents a novel and robust vision-based real-time 3D multiple human tracking system. It is capable
of automatically detecting and tracking multiple humans in real time, even when they occlude each other,
and it is robust to drastically changing lighting conditions. The system consists of two parts: (1) a vision-based
human tracking system using multiple visual cues with a robust occlusion handling module, and (2) a machine
learning based module for intelligent multi-modal fusion and for self-adapting the system to drastic light
changes. The paper also proposes an approach to validate the system against zero-error ground truth data
obtained from virtual environments. The system is also validated in real-world scenarios.
1 INTRODUCTION
This paper presents a novel real-time 3D multiple hu-
man tracking system with the primary focus on ro-
bustness enhancement through machine learning. It
is a vision based system, capable of automatically de-
tecting human targets. After detecting the targets, the
trajectory of each detected target is tracked with a 3D
pose in real time. The system has the ability to resolve target occlusions in real time and maintain individual
trajectories, provided the targets do not leave the designated tracking area. The occlusion handling system
resolves mutual occlusions between targets and serves as an important tool for robust tracking when targets
mutually occlude each other in multiple camera views and the tracking scene contains many targets.
A machine learning based approach is introduced to train and classify the lighting conditions in the tracking
environment. Since lighting conditions strongly influence tracking robustness, this classification helps the
tracker make important decisions to maintain robustness. On the basis of this classification, intelligent
multi-modal fusion of two visual cues is performed: depending on the current situation, the optimal weights
with which the visual cues are fused to achieve the desired robustness are computed. This approach improves
the robustness of the tracking system in terms of self-adaptability to changing tracking conditions. Although
the machine learning based lighting condition classification is useful for multi-modal fusion, it also finds an
important use case in robust pre-processing of camera images, such as background segmentation: sudden
changes in lighting conditions can be detected and the background model updated accordingly. The background
model update is not trivial in the presence of foreground targets. This paper introduces an approach to identify
such situations and update the background model in the presence of foreground targets under drastic changes
in lighting conditions.
2 PRIOR ART
Several systems have been developed to track humans using multiple cameras in both uncalibrated and
stereo-calibrated fashion. (Santos and Morimoto, 2011) provide a systematic overview of approaches
(Eshel and Moses, 2008), (Fleuret et al., 2008), (Hu et al., 2006), (Kim and Davis, 2006), which use
uncalibrated cameras and homography to perform people tracking.
(Soto et al., 2009) present another multi-target tracking system using multiple cameras. Their approach
focuses on a self-configuring camera network consisting of pan-tilt cameras, which keep track of the targets
and adjust their parameters with respect to each other.
(Khan and Shah, 2008) present a slightly different approach to multi-view tracking of people. They combine
information from all views, projected back into each camera view, and use planar homographic occupancy
constraints for likelihood computation.
(Francois et al., 2006) combine target occupancy in the ground plane with colour and motion models to track
people in continuous video sequences. This approach requires heuristics to rank the individual targets to avoid
confusing them with one another.
(Focken and Stiefelhagen, 2002) introduce a system for tracking people in a smart room. They use a calibrated
camera system within a distributed framework, where each camera runs on a dedicated PC. The detected
foreground regions are sent to a tracking agent which computes the locations of people from the detected
regions. (Cai and Aggarwal, 1996) use grey-scale images from multiple fixed cameras to perform the tracking.
They use multivariate Gaussian models to estimate the closest matches of humans between consecutive image
frames. The system proposed by (Dockstader and Tekalp, 2001) is aimed at tracking human motion with a key
focus on occlusions. Each camera view is processed independently on an individual computer. Within a
Bayesian network, the observations from the different cameras are fused in order to resolve the independence
relations and confidence levels. An additional Kalman filter is used to update the 3D state estimates.
(Chang and Gong, 2001) present a multi-camera people tracking system using Bayesian-filtering-based
modality fusion. (Zhao et al., 2005) present another stereo-camera-based people tracking system: a real-time
system that tracks humans over a wide area, in which a multi-camera fusion module combines the tracks of a
single target in all views into a global track.
Considering the state of the art, the primary contribution of this work is robustness enhancement through
machine learning, which makes the system robust to drastic changes in lighting conditions and improves
tracker robustness through intelligent multi-modal fusion of two visual modalities. Another important
contribution is a robust occlusion handling system which can resolve multiple occlusions in real time. A novel
benchmarking approach is also presented, since, unlike for pedestrian detection systems, no unified evaluation
approach is available for existing systems.
3 SYSTEM ARCHITECTURE
The 3D multiple human tracking system uses visual information from multiple cameras in order to
automatically detect and track humans in real time. The detection process operates independently of the
tracking, allowing detection of new targets when they enter the tracking area¹ while the tracker is tracking
existing targets.
The target shape is modelled as a 3D rectangular box approximating the dimensions of a human. The
appearance model is generated in the form of a 2D joint probability histogram in all camera views. The target
dynamics are modelled using the constant white noise acceleration (CWNA) motion model. The tracker uses a
bank of SIR particle filters (Isard and Blake, 1998) operating on the 3D motion model, the appearance model
and optical flow. The particle filter provides the sequential prediction and update of the respective 3D state
s = (x, y, z). For real-time performance, a global particle set is maintained and distributed evenly among the
bank of particle filters.
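A minimal sketch of how such a shared particle budget could be distributed among per-target filters is given below; the class, its parameters (e.g. the 2000-particle budget and the Gaussian initialisation spread) and its structure are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class ParticleFilterBank:
    """Per-target SIR filters sharing a fixed global particle budget
    (illustrative sketch, not the authors' implementation)."""

    def __init__(self, global_particles=2000):
        self.global_particles = global_particles   # fixed budget for real-time operation
        self.filters = {}                          # target_id -> particle array (n, 3) for (x, y, z)

    def add_target(self, target_id, init_pose, spread=100.0):
        # A new target receives particles drawn around its detected 3D pose (in mm).
        n = self._per_filter_count(len(self.filters) + 1)
        self.filters[target_id] = np.asarray(init_pose) + spread * np.random.randn(n, 3)
        self._rebalance()

    def _per_filter_count(self, num_targets):
        return max(1, self.global_particles // num_targets)

    def _rebalance(self):
        # Redistribute the global particle set evenly among active filters so the
        # total computational load stays roughly constant as targets come and go.
        n = self._per_filter_count(len(self.filters))
        for tid, particles in self.filters.items():
            idx = np.random.choice(len(particles), size=n, replace=len(particles) < n)
            self.filters[tid] = particles[idx]
```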
3.1 Tracking Pipeline
Fig. 1 describes the complete pipeline of the track-
ing system. Each module is discussed in detail in the
subsections below.
Figure 1: Block diagram of the multiple human tracking system. Calibrated USB camera images (25-35 fps)
are pre-processed per camera (background segmentation, RGB to HSV conversion, TV-L1 optical flow) and
fed to automatic online target detection, an occlusion test, and a bank of per-target particle filters. Each filter
matches colour histogram and optical flow features in every camera view, fuses them through the intelligent
data fusion module driven by the machine learning based robustness enhancement, and outputs a filtered
average 3D pose (x, y, z) per target, managed by the target management module.
3.1.1 Pre-processing of Sensor Images
Each sensor image undergoes an initial background segmentation step, followed by RGB to HSV colour space
conversion and optical flow segmentation.
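The sketch below shows one possible way to assemble this pre-processing chain with OpenCV; it is an assumption of this edit, and Farneback dense flow stands in for the TV-L1 optical flow named in the pipeline.

```python
import cv2

# One camera stream; Farneback dense flow stands in for the TV-L1 flow of the pipeline.
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
prev_gray = None

def preprocess(frame_bgr):
    global prev_gray
    fg_mask = bg_subtractor.apply(frame_bgr)                       # background segmentation
    foreground = cv2.bitwise_and(frame_bgr, frame_bgr, mask=fg_mask)
    hsv = cv2.cvtColor(foreground, cv2.COLOR_BGR2HSV)              # colour space conversion
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    flow = None
    if prev_gray is not None:                                      # dense optical flow
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
    prev_gray = gray
    return fg_mask, hsv, flow
```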
¹ The predefined camera workspace where the cameras can view the targets.
Figure 2: The figure illustrates the occlusion test system.
3.1.2 Online Target Detection
This module automatically detects targets when they enter the tracking area by performing a scan along the
tracking floor area using the 3D box target model. A target is recorded if its occupancy exceeds a certain
threshold in two or more cameras. The target data consist of: 1. unique target ID, 2. initial 3D pose, 3. shape
data, 4. appearance data, 5. occlusion test information, 6. current 3D pose, 7. velocity.
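A hedged sketch of such a floor scan is shown below; the helper `project_box`, the grid of candidate positions and the thresholds are hypothetical placeholders, not part of the paper.

```python
import numpy as np

def detect_new_targets(fg_masks, project_box, grid_xy, box_dims,
                       occ_thresh=0.4, min_cams=2):
    # `project_box(cam_id, x, y, dims)` is a hypothetical helper returning a boolean
    # image mask of the 3D box projected into that camera view.
    detections = []
    for (x, y) in grid_xy:                       # candidate positions on the tracking floor
        supporting = 0
        for cam_id, fg in enumerate(fg_masks):
            region = project_box(cam_id, x, y, box_dims)
            if region.sum() == 0:                # box not visible in this view
                continue
            occupancy = (fg[region] > 0).mean()  # fraction of foreground pixels in the box
            if occupancy > occ_thresh:
                supporting += 1
        if supporting >= min_cams:               # record the target if 2+ cameras agree
            detections.append((x, y, 0.0))       # initial 3D pose on the floor plane
    return detections
```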
3.1.3 Occlusion Testing
This module determines whether a target is occluded by other targets in a particular camera view. This
information is essential during target detection and tracking. The occlusion regions are obtained by warping
the 3D pose of each target under consideration onto each camera image (id = 0, ..., M). Each target is defined
by a 3-dimensional container box comprising 8 vertices,

$$V_n(t) = \left\{ v_j \in \mathbb{R}^3 \mid j = 0, 1, \dots, 7 \right\} \quad (1)$$

where $v_j$ is the $j$-th vertex of the target shape model defined in Cartesian space for the state $s(t)$. These
vertices are projected onto each camera as follows:

$$S_n(t) = \left\{ r_j \in \mathbb{R}^2 \mid r_j = K\,[R \mid T]\, v_j,\; v_j \in V_n(t) \right\} \quad (2)$$

where $S_n(t)$ is the set of projected vertices of target $n$, and $K$, $R$ and $T$ describe the camera model.
Then, we define $d_n(t)$ as the Oriented Bounding Box (OBB) of $S_n(t)$, and

$$l_n(t) = \left\{ (x, y) \in \mathbb{R}^2 \mid (x, y) \in d_n(t) \right\} \quad (3)$$

The geometric meaning of $l_n(t)$ is the set of all pixels within the OBB $d_n(t)$. These pixels are used for the
occlusion test.
Fig. 2 illustrates the occlusion test system. The system considers all the targets, computes their occupancies
in each camera image and computes the Euclidean distance from the camera to each target. The bounding box
of the target farthest from the camera is computed and rendered first. Once all targets are rendered, an overlap
test is conducted between the rendered regions to check which targets are occluded.
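The sketch below illustrates this back-to-front rendering and overlap test; the `targets` and `cameras` interfaces (`pose`, `position`, `image_shape`, `render_obb`) are assumed for illustration only and are not the paper's API.

```python
import numpy as np

def occlusion_test(targets, cameras):
    # `targets` carry a target id `tid` and a 3D `pose`; `cameras` provide a position,
    # an image shape and a `render_obb` helper -- all assumed interfaces.
    occlusions = {t.tid: set() for t in targets}
    for cam in cameras:
        # Sort targets by Euclidean distance from the camera, farthest first.
        order = sorted(targets,
                       key=lambda t: np.linalg.norm(t.pose - cam.position),
                       reverse=True)
        depth_buffer = np.full(cam.image_shape, -1, dtype=int)   # -1 marks free pixels
        for t in order:
            region = cam.render_obb(t)            # boolean mask of the projected OBB
            overlapped = np.unique(depth_buffer[region])
            for other_tid in overlapped[overlapped >= 0]:
                # `other_tid` was rendered earlier (is farther), so it is occluded by t.
                occlusions[other_tid].add((cam.cid, t.tid))
            depth_buffer[region] = t.tid
    return occlusions
```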
3.1.4 Tracker
Each target is equipped with its own particle filter.
The visual modalities used are 2D colour histograms
and optical flow. The tracking pipeline is as follows:
Tracker Prediction: The particle filter generates several prior state hypotheses $s^i_t$ from the previous
distribution $(s^i, w^i)_{t-1}$ through a prediction model. In this system the constant velocity model was used,

$$s^i_t = s^i_{t-1} + \dot{s}^i_{t-1}\,\tau + \tfrac{1}{2}\, v^i_t\, \tau^2 \quad (4)$$

where $\dot{s}^i_{t-1}$ is the velocity and is constant, $v^i_t$ is a random acceleration, and $\tau$ is the
sampling interval.
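A one-function sketch of this prediction step, assuming NumPy arrays of particles and velocities and an illustrative acceleration noise level:

```python
import numpy as np

def predict(particles, velocities, tau, accel_std=50.0):
    # Constant-velocity propagation with white-noise acceleration (Eq. 4);
    # `accel_std` is an illustrative noise level, not a value from the paper.
    accel = accel_std * np.random.randn(*particles.shape)   # random acceleration v_t^i
    return particles + velocities * tau + 0.5 * accel * tau ** 2
```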
Likelihood: The likelihood is computed on the projected hypothesis in each camera view. The colour matching
is computed through a distance measure between the underlying and reference histograms using the
Bhattacharyya coefficient (Bhattacharyya, 1943):

$$B_m\!\left(q_i(s), q_i\right) = \left[ 1 - \sum_{n=1}^{N_q} \sqrt{q_i(n)\, q_i(s, n)} \right]^{\frac{1}{2}} \quad (5)$$

The colour likelihood is then evaluated under a Gaussian model on the overall residual,

$$P(z_{col} \mid s^i_t) \propto \exp\!\left( -\sum_{M} \log\!\left( B_i^2 / \lambda \right) \right) \quad (6)$$

with given covariance $\lambda$. Similarly, the optical flow distance measure is computed by comparing the
projected motion vector of the hypothesis and the underlying motion vectors in the hypothesis region,

$$F_m\!\left(f_i(s), f_i\right) = \left[ 1 - \sum_{n=1}^{N_q} \sqrt{f_i(n)\, f_i(s, n)} \right]^{\frac{1}{2}} \quad (7)$$

where $F_m$ is the resulting distance measure from the optical flow modality for an individual hypothesis. The
optical flow likelihood is computed as follows:

$$P(z_{flow} \mid s^i_t) \propto \exp\!\left( -\sum_{M} \log\!\left( F_i^2 / \lambda \right) \right) \quad (8)$$
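The snippet below sketches the histogram distance and the per-modality likelihood as reconstructed in Eqs. 5-8; the covariance value `lam` and the numerical guards are assumed placeholders, not values from the paper.

```python
import numpy as np

def bhattacharyya_distance(q_ref, q_hyp):
    # Distance between normalized reference and hypothesis histograms (Eqs. 5 and 7).
    bc = np.sum(np.sqrt(q_ref * q_hyp))          # Bhattacharyya coefficient
    return float(np.sqrt(max(1.0 - bc, 0.0)))

def modality_likelihood(distances, lam=0.1):
    # Likelihood over the per-camera residuals as reconstructed in Eqs. 6 and 8;
    # `lam` is an illustrative covariance value.
    d = np.maximum(np.asarray(distances, dtype=float), 1e-6)
    return float(np.exp(-np.sum(np.log(d ** 2 / lam))))
```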
Multi-modal Fusion: The intelligent multi-modal fusion module described in the next sections generates the
normalized weights $W_{col}$ and $W_{flow}$. The global likelihood for the hypothesis $s^i_t$ is then given by

$$P(z_{global} \mid s^i_t) = P(z_{col} \mid s^i_t)\, W_{col} + P(z_{flow} \mid s^i_t)\, W_{flow} \quad (9)$$
Computing the Estimated State: The average state $\bar{s}_t$,

$$\bar{s}_t = \frac{1}{N} \sum_i w^i_t\, s^i_t, \quad (10)$$

is computed and the three components $(x, y, z)$ are returned. In order to reduce the jitter in the output, the
average pose can be smoothed using an exponential filter.
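One possible implementation of the fusion (Eq. 9), the weighted mean state (Eq. 10) and the exponential output filter is sketched below; the smoothing factor `alpha` and the weight-normalization details are assumptions of this sketch.

```python
import numpy as np

def fuse_and_estimate(p_col, p_flow, w_col, w_flow, particles,
                      prev_estimate=None, alpha=0.6):
    # Per-particle fused likelihood (Eq. 9), weighted mean state (Eq. 10)
    # and an optional exponential output filter; `alpha` is an assumed smoothing factor.
    weights = p_col * w_col + p_flow * w_flow
    weights = weights / (weights.sum() + 1e-12)              # normalize
    estimate = (weights[:, None] * particles).sum(axis=0)    # weighted mean over (x, y, z)
    if prev_estimate is not None:
        estimate = alpha * estimate + (1.0 - alpha) * np.asarray(prev_estimate)
    return estimate, weights
```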
4 MACHINE LEARNING TO
ENHANCE ROBUSTNESS
This section introduces an approach to improve the
robustness of the system using machine learning tech-
niques.
4.1 Learning Lighting Conditions
Figure 3: Building blocks of the SVM trainer for lighting
conditions.
Multi-class support vector machines (Crammer and Singer, 2001) can be used as an important tool to learn the
different possible lighting conditions that could occur during tracking. The conditions are divided into classes,
where each class covers a wide variety of possible lighting conditions. The learning process generates a model
which can then be used for online classification of the lighting conditions. Using this information, the
contribution of the individual modalities to the multi-modal fusion module can be computed.
Fig. 3 illustrates the building blocks of the support vector machine based training module for lighting
conditions. It uses a large set of training samples in the form of images. Each training sample is processed to
obtain the training data. Once the training data are available, they are used by the SVM training module to
generate a model based on the classes into which the training data were grouped. The lighting types are
Insufficient lighting, Good lighting and Saturated lighting, representing the classes Bad and Good. The three
stages of training are (see Fig. 4):
1. RGB to HSV Colour Space Conversion.
2. Histogram Computation: an N-bin normalized histogram of the V channel is computed, representing the
intensity distribution.
3. Labelling: a class label is generated through automatic analysis or manual observation. The class label
together with the histogram data forms one training data sample for the multi-class support vector machine.
Around 4000 images of each class were used to generate the training data, giving a total training data set of
12000 samples. These samples were generated using camera images obtained from the real scene and from 3D
simulations of the entire scene, where the lighting conditions could be controlled.
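A minimal training sketch is shown below; scikit-learn's LinearSVC with the Crammer-Singer multi-class option is used as one available implementation of that formulation, and the bin count, label encoding and feature pipeline are illustrative assumptions.

```python
import cv2
import numpy as np
from sklearn.svm import LinearSVC

N_BINS = 32                                  # illustrative histogram resolution

def lighting_feature(image_bgr):
    # N-bin normalized V-channel histogram, i.e. the intensity distribution.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [2], None, [N_BINS], [0, 256]).flatten()
    return hist / max(hist.sum(), 1.0)

def train_lighting_model(images, labels):
    # Labels, e.g.: 0 = insufficient, 1 = good, 2 = saturated lighting.
    features = np.array([lighting_feature(img) for img in images])
    model = LinearSVC(multi_class="crammer_singer")   # multi-class SVM formulation
    model.fit(features, np.asarray(labels))
    return model
```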
4.2 Classification of Lighting
Conditions
Figure 4: Building blocks of the SVM classifier for lighting conditions.
Fig. 4 illustrates the building blocks of the support vector machine based lighting condition classifier, which
uses the model generated by the SVM trainer.
Figure 5: On-line classification of the lighting condition.
Red: bad. Green: good.
Fig. 5 illustrates the test conducted for the on-line classification of lighting conditions. The model is able to
classify the current lighting conditions in the camera views and associate them with their respective classes.
In this experiment, the SVM model was trained for three classes of lighting conditions.
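Continuing the training sketch above, the on-line classification of a camera frame could look as follows; this is a hedged illustration, not the authors' code.

```python
def classify_lighting(model, image_bgr):
    # Classify the lighting condition of the current camera frame.
    feature = lighting_feature(image_bgr).reshape(1, -1)
    return int(model.predict(feature)[0])    # e.g. 0 = insufficient, 1 = good, 2 = saturated
```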
4.3 Background Model Update in the
Presence of Foreground Targets
The background model must be updated when the lighting changes. This update is trivial in the case of an
empty scene, but becomes a complex task in the presence of foreground targets in the scene being tracked.
This paper proposes an approach to update the background model under changing lighting conditions while
foreground targets are present in the scene. It exploits the fact that, during the course of tracking, the targets
will move and expose the regions previously occluded by them.
Figure 6: Background model update in presence of fore-
ground targets.
The occluded regions can be included into the background model once they become visible due to target
motion. The assumption that the targets will move is valid because, if they do not move, the tracker only needs
to perform an extremely small local search to keep track of the target, which does not require information from
the background subtracter. Fig. 6 illustrates the Background Model Update Procedure (BMUP) under changing
lighting conditions and in the presence of foreground targets. The BMUP comprises three main parts:
Light Classifier: determines which class the lighting conditions in the current camera image belong to.

Light Change Detector: continuously reads the classification result from the Light Classifier, compares it with
the classification result of the previous instance, and thereby detects drastic changes in lighting conditions.

Background Model Updater: updates the background model when it is notified about a light change event by
the Light Change Detector (see the sketch after this list). It uses the target positions, the region sampler and the
status checker modules. If the number of targets $N = 0$, the background model is updated directly with the
image $I_{id}$. If $N > 0$, the occupancy region $L_{id}$ of each camera (id = 0, ..., M) is obtained from the
target positions. This is given by:

$$L(t) = \bigcup_{j=1}^{N} l_j(t) \quad (11)$$

where $l_j$ is given by Eq. 3. This is the area that cannot be included in the reference image for the new
background model, and it needs to be included once it is exposed. The current area for the reference image is
initialised as:

$$D(t_0) = \left( A \cap L(t_0) \right)^{c} \quad (12)$$

Then the background image is initialized,

$$I_{ref} = \left\{\, I(x, y) \mid (x, y) \in D(t_0) \,\right\} \quad (13)$$

where $A = \{ (x, y) \mid x = 1, 2, \dots, \mathrm{width},\ y = 1, 2, \dots, \mathrm{height} \}$.

The not-yet-updated regions are filled in over time as the targets move and expose the previously hidden
regions; this is computed in the following form:

$$h_L(t) = L(t-1) \setminus \left( L(t) \cap L(t-1) \right) \quad (14)$$

where $h_L$ denotes the newly exposed pixels in the current frame. The background image is then updated
using these pixels as follows:

$$I_{ref} = \left\{\, I(x, y) \mid (x, y) \in h_L(t) \,\right\} \quad (15)$$

Finally, the current area at time $t$ is updated as:

$$D(t) = D(t-1) \cup h_L(t) \quad (16)$$

$D(t)$ is updated until $|D(t)| = |A|$.
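A sketch of this update using boolean pixel masks for the sets $A$, $L$ and $D$ is given below; it is an illustration of Eqs. 11-16 under the stated assumptions, not the paper's implementation.

```python
import numpy as np

class BackgroundUpdater:
    """Background-model update with foreground targets (Eqs. 11-16),
    using boolean pixel masks for the sets L, D and A (illustrative sketch)."""

    def start(self, image, occupied_mask):
        # Eqs. 12/13: initialise the reference from everything outside the target OBBs.
        self.reference = image.copy()
        self.updated = ~occupied_mask                   # D(t0) = complement of L(t0) in A
        self.prev_occupied = occupied_mask.copy()       # L(t0)

    def step(self, image, occupied_mask):
        # Eq. 14: pixels occluded in the previous frame but exposed now.
        exposed = self.prev_occupied & ~occupied_mask
        new_pixels = exposed & ~self.updated
        self.reference[new_pixels] = image[new_pixels]  # Eq. 15: copy exposed pixels
        self.updated |= new_pixels                      # Eq. 16: grow D(t)
        self.prev_occupied = occupied_mask.copy()
        return bool(self.updated.all())                 # True once |D(t)| = |A|
```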
When the background update process is initiated, the tracker suspends the new target detection process.
Further, instead of generating the HSV image from the background-segmented image, the tracker uses a mask
to highlight only the local regions surrounding each target. Once the background model update is complete,
the tracker activates the background subtraction module in the target detection and initial pre-processing
phases again. Fig. 7 illustrates the process. See video: www.youtube.com/watch?v=LpnUkf2GEQ4
Figure 7: Background update in presence of foreground tar-
gets.
4.4 Modality Weight Generation for
Multi-modal Fusion
Figure 8: Intelligent fusion module to generate weights for
the individual modalities through scene analysis.
Fig. 8 describes the module performing the task of generating the weights for the individual visual modalities
through scene analysis. This module consists of two scene analysis units, each analysing the usability of one
modality in the current scene. The usability of the modalities can be represented in the form of classes. The
class categories can be divided into two simple types, namely Bad and Good, or more if needed.
The optical flow usability class is considered to be Bad when:
- the target is stationary or moving with a velocity below a certain threshold $V_{st}$;
- the target is closer than a defined threshold $d_{min}$ to another target and the absolute difference of their
optical flow direction components is below a certain threshold $\theta_{min}$.
On the other hand, the optical flow usability class is considered to be Good when:
- the target is moving with a velocity higher than the defined threshold $V_{st}$ and at a distance greater than
$d_{min}$ from all other targets;
- it has a velocity component higher than $V_{st}$ and the absolute difference of its optical flow direction
component with respect to the other targets is greater than $\theta_{min}$.
Once the usability classes of the respective visual
modalities are known, this information is supplied to
the modality weight generator. The rule based fusion
technique is constructed through a fixed set of rules
defined by the user. These rules specify the combina-
tion of normalized weights to be assigned to the two
modalities for each possible combination of classes.
Fig. 8 illustrates a simple fusion rule data-bank for
the binary classes consisting of Bad and Good labels.
As mentioned above, these classes can be extended to
a wider range along with a more dense rule data-bank.
Once the individual weights for each modality are obtained, they are fused in order to obtain a global
likelihood. The fusion operation is performed for each hypothesis generated by the particle filter and for each
camera view. When both modalities are unsuitable for tracking, the tracker declares a target loss and
instantiates the target recovery mechanism in the form of re-detection. The mathematical representation of the
complete fusion procedure is formulated below:
$$U^{colour}_{i} = L_{svm}\!\left( I_{id} \right), \qquad U^{flow}_{i} = A_{flow}\!\left( T_{tid} \right) \quad (17)$$

$$\left( W_{colour},\, W_{flow} \right) = R\!\left( U^{colour}_{i},\, U^{flow}_{i,tid} \right) \quad (18)$$

where $U^{colour}_{i}$ is the usability class for the colour modality in the $i$-th hypothesis, $L_{svm}$ is the
machine learning based lighting condition classifier and $I_{id}$ is the current image from the camera.
$U^{flow}_{i,tid}$ is the usability class for the optical flow modality in the $i$-th hypothesis for the target with
id $tid$. $A_{flow}$ is the function which performs the optical flow usability check on the motion parameters of
the current target, given by $T_{tid}$. $(W_{colour}, W_{flow})$ are the unique weights for the two modalities
obtained from the fusion rule data-bank $R$. Finally, $L^{filter}_{h}$ is the global likelihood.
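In practice the rule data-bank $R$ can be as simple as a lookup table over the usability classes, as in the sketch below; the weight values and the thresholds behind the classes ($V_{st}$, $d_{min}$, $\theta_{min}$) are illustrative assumptions, not values from the paper.

```python
# Illustrative rule data-bank R over binary usability classes; the weight values
# and the thresholds behind the classes (V_st, d_min, theta_min) are assumptions.
FUSION_RULES = {
    ("Good", "Good"): (0.5, 0.5),
    ("Good", "Bad"):  (0.9, 0.1),   # rely on colour when optical flow is unusable
    ("Bad",  "Good"): (0.1, 0.9),   # rely on optical flow under bad lighting
    ("Bad",  "Bad"):  (0.0, 0.0),   # both unusable: declare target loss and re-detect
}

def modality_weights(colour_class, flow_class):
    # Eq. 18: map the two usability classes to normalized (W_colour, W_flow).
    w_col, w_flow = FUSION_RULES[(colour_class, flow_class)]
    total = w_col + w_flow
    return (w_col / total, w_flow / total) if total > 0 else (0.0, 0.0)
```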
5 EXPERIMENTS
In this section the experiments are discussed.
5.1 Ground Truth Generation
No unified benchmarking and quantitative analysis framework exists for stereo multiple human trackers.
Different systems test their methods in different ways, making quantitative comparisons difficult. Ground truth
generation methods are either manual, semi-automatic or automatic (Dollár et al., 2009). They cannot
guarantee accuracy since they themselves have a certain tolerance. In order to generate ground truth without
inherent errors, our test environment was modelled completely in 3D using Blender (Roosendaal and Selleri,
2004). The cameras were reproduced with exact intrinsic and extrinsic parameters. The light sources were
modelled similarly to the ones used in the lab environment. The humans were modelled using simple models.
For each target, the motion trajectories can be planned and simulated, and the animation can be rendered from
the perspective of the cameras. The simulated trajectory data of each target were extracted through a Python
script within Blender, as sketched below.
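A minimal sketch of such a trajectory-extraction script using Blender's Python API is given below; the object naming convention, output path and frame range are placeholders, not the script used in the paper.

```python
import bpy
import csv

scene = bpy.context.scene
targets = [obj for obj in bpy.data.objects if obj.name.startswith("Target")]

with open("/tmp/ground_truth.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["frame", "target", "x", "y", "z"])
    for frame in range(scene.frame_start, scene.frame_end + 1):
        scene.frame_set(frame)                            # advance the animation
        for obj in targets:
            x, y, z = obj.matrix_world.to_translation()   # world-space position
            writer.writerow([frame, obj.name, x, y, z])
```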
Figure 9: The lab environment modelled in 3D and extrac-
tion of simulated trajectory data.
Figure 10: Illustration of the tracking system with 5 targets in the scene moving close to each other and two
targets having similar appearance. The plots show real versus tracked X and Y positions and the position
errors per target over 600 frames; error standard deviations: Target 1: 89.52/98.34 mm (x/y), Target 2:
80.07/81.21 mm, Target 3: 97.02/99.81 mm, Target 4: 87.01/71.95 mm, Target 5: 68.72/85.37 mm.
Figure 11: Experiment results in real world environment
with 3 targets and the motion trajectories generated.
5.2 Experimental Validations
Fig. 10 illustrates the first test with 5 targets. It shows the plots of the tracked trajectories along with the
actual trajectories obtained from the ground truth generator, together with the error computed in the X and Y
directions. The standard deviation of the error stays below 10 cm even with an increased number of
simultaneous mutual occlusions in multiple cameras. See video: www.youtube.com/watch?v=-Y-sZ2g53fM.
Fig. 11 illustrates a real experiment in the lab. The red box represents the desired trajectory, and the actual
tracked trajectories are shown. See video: www.youtube.com/watch?v=wePVQ7cXB9c
Fig. 12 and the video www.youtube.com/watch?v=u6OFTVWO-qg present an experiment with two targets
having exactly the same appearance. There is a sharp overshoot in the error in the dominant direction of
motion when the targets move in the same direction, but the tracker recovers within a few frames through the
occlusion handling module. The average error remains lower than 10 cm.
Fig. 13 validates the use of intelligent multi-modal fusion in good and bad lighting.
Figure 12: Two targets with exactly similar appearance and very close motion in the same direction. The plots
show real versus tracked X and Y positions and the position errors over 1000 frames; error standard
deviations: Target 1: 216.46/68.33 mm (x/y), Target 2: 270.44/77.66 mm.
Figure 13: Experiment results for multi-modal fusion: feature matching distances and likelihoods over time in
good and bad lighting, with colour histograms only and with fusion of colour histograms and optical flow.
The feature matching distance of the colour histogram worsens in the bad lighting condition, while the optical
flow distance remains fairly constant. The likelihood without fusion degrades in bad lighting conditions, while
the likelihood with multi-modal fusion remains fairly constant. This indicates that under bad lighting
conditions the multi-modal fusion of colour and optical flow ensures robust and stable tracking results.
Fig. 14 illustrates the experiment conducted to
validate the performance under drastic changes in
lighting conditions in a real scenario. See video:
www.youtube.com/watch?v=yZHCXgdDf14
Fig. 15 shows a real application involving an industrial robot. Human-robot-object interaction was achieved
under drastic changes in light.
Figure 14: Experiment results in real world environment
under drastically changing lighting conditions.
Figure 15: Experiment with real robot.
The environment was dynamic, as the robot was moving in the field of view of the cameras, thereby
influencing the background subtraction and occluding humans in different camera views while moving. See
video: www.youtube.com/watch?v=Pn7kyhlEkEc&list=UUCgI4kT12FB7GjphbrIKAQ&index=2&feature=plppvideo
6 CONCLUSIONS AND FUTURE
WORK
To conclude, this paper proposed and validated a robust multiple human tracker. The primary contributions
are: a vision based real-time 3D multiple human tracking system; a simultaneous multiple occlusion handling
module; a machine learning based model trained to classify the quality of lighting conditions; a method for
updating the background model in the presence of foreground targets; and an analysis of each visual modality
for intelligent fusion. Furthermore, a novel approach for generating zero-error ground truth data for evaluation
and validation of the tracker was introduced, and experiments were conducted to verify different aspects of the
system.
In future work, camera placement will be improved for better stereo coverage. The target model will be
improved to resemble the human shape more closely. The detection and occlusion handling modules will be
improved by adding a classification engine, as in pedestrian detection systems, to confirm the presence of a
human. Additional visual modalities will be introduced into the fusion engine.
REFERENCES
Bhattacharyya, A. (1943). On a measure of divergence
between two statistical populations defined by their
probability distributions. Bulletin of the Calcutta
Mathematical Society, 35:99–109.
Cai, Q. and Aggarwal, J. (1996). Tracking human mo-
tion using multiple cameras. In Pattern Recognition,
1996., Proceedings of the 13th International Confer-
ence on, volume 3, pages 68–72. IEEE.
Chang, T. and Gong, S. (2001). Tracking multiple people
with a multi-camera system. womot, page 0019.
Crammer, K. and Singer, Y. (2001). On the algorithmic
implementation of multi-class svms. In JMLR.
Dockstader, S. and Tekalp, A. (2001). Multiple camera
tracking of interacting and occluded human motion.
Proceedings of the IEEE, 89(10):1441–1455.
Dollár, P., Wojek, C., Schiele, B., and Perona, P. (2009).
Pedestrian detection: A benchmark. In CVPR.
Eshel, R. and Moses, Y. (2008). Homography based multi-
ple camera detection and tracking of people in a dense
crowd. In CVPR 2008. IEEE.
Fleuret, F., Berclaz, J., Lengagne, R., and Fua, P. (2008).
Multicamera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 30(2):267–282.
Focken, D. and Stiefelhagen, R. (2002). Towards vision-
based 3-d people tracking in a smart room. In Multi-
modal Interfaces, 2002. IEEE Computer Society.
Francois, J. B., Berclaz, J., Fleuret, F., and Fua, P. (2006).
Robust people tracking with global trajectory opti-
mization. In In Conference on Computer Vision and
Pattern Recognition, pages 744–750.
Hu, W., Hu, M., Zhou, X., Tan, T., Lou, J., and Maybank, S.
(2006). Principal axis-based correspondence between multiple cameras for people tracking. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 28(4):663–671.
Isard, M. and Blake, A. (1998). Condensation – conditional
density propagation for visual tracking. International
Journal of Computer Vision (IJCV), 29(1):5–28.
Khan, S. and Shah, M. (2008). Tracking multiple occluding
people by localizing on multiple scene planes. IEEE
transactions on pattern analysis and machine intelli-
gence, pages 505–519.
Kim, K. and Davis, L. (2006). Multi-camera tracking and
segmentation of occluded people on ground plane us-
ing search-guided particle filtering. Computer Vision–
ECCV 2006, pages 98–109.
Roosendaal, T. and Selleri, S. (2004). The Official Blender
2.3 Guide: Free 3D Creation Suite for Modeling, An-
imation, and Rendering. No Starch Press.
Santos, T. T. and Morimoto, C. H. (2011). Multiple camera
people detection and tracking using support integra-
tion. Pattern Recognition Letters, 32(1):47–55.
Soto, C., Song, B., and Roy-Chowdhury, A. (2009). Dis-
tributed multi-target tracking in a self-configuring
camera network. In CVPR 2009. IEEE.
Zhao, T., Aggarwal, M., Kumar, R., and Sawhney, H.
(2005). Real-time wide area multi-camera stereo
tracking. In Computer Vision and Pattern Recogni-
tion, 2005. CVPR 2005. IEEE Computer Society.