A Graph-based MAP Solution for Multi-person Tracking using

Multi-camera Systems

Xiaoyan Jiang, Marco K

orner, Daniel Haase and Joachim Denzler

Computer Vision Group, Friedrich Schiller University of Jena, Jena, Germany

Keywords:

Multi-person tracking, Multi-camera, Min-cost, MAP.

Abstract:

Accurate multi-person tracking under complex conditions is an important topic in computer vision with various

application scenarios such as visual surveillance. Taking into account the difﬁculties caused by 2D occlusions,

missing detections, and false positives, we propose a two-stage graph-based object tracking-by-detection

approach using multiple calibrated cameras. Firstly, data association is formulated into a maximum a posteriori

(MAP) problem. After transformation, we show that this single MAP problem is equivalent of ﬁnding min-cost

paths in a two-stage directed acyclic graph. The ﬁrst graph aims to extract an optimal set of tracklets based on

the hypotheses on the ground plane by using both 2D appearance feature and 3D spatial distances. Subsequently,

the tracklets are linked into complete tracks in the second graph utilizing spatial and temporal distances.

This results in a global optimization over all the 2D detections obtained from multiple cameras. Finally, the

experimental results on three difﬁcult sequences of the PETS’09 dataset with comparison to the state-of-the-art

methods show the precision and consistency of our approach.

1 INTRODUCTION

Automatic initialization and tracking of multiple, po-

tentially changing number of persons in real situa-

tions are a classic but challenging topic in computer

vision. Along with the development of object detec-

tion approaches, the tracking-by-detection framework

is adopted widely for multi-object tracking scenarios.

Given discrete detections in separate time steps, the

task afterwards is to assign the right detections to indi-

vidual targets. Hence, data association is a key issue

for multi-object tracking.

Massive works with regard to single-camera based

multi-object tracking have shown the limitation of

tracking performance (Breitenstein et al., 2011). This

is mainly due to the large portion of false positive and

missing detections caused by severe occlusions or bad

lightness conditions. By contrast, for tracking using

multiple cameras, one view can change information

with others and compensate the data scarcity. How-

ever, data association from multiple cameras occurs

an extra difﬁculty known as ”ghost effect” (Wu et al.,

2012) caused by triangulation of objects in 3D space.

Accordingly, we propose a global optimization ap-

proach using two graphs for multi-person tracking in

multiple calibrated camera systems. To simplify the

calculation, we adopt hypotheses on the ground plane

reconstructed from 2D detections from all available

views. Afterwards, track fragments (Tracklets) are

extracted by ﬁnding the min-cost paths in a so-called

hypothesis graph. Finally, complete tracks are gen-

erated by linking those tracklets through a so-called

tracklet graph.

1.1 Related Work

There are much effort made for efﬁcient data associ-

ation for multi-object tracking in previous years. In

many works, tracking-by-detection was deﬁned as a

maximum a posteriori (MAP) problem (Zhang et al.,

2008)(Berclaz et al., 2011)(Hofmann et al., 2013),

which aims to ﬁnd the optimal set of trajectories with

maximum posteriori probabilities given all the obser-

vations from every video frame (Xing et al., 2009).

Step by step assignment such as particle ﬁltering

(Breitenstein et al., 2011)(Jiang et al., 2012), kalman

ﬁlter (Satoh et al., 2004) propagated the object state

vector according to a given motion model and per-

formed data association between detections and tracks

(Breitenstein et al., 2011) when a new frame came.

However, during the period when the observation

model that was normally deﬁned as local features

changed much due to occlusion or illumination, it was

easy for the tracker to make wrong decisions or lose

343

Jiang X., Körner M., Haase D. and Denzler J..

A Graph-based MAP Solution for Multi-person Tracking using Multi-camera Systems.

DOI: 10.5220/0004690103430350

In Proceedings of the 9th International Conference on Computer Vision Theory and Applications (VISAPP-2014), pages 343-350

ISBN: 978-989-758-009-3

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

the target totally. Approaches such as Hungarian algo-

rithm (Huang et al., 2008), bipartite graph matching

(Bredereck et al., 2012), and energy minimization (An-

driyenko and Schindler, 2011) tried to ﬁnd local max-

ima or minima of matching, while global information

was not considered.

Recently, many researches concerning global op-

timization schemes based on ﬂow networks (Zhang

et al., 2008)(Wu et al., 2011)(Hofmann et al., 2013)

and graphs (Leal-Taix

e et al., 2012)(Collins, 2012)

have been widely presented in the literature. They

converted the multi-object tracking problem to the

searching of multiple min-cost paths in the network

or the graph. Generally, each node of the network or

the graph represents a single object’s hypothesis of

state with a speciﬁc time stamp. Trajectories of the

targets are then obtained by traversing the found min-

cost paths from the sink node to the source node (Jiang

et al., 2013). In (Berclaz et al., 2011), 2D detections

were ﬁrstly mapped into a probabilistic occupancy

map on the ground plane. Afterwards, they built a

ﬂow network based on this probabilistic occupancy

map (Fleuret et al., 2008). Tracking was ultimately

formulated to a Integer Programming Problem with

the basic restriction that the ﬂows arriving at any posi-

tion on the probabilistic occupancy map equal to the

ones departing from this location. Authors in (Leal-

Taix

e et al., 2012) constructed local graphs for each

view and then considered every pair of cameras as

a possible unit to build a higher level graph. How-

ever, this was considered to be not intuitive (Hofmann

et al., 2013). In (Wu et al., 2011), tracks were obtained

in each view by track graphs. Subsequently, the set

cover algorithm was implemented for linking among

track segments from multiple views. Henriques et al.,

modeled the merge and split activities during tracking

and removed the common restriction used in graph

based approaches that one node in the graph belongs

to one target (one-one match) at most (Henriques et al.,

2011).

In this paper, we argue that the one-one match is

necessary for multi-object tracking in multi-camera

systems. Data association among multiple cameras is

globally modeled by a two-stage graph. Additionally,

we incorporate local features to the cost function in the

ﬁrst graph. Finally, we evaluate the proposed paradigm

on 3 sequences with different difﬁculties of tracking

from the PETS’09 dataset. The experimental results

show the accuracy and consistency of the approach, as

well as the cheap computational time it needs.

1.2 Outline and Contributions

Our two-stage graph-based multi-person tracking us-

reconstructed 3D detections

3D Hypotheses

Tracklets Extraction

Tracklets Linking

Figure 1: Diagram of our approach: hypothesis generation,

tracklet extraction, and tracklet linking. The left part is

the generation of

: each

has a 3D detection on the

ground plane (dashed grids). Those 3D detections whose

back projections in different image views are nearest to the

same detections in corresponding views are considered to be

identical objects.

therefore consists of the 3D detections

that have the minimum average back projection errors (solid

grids).

ing multi-camera systems approach is shown as Fig. 1

(detailed in Sec. 2). The approach contains three key

components: hypothesis generation, tracklet extrac-

tion, and tracklet linking. By contrast with the indi-

cated studies and the work in (Jiang et al., 2013) where

a two-stage graph was used as well, our contributions

are as follows:

1. We formulate the multi-object tracking problem

into two MAP problems and solve each MAP problem

by an individual graph. The graphs are conducted on

the ground plane directly, which is more straightfor-

ward than the methods that construct local graphs for

each view.

2. Local features such as appearance and size in im-

age scale of the person are integrated to the assignment

of costs for edges in the hypothesis graph.

The rest of the paper is structured as follows: for-

mulation of two MAP problems is discussed in Sec. 2.

Sec. 3 presents the details of mapping the two MAP

problems into a two-stage graph. Subsequently, quali-

tative and quantitative results on the PETS’09 dataset

with a comparison to the state-of-the-art algorithms

are stated in Sec. 4. Finally, Sec. 5 summarizes the

paper and gives an outlook of the future work.

2 TRACKING FORMULATION

After applying a detector to video frames from each

camera view, 2D detections are obtained as input to

tracking approaches. Normally, a 2D observation is

formulated as

= {x

}

indicating the position

, size

and time index

of detection

in camera

(Felzenszwalb et al., 2010). Assume that the total

number of cameras in the system is

∈ {1,··· ,N}

we deﬁne

= {o

}

to be the set of observations

from camera

at time

1:t

= {O

,··· ,O

}

is the

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

344

set of observations until time

in camera

. Therefore,

known the set of observations for all cameras

1:t

,··· ,O

1:t

}

, the trajectories of the targets until

time

, that is

1:t

, are searched in this huge observation

space. One step further, for multi-object tracking in

multi-camera systems in this work, data association is

a MAP problem based on 3D hypotheses H

1:t

∗

1:t

= argmax

1:t

P(T

1:t

) , (1)

1:t

⊆ R

1:t

= R (∀ O

1:t

∈ O

1:t

) , (2)

with

∗

1:t

is the set of optimal trajectories until time

1:t

is a subspace of all the possible reconstructed 3D

detections

1:t

on ground plane accompanying local

features in visible 2D images.

R (·)

is the reconstruc-

tion function. Additionally, 3D detections belong to

identical objects are integrated into single hypotheses.

The process of generation of

is shown on the left

part of Fig. 1.

Now

1:t

is a subset of

1:t

. Recursive searching

for every possible combination is impractical because

of the computational complexity. We intend to use a

global optimization strategy to ﬁnd the possible asso-

ciations that have the highest posteriori probabilities

over the whole sequences with T frames.

Denote

= {ζ

} ∈ ϒ

is a tracklet from

frame

to frame

with a set of connected 3D lo-

cations

= (p

,··· , p

)

and refer to (Xing et al.,

2009), ﬁnding the optimal trajectories for multiple

targets can be written as:

∗

1:T

= argmax

1:T

P(T

1:T

|ϒ,H

1:T

) (3)

= argmax

1:T

P(T

1:T

|ϒ) ·P(ϒ|H

1:T

) (4)

= argmax

1:T

∏

P(T

1:T

|τ

)

∏

P(τ

1:T

). (5)

Equ. 5 is true because of the independence of

individual tracklets.

We also follow the non-overlap constraint:

∩ τ

0, ∀k 6= l, ∀τ

,τ

∈ ϒ . (6)

From Equ. 5, we can see that the MAP problem is

separated into two MAP sections: optimal tracklets

extraction and optimal tracklets linking. The two MAP

problems are equivalent of searching for the paths

with minimum costs in individual graphs accordingly,

which is discussed in the following Sec. 3.

Figure 2: An exemplary hypothesis graph consists of 3 time

steps. Dashed lines from and to virtual sink node and source

node allow every possible entering and exiting position re-

spectively. Here, we also consider missing detections by

allowing edges composed by hypotheses with the time dif-

ference larger than one, which are shown by purple lines.

3 MAPPING TO A TWO-STAGE

GRAPH

In the ﬁrst-stage graph, all available tracklets are ex-

tracted without knowing the number of objects as a

priori (Leal-Taix

e et al., 2012) or assuming entrance

and exit regions (Hofmann et al., 2013). In the second-

stage graph, tracklets are linked using temporal and

spatial distances to form complete tracks. Finally, the

trajectories are reﬁned to generate a unique trajectory

per target.

3.1 Tracklet Extraction

We deﬁne a direct acyclic graph

G = (V,E,c)

, which

is called hypothesis graph, to extract tracklets from

1:T

. A vertice

v ∈ V

represents one hypothesis

= {p

i,1:n(c

)

i,1:n(c

)

} ∈ H

∈ N

that contains

3D reconstructed position on the ground plane

, ap-

pearance features and sizes in 2D images from number

n(c

)

visible camera views with time stamp

. We in-

troduce one virtual vertice of source

source

along with

one virtual vertice of sink

sink

as shown in Fig. 2 to

start and terminate paths separately. Since the num-

ber of objects varies time by time, each vertice

v ∈ V

has probabilities incoming to

sink

and outgoing from

source

that are proportional to the corresponding frame

index t:

e=(v

source

,v)

= 1 −t/T , (7)

e=(v,v

sink

)

= t/T , (8)

where T is the total number frames.

Denote

∆t

i, j

= {v

t+∆t

} ∈ E

∆t ∈ [1,t

max

]

is the

number of frame differences. The transition probabil-

AGraph-basedMAPSolutionforMulti-personTrackingusingMulti-cameraSystems

345

ity P

∆t

i, j

assigned to e

∆t

i, j

is deﬁned as:

∆t

i, j

(

pena

· P

spat

· P

, d

spat

< th

vel

∞ , else,

(9)

where

pena

= ρ

∆t

,ρ < 1

is the penalty for skipping

frames of missing detections. And

spat

= 1 −d

spat

/th

vel

(10)

computes the spatial afﬁnity in 3D world coordinate

system with a maximum deﬁned motion

vel

. And

spat

= kp

− p

t+∆t

is the Euclidean distance.

The average probability for size afﬁnity from all

visible views is

= (

∑

min(s

)

max(s

)

)/n(c

) . (11)

The average appearance similarity from all visible

views is

= (

∑

sim(a

))/n(c

) . (12)

We use Bhattacharyya coefﬁcient to evaluate the simi-

larity of RGB histograms of the objects as appearance

features.

After the conﬁguration and refer to (Zhang et al.,

2008), we can convert one of the MAP problem of

extracting optimal tracklets

∗

in Equ. 5 to k-shortest

paths algorithm conducted on

through negative log-

arithm transformation:

∗

= argmax

∏

P(τ

1:T

) (13)

= argmin

∑

−logP(τ

1:T

) (14)

= argmin

∑

i, j

−logP

∆t

i, j

(15)

+ (−logP

e={v

source

,v}

) +(−log P

e={v,v

sink

}

) (16)

= argmin

(

∑

i, j

+ c

) . (17)

Thus, the costs are naturally deﬁned as:

i, j

= −logP

∆t

i, j

(18)

= −logP

e=(v

source

,v)

(19)

= −logP

e=(v,v

sink

)

. (20)

We iteratively employ Dijkstra’s shortest path algo-

rithm (Dijkstra, 1959) to ﬁnd a number of relatively

min-cost paths

P = (v

sink

P ,0

,··· ,v

P ,1

source

)

. De-

pends on the non-overlap constraint between tracklets,

costs of edges to and from vertices in found

are

set to be inﬁnite. Afterwards, tracklets are obtained

by traversing each

from

sink

source

with exiting

frame index and entering frame index.

Figure 3: An exemplary tracklet graph consists of four ver-

tices (Jiang et al., 2013). Dashed lines from and to virtual

sink vertice and source vertice allow every tracklet to start

and terminate a ﬁnal track respectively. Edges only exist

between nodes with temporally consistent order.

3.2 Tracklet Linking

Since tracklets are fragments of ﬁnal trajectories, we

deﬁne another directed acyclic tracklet graph

)

to globally choose the optimal combina-

tions of tracklets. We again deﬁne a virtual source

vertice

source

and a virtual sink vertice

source

to make

all the paths found in

begin and stop by them. Each

∈ V

represents a tracklet

= {ζ

}

with

starting and terminating frame indexes and 3D loca-

tions in between. Similarly, the entrance cost and exit

cost for v

are:

= t

· c

, (21)

= (T − t

) ·c

. (22)

Here,

is the penalty cost which is manually set.

From this we can see that the tracklets start from the

ﬁrst frame and terminate in the last frame of the video

have a lower entrance/exit cost.

The cost

(k, l)

for

(k, l) = (v

) ∈ E

is de-

ﬁned as follows:

k,l

(

k,l

spat

· d

k,l

temp

0 < t

−t

< th

temp

∞ else .

(23)

The spatial distance

k,l

spat

= kζ

− ζ

(24)

is the Euclidean distance between the corresponding

terminating and starting points of τ

,τ

, and

k,l

temp

= t

−t

(25)

is the temporal distance between two tracklets. There-

fore, only the pairs of tracklets who have no temporal

overlap and the frame differences are smaller than a

certain threshold

temp

have weighted edges. Other-

wise, they are assigned by inﬁnite costs to invalid the

speciﬁc edges. The exemplary ﬁgure of

is indicated

in Fig. 3.

Similar to Subsect. 3.1, the second MAP problem

of ﬁnding optimal trajectories from tracklets in Equ. 5

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

346

can be converted into searching for min-cost paths

conducted on G

∗

1:T

= argmin

1:T

∏

P(T

1:T

|τ

) (26)

= argmin

1:T

(

∑

k,l

+ c

) . (27)

We again iteratively employ the Dijkstra’s shortest

path algorithm (Dijkstra, 1959) on

. The ﬁnal tra-

jectories are consequently obtained by traversing the

linked tracklets.

3.3 Tracking Reﬁnement

After both stages of tracklet extraction and tracklet

linking, tracklets/tracks that belong to identical objects

are recursively merged according to their beginning

and ﬁnishing frame indexes. A pair of tracklets/tracks

is judged to be identical objects when their spatial Eu-

clidean distance in the same frame index is closer than

a threshold

mer

for a minimum number of frames

mer

. And

mer

is proportional to the minimum length

of the pair of tracklets/tracks considered.

The missed positions within tracklets in the ﬁrst

stage and the gaps between linked tracklets in the sec-

ond stage are both linearly interpolated.

4 EXPERIMENTS

4.1 Dataset

We use the public available dataset of PETS’09 (Ferry-

man and Shahrokni, 2009) to evaluate the performance

of our multi-object tracking using multi-camera sys-

tems approach. The dataset has three object-tracking

sequences with different levels of people density:

sparse (S2.L1), medium (S2.L2), and high (S2.L3).

They were recorded by different number of cameras

located in different positions. These videos are very

challenging since they contain many different types of

occlusion, for example inter-object occlusion, object-

obstacle occlusion (people are occluded by a light pole

with a big sign). The resolution in the ﬁrst view is

768 ×576

. The frame rate is 7

, for which persons

can move very fast between neighboring frames. Ad-

ditionally, we report the detection results as well for

comparison, since the tracking-by-detection paradigm

depends much on the detection performance.

4.2 Implementation

We employ the deformable part models based object

detector (Felzenszwalb et al., 2010) to get 2D detec-

tions in video frames of each camera. For PETS’09

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

0 10 20 30 40 50 60

Individual Tracking Precision − PETS’09 S2.L1

Index of Person in GroundTruth

2D Euclidean Error (pixels)

Figure 4: Boxplots of the pixel wise precisions for each

person according to the ground truth data on PETS’09 S2.L1

sequence. Most of the targets are tracked and have a local-

ization error around 10 pixels.

S2.L1, we utilize 6 out of 7 recorded cameras as the

fourth view suffers from frame rate instability (Ferry-

man and Shahrokni, 2009). The middle bottom points

of the 2D detections are adopted for the reconstruc-

tion of 3D detections on the ground plane which are

integrated into hypotheses detailed in Sec. 2.

Parameters.

The parameters conﬁgured in graph

based approaches affect a lot on the ﬁnal results. In our

experiments, we set

mer

1000

. Thus, positions that

are spatially near than

in 3D space are considered

to be identical. The cost for entering or exiting a path

is set to

60000

, which is larger than the transition

cost to encourage the system to link all these vertices

which are associated. Furthermore, we conﬁgured

0.95

to make the system penalize much more to the

edges that link vertices across more number of frames.

The fastest speed a person can walk is limited to be

3.5

/s in the dataset, therefore th

vel

= 500.

For tracking in sequence S2.L1, we extract

100

tracklets and set the maximum frame gap to

max

= 4

since a relatively higher or lower numbers reduce the

performance. Eventually,

tracks are generated from

these tracklets. Also, we select

1000

tracklets, set

max

= 10

120

and

tracks are extracted on the video

S2.L2 and S2.L3 respectively.

4.3 Evaluation

For evaluation, we employ the multiple object track-

ing precision (MOTP) and multiple object tracking

accuracy (MOTA) (Bernardin and Stiefelhagen, 2008)

AGraph-basedMAPSolutionforMulti-personTrackingusingMulti-cameraSystems

347

Table 1: Quantitative results on PETS’09 S2.L1, S2.L2 and S2.L3 dataset. We compared MOTP, MOTA, False Positive Rate,

Miss Rate and Id switches with particle ﬁlter based tracking-by-detection (Jiang et al., 2012) (Breitenstein et al., 2011), Energy

minimization (Andriyenko and Schindler, 2011), k-shortest paths (Berclaz et al., 2011), and Probabilistic tracking (J. Yang and

Teizer, 2009).

Sequence Method MOTP MOTA False Pos. Rate Miss Rate Id switches

PETS’09 S2.L1

Jiang et al., (Jiang et al., 2012) 78.8% 60.8% n/a n/a n/a

Yang et al., (J. Yang and Teizer, 2009) 53.8% 75.9 % n/a n/a n/a

Breitenstein et al., (Breitenstein et al., 2011) 56.3% 79.7% n/a n/a n/a

Berclaz et al., (Berclaz et al., 2011) 60.0 % 66.0 % n/a n/a n/a

Andriyenko et al., (Andriyenko and Schindler, 2011) 76.1 % 81.4% n/a n/a 15

Our Approach 81.44% 77.74% 7.83% 13.91% 24

PETS’09 S2.L2

Breitenstein et al., (Breitenstein et al., 2011) 51.3% 50.0% n/a n/a n/a

Our Approach 60.14% 55.54% 0.85% 40.83% 287

PETS’09 S2.L3

Breitenstein et al., (Breitenstein et al., 2011) 52.1% 67.5% n/a n/a n/a

Our Approach 50.08% 67.71% 0.0% 29.84% 107

metrics which have become de facto standard in the

ﬁeld of multi-object tracking. MOTP considers the

average error of tracked positions over the whole se-

quence. False positives, misses, and mismatches com-

pose MOTA that aims to estimate the tracker’s ability

of recognition and consistency.

The ground truth in the ﬁrst view was provided by

Anton Andriyenko (Andriyenko and Schindler, 2011).

We back project our tracking results to this single view

for measurement. The assignment between tracking

and the ground truth in image coordinate system has a

threshold of

pixels according to the average width

of people appeared in the view. Because of the char-

acter of tracking-by-detection framework and in order

to have a fair evaluation on tracking, we provide the

false positive rate and the false negative rate for the de-

tection by removing objects’ Id numbers in the ground

truth data. The detector we use has a distinct perfor-

mance of different videos in the ﬁrst view: the false

positive rate is

0.05

0.01

, and

0.0

, while the missing

rate is

0.1

0.6

, and

0.5

for S2.L1, S2.L2, and S2.L3

respectively.

Tab. 1 shows the quantitative results of our ap-

proach compared to the state-of-the-art methods. From

the results of S2.L1, it can be seen that we have the

highest MOTP, which indicates the precision of tar-

get localization of our method. MOTA for S2.L1 is

also comparable with others. Additionally, the pixel

wise precisions for individual persons are shown as

boxplots in Fig. 4. We can state that the average bias

of the tracked objects’ localization is approximately

10 pixels, which is probability the same error between

different ground truth data labeled by diverse persons.

Besides, our approach has the highest MOTP and

MOTA on S2.L2 sequence, which is partially because

of the allowance of linking between hypotheses with

number of frame gaps in the ﬁrst graph. This beneﬁt

is also obvious when comparing detection to track-

ing with their missing rates. For S2.L3, our tracking

performance is also comparable with the work that

reported their results.

Therefore, from Tab. 1, we can see that our method

has comparable or relatively better results for the three

sequences on PETS’09. Although the numbers of Id

switches are slightly higher, the tracker keeps track-

ing after switching. This happens when people are

merging and splitting which cause wrong relatively

low costs.

Qualitative results of PETS’09 S2.L1, S2.L2 and

S2.L3 from different number of cameras are shown

in Fig. 5. We visualize the trajectories from tracking

results and obtain corresponding rectangles of each

frame by ﬁnding the nearest 2D detections conducted

on the frame. From the ﬁgure we can see that our

approach is able to consistently track people who have

already been occluded by others or the obstacle in the

scene for a long time. In the third view of S2.L1 where

the scenario is under bad lightness condition, we can

still recognize and recover the trajectories using the

data from other views, which indicate the beneﬁt of

utilization of multiple cameras.

4.4 Complexity and Runtime

The entire system was realized in c++. Our approach

has a complexity of

O(k · (n log n + m))

, where

the number of vertices,

is the number of edges, and

is the number of min-cost paths to be found in the

respective graph. For 6 cameras with

795

frames each

in S2.L1, it needs approximately

1.5

minutes for the

extraction of 100 tracklets from a

1.5 ×10

node hy-

pothesis graph and linking of ﬁnal 25 tracks from the

tracklet graph. Similar runtime is needed for the other

two sequences we conducted. All measurements were

conducted on a standard desktop computer with an

Intel



Core

i5-760 CPU (2.80GHz).

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

348

Figure 5: Qualitative results of our approach on PETS’09 S2.L1, S2.L2, and S2.L3 sequences. First four rows are for S2.L1

from four cameras separately. The ﬁfth row shows the results from S2.L2 in the ﬁrst view and results for the ﬁrst view of S2.L3

are shown in the sixth row. Different colors and shapes indicate the different identities of the targets. The trajectories are shown

by linking previous tracking results for up to 20 frames. Rectangles of the tracks are obtained by back projecting the tracking

results to individual views and ﬁnding the nearest 2D detections. Therefore, the trajectories without rectangles denote the

frames where existing missing detections while the tracker can still keep tracking.

AGraph-basedMAPSolutionforMulti-personTrackingusingMulti-cameraSystems

349

5 CONCLUSIONS AND

OUTLOOK

In this work, we ﬁrstly had a review on the recent stud-

ies of multi-object tracking using a single camera or

multiple cameras and discussed the recent researches

for global optimization based on ﬂow networks or

graphs. After the formulation of data association into

two MAP problems. we proposed a two-stage graph-

based multi-person multi-camera tracking approach.

Firstly, a hypothesis graph was constructed to extract

possible associated tracklets from reconstructed 3D

detections. While most of the papers considered global

features only, we incorporated local features such as

appearance, size into the computation of costs for the

edges in the hypothesis graph. Importantly, the hy-

potheses for tracking on the ground plane were arose

from the reconstructions of 2D detections from each

view at the same time step. Those reconstructed 3D de-

tections who were regarded to be the same object were

replaced by the one with the minimum back projection

error. Consequently, the task of the second graph was

to link tracklets into complete tracks. For this sake,

the cost function accordingly took temporal and spa-

tial distances into account. All in all, this framework

is general for multi-object tracking in multi-camera

systems.

From our experiments, we conclude that it is im-

portant to have the optimal outcome from the ﬁrst

step of hypothesis generation for multi-object track-

ing in multi-camera systems. Due to the impact of

calibration and object detection errors, the precision

of recognizing of identical objects can be improved

by more restrict constraints. Hence, in the future, we

would like to focus on the modeling of calibration and

detection errors and incorporating them into the frame-

work. Incremental learning might be able to reﬁne

the modeling as more and more frames are processed.

Additionally, the tracklet linking stage could consider

more information such as histogram of motion in the

cost function to reduce the false positive rate and the

number of Id switches.

REFERENCES

Andriyenko, A. and Schindler, K. (2011). Multi-target track-

ing by continuous energy minimization. In CVPR,

pages 1265–1272.

Berclaz, J., Fleuret, F., Turetken, E., and Fua, P. (2011).

Multiple object tracking using k-shortest paths opti-

mization. TPAMI, 33:1806–1819.

Bernardin, K. and Stiefelhagen, R. (2008). Evaluating multi-

ple object tracking performance: The clear mot metrics.

EJIVP, 246309.

Bredereck, M., Jiang, X., K

orner, M., and Denzler, J. (2012).

Data association for multi-object tracking-by-detection

in multi-camera networks. In ICDSC.

Breitenstein, M. D., Reichlin, F., Leibe, B., Koller-Meier, E.,

and Gool, L. V. (2011). Online muti-person tracking-

by-detection from a single, uncalibrated camera. PAMI,

33:1820 – 1833.

Collins, R. T. (2012). Multitarget data association with

higher-order motion models. In CVPR.

Dijkstra, E. W. (1959). A note on two problems in connexion

with graphs. NUMERISCHE MATHEMATIK, 1:269–

271.

Felzenszwalb, P., Girshick, R., and McAllester, D. (2010).

Cascade object detection with deformable part models.

In CVPR.

Ferryman, J. and Shahrokni, A. (2009). Pets2009: Dataset

and challenge. In 2009 Twelfth IEEE International

Workshop on Performance Evaluation of Tracking and

Surveillance (PETS-Winter).

Fleuret, F., Berclaz, J., Lengagne, R., and Fua, P. (2008).

Multi-camera people tracking with a probabilistic oc-

cupancy map. TPAMI, 30:267–282.

Henriques, J. F., Caseiro, R., and Batista, J. (2011). Globally

optimal solution to multi-object tracking with merged

measurements. In ICCV.

Hofmann, M., Wolf, D., and Rigoll, G. (2013). Hypergraphs

for joint multi-view reconstruction and multi-object

tracking. In CVPR.

Huang, C., Wu, B., and Nevatia, R. (2008). Robust ob-

ject tracking by hierarchical association of detection

responses. In ECCV, pages 788–801.

J. Yang, Z. Shi, P. V. and Teizer, J. (2009). Probabilistic mul-

tiple people tracking through complex situations. In

IEEE Workshop Performance Evaluation of Tracking

and Surveillance.

Jiang, X., Haase, D., K

orner, M., Bothe, W., and Denzler,

J. (2013). Accurate 3d multi-marker tracking in x-ray

cardiac sequences using a two-stage graph modeling

approach. In the 15th Conference on Computer Analy-

sis of Images and Patterns (CAIP).

Jiang, X., Rodner, E., and Denzler, J. (2012). Multi-

person tracking-by-detection based on calibrated multi-

camera systems. In ICCVG, pages 743–751.

Leal-Taix

e, L., Pons-Moll, G., and Rosenhahn, B. (2012).

Branch-and-price global optimization for multi-view

multi-target tracking. In CVPR, pages 1987–1994.

Satoh, Y., Okatani, T., and Deguchi, K. (2004). A color-

based tracking by kalman particle ﬁlter. In ICPR, pages

502–505.

Wu, Z., Kunz, T. H., and Betke, M. (2011). Efﬁcient track

linking methods for track graphs using network-ﬂow

and set-cover techniques. In CVPR, pages 1185–1192.

Wu, Z., Thangali, A., Sclaroff, S., and Betke, M. (2012).

Coupling detection and data association for multiple

object tracking. In CVPR.

Xing, J., Ai, H., and Lao, S. (2009). Multi-object track-

ing through occlusions by local tracklets ﬁltering and

global tracklets association with detection responses.

In CVPR.

Zhang, L., Li, Y., and Nevatia, R. (2008). Global data

association for multi-object tracking using network

ﬂows. In CVPR.

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

350