ona et al., 2009) to be one of the best approaches, it
raises many false alarms. This work has, however, recently
been extended in (Bayona et al., 2010), where false
alarms caused by moving objects are prevented by
building a mask of moving regions.
Effort has been made to achieve robustness
to occlusions. However, methods based on sub-
sampling, which were shown to perform best
(Bayona et al., 2009), rely on logical opera-
tions which cannot guarantee that the same object is
observed.
In recent years, multi-camera object localisation
has been studied extensively. In (Beynon et al., 2003) the
authors make a ground plane assumption and can thus
easily retrieve the world coordinates. A cost func-
tion based on colour, blob area and position is built to
measure the similarity of 2D observations to already
observed 3D world objects. They use a linear assign-
ment problem algorithm to perform an optimal asso-
ciation between observations and tracked objects. In
(Miezianko and Pokrajac, 2008) the authors also as-
sume that the 3D scene is planar. Once they have lo-
cated an object in a camera, it is projected onto the 2D
plane using a homography. The location of objects
are the local maxima of overlap in the orthoimage. In
(Utasi and Csaba, 2010) the authors define an energy
function based on geometric features depending on
the position and height of objects and which is max-
imal for the real configuration. The optimal configu-
ration is found using multiple birth and death dynam-
ics, an iterative stochastic optimisation process. In
(Fleuret et al., 2008) the authors discretise the ground
plane into a grid. A rectangle modelling a human sil-
houette is projected on cameras from each position
on the grid. This serves as an evidence of the occu-
pancy of the ground by a person. In (Khan and Shah,
2009) the authors introduce a planar homographic oc-
cupancy constraint which fuses foreground informa-
tion from multiple cameras. This constraint brings
robustness to occlusion and allows the localisation of
people on a reference plane.
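Several of the methods above map image points onto a common ground plane with a homography. As an illustrative sketch (the matrix H below is a made-up example, not a calibration from any of the cited works), a pixel is projected to plane coordinates as follows:

```python
import numpy as np

def project_to_plane(H, pixel):
    """Map an image pixel (u, v) to ground-plane coordinates via homography H."""
    u, v = pixel
    p = H @ np.array([u, v, 1.0])  # homogeneous coordinates
    return p[:2] / p[2]            # dehomogenise

# Hypothetical homography (would normally come from camera calibration).
H = np.array([[0.02, 0.00, -5.0],
              [0.00, 0.03, -4.0],
              [0.00, 0.00,  1.0]])

plane_xy = project_to_plane(H, (400, 300))  # ground-plane coordinates of the pixel
```

Local maxima of overlap between such projections from several cameras then yield object locations, as in (Miezianko and Pokrajac, 2008).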
Among these methods, some assume that the 3D
world is planar because they rely on homographies,
others because they must reduce the search space
of their optimisation process. We propose a di-
rect matching method which enables the computation
of the 3D positions and heights of stationary objects.
3 OBJECT DETECTION
Our stationary object detection algorithm can be di-
vided into three main steps. First, a background sub-
traction stage generates an image containing the age
of the re-identified foreground. This information is
then used to segment the visible stationary objects.
Finally, a binary mask is updated for each stationary
object.
3.1 Background Subtraction
We use the background subtraction algorithm from
(Guillot et al., 2010) and extend it by also building
a foreground model. The original image is tiled into
a regular square grid of 8 × 8 blocks on which over-
lapping descriptors are computed. The background
subtraction therefore generates an image whose pixels
correspond to the blocks of the original image.
To this end, a descriptor is computed at each
block. If it does not match the background model, it
is checked against the foreground model. If a match
is found in the foreground model, that component is
updated; otherwise a new foreground component is
created and its time of creation is recorded. The
foreground model at a given block is emptied when
background is observed. The output of the background
subtraction stage is thus an image whose pixels contain
0 where background is observed, or the age of the
foreground descriptor otherwise.
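This per-block update can be sketched as follows. The Euclidean descriptor distance and the fixed threshold are placeholders for illustration only; the actual descriptors and matching rule are those of (Guillot et al., 2010):

```python
import numpy as np

MATCH_THRESHOLD = 0.5  # hypothetical descriptor-distance threshold

def update_age_image(descriptors, bg_model, fg_model, frame_idx):
    """Simplified sketch of the per-block foreground ageing.

    descriptors, bg_model: dicts mapping block index -> feature vector.
    fg_model: dict mapping block index -> (feature vector, creation frame).
    Returns an age image: 0 for background blocks, foreground age otherwise.
    """
    age = np.zeros(len(descriptors), dtype=int)
    for b, d in descriptors.items():
        if np.linalg.norm(d - bg_model[b]) < MATCH_THRESHOLD:
            fg_model.pop(b, None)  # background observed: empty the foreground model
            continue
        if b in fg_model and np.linalg.norm(d - fg_model[b][0]) < MATCH_THRESHOLD:
            fg_model[b] = (d, fg_model[b][1])  # update existing component
        else:
            fg_model[b] = (d, frame_idx)       # new component: record creation time
        age[b] = frame_idx - fg_model[b][1]
    return age
```

A block that keeps matching its foreground component thus accumulates age frame after frame, while any return to background resets it.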
3.2 What We Want to Segment
Segmenting unknown stationary objects is a very dif-
ficult problem which we do not try to address in the
general case. For instance, if two objects appear at
the same time and are detected as a single blob in the
image, we do not try to separate them. What we want
is to give different labels to objects appearing
at different times, while giving a single label to an ob-
ject appearing under partial occlusion (e.g. a person
partially occludes a bag and then leaves). This is not an
easy task, since at the block level it is impossible to state
whether we are observing an object or an occluder.
To this end, we construct in Section 3.3 an energy func-
tion under the following assumption: blocks should
be grouped under the same label l when l is a compati-
ble label for all of these blocks.
3.3 Segmentation
Markov Random Fields are widely used in image seg-
mentation when the problem can be written as the
minimisation of an energy function. Let G = (V, E)
be a graph representing an image. Each vertex v ∈ V
corresponds to a pixel of the image, and each edge
e ∈ E ⊆ V × V corresponds to a neighbourhood rela-
tion. Let L be a set of labels. Each labelling x ∈ L^|V|
is assigned an energy, which we try to minimise. The
VISAPP 2012 - International Conference on Computer Vision Theory and Applications