PEOPLE COUNTING WITH STEREO CAMERAS

Two Template-based Solutions

Tim van Oosterhout, Ben Kröse (1,2) and Gwenn Englebienne

University of Applied Sciences of Amsterdam, Amsterdam, The Netherlands

University of Amsterdam, Amsterdam, The Netherlands

Keywords:

Stereo Camera, People Counting, Template Projection.

Abstract:

People counting is a challenging task with many applications. We propose a method with a fixed stereo camera that is based on projecting a template onto the depth image. The method was tested on a challenging outdoor dataset with good results and runs in real time.

1 INTRODUCTION

For events such as large-scale concerts, fairs and out-

door art festivals, it is important to know the number

of people attending. This number can be inferred if

people are counted at all entrances to the area. Manual

counting is very accurate but is costly since humans

can only perform the counting task for a limited time.

Mechanical solutions such as shaft devices restrict the

throughput and placement may be impractical. Infra-

red sensing devices are only applicable with one per-

son at a time in the passage. Therefore, cameras have

been suggested for automatic counting of people.

Figure 1: Example frame from the (stereo) camera.

This paper focuses on counting people at an outdoor event. We developed and compared two template-based methods for people detection that are used for counting with stereo cameras. Our methods have the following advantages: (1) they are robust to changes in lighting; (2) they incorporate prior knowledge about locations where people may walk.

2 RELATED WORK

When using static cameras, a model of the background is usually learned to classify pixels as foreground or background. A noise-cleaning step is performed to eliminate regions that are too small or too short-lived. Connected components of pixels can then be found, resulting in foreground “blobs”. However, a single person may give rise to multiple blobs, and parts of multiple people can be combined into a single blob because of visual overlap. In (Bahadori et al., 2007) an approach is described where the foreground pixels are mapped to the floor plane, where the segmentation is done.

Template-based methods typically model the fore-

ground but not the background. Examples are the

Viola-Jones face detector (Viola and Jones, 2001),

or the use of edge templates for pedestrian detection

(Gavrila, 2000). They have been applied successfully

to a variety of situations, including the tracking of

rigid objects (Nguyen et al., 2001) and segmenting

and tracking humans in crowded scenes (Zhao et al.,

2007). However, such methods typically require very

large training sets (Gavrila, 2000), adapting the tem-

plates to the test data over time (Nguyen et al., 2001)

or extra parameters which need to be optimized.

Our methods take the best of both worlds. As we will discuss in section 3, we model the background and additionally use the shape of the foreground.

van Oosterhout T., Kröse B. and Englebienne G.

PEOPLE COUNTING WITH STEREO CAMERAS - Two Template-based Solutions.

DOI: 10.5220/0003865304040408

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 404-408

ISBN: 978-989-8565-04-4

© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Figure 2: Illustration and dimensions of the template used (dimension labels in the figure: 120cm, 8cm, 17cm, 26cm, 170cm, 24cm, 40cm, 20cm, 16cm). Note that the total width of 80cm might seem large but is needed to accommodate foreground fattening.

3 METHODS

Our approach uses a stereo camera that looks straight

down and has the volume delimited by the ground

area and about 2 meters above it in view, so that peo-

ple moving through the passage will be fully visible

while they do so. Figure 1 gives a typical frame.

We define a 2.5D range image $\mathbf{r}$ for which the pixel values $r_i$ are zero on the floor and positive towards the camera. In the first method a 2D foreground region is extracted from the range image onto which 3D shapes are projected. This method is based on (Englebienne and Kröse, 2010), but now using range data instead of color images. In the second method 3D shapes are used to reconstruct the 2.5D range image, such that a similarity measure is maximized.

For both methods we use a simplified 3D model of a human being of average size. The model consists of several stacked cone segments, as illustrated in figure 2. The advantage of not including limbs in the model is that it makes the model rotationally invariant in the horizontal plane and forgoes the multitude of poses people could possibly assume, at the cost of leaving small foreground sections unaccounted for.

Using the camera’s intrinsic and extrinsic parameters we can project the template at any position. For both methods we generate hypotheses about the number and locations of people and compare the fit under projection. By scanning the passage at discrete intervals we can find and localize any number of people.

3.1 Method 1: Generative Model

We consider the observation vector $\mathbf{r}$, containing the range image data. Each pixel $r_i$ is assigned a probability that it is foreground, $p_{\mathrm{fg}}(r_i)$, using a dynamic statistical background model. We denote the vector of pixelwise foreground probabilities as $p_{\mathrm{fg}}(\mathbf{r})$. Similarly, we use $p_{\mathrm{bg}}(r_i)$ and $p_{\mathrm{bg}}(\mathbf{r})$ to denote the background probabilities. We further introduce $l$ to indicate the location of a person and $\mathcal{L} = \{l_1 \ldots l_n\}$ to indicate a set of locations corresponding to multiple people.

Figure 3: Three templates have been placed manually for illustration. The middle template is in between two people but seems to match well because only the silhouette is preserved. The range image (figure 4(a)) provides more detail.

The quantity we are interested in is the posterior probability of the people given an observed image, $p(\mathcal{L} \mid \mathbf{r})$. We compute this using Bayes’ theorem,

$$p(\mathcal{L} \mid \mathbf{r}) = \frac{p(\mathbf{r}, \mathcal{L})}{\sum_{\mathcal{L}'} p(\mathbf{r}, \mathcal{L}')} , \qquad (1)$$

where $p(\mathbf{r}, \mathcal{L})$ is the joint probability of the observed image and the locations of the visible people. We detail how this quantity is computed below.
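As a small illustration of equation (1), the sketch below normalises a handful of joint probabilities into posteriors. The hypotheses and their values are made up for illustration, and the function name `posterior` is not taken from the paper.

```python
def posterior(joint):
    """Normalise joint probabilities p(r, L) over all hypotheses L (equation 1)."""
    total = sum(joint.values())
    if total == 0:
        raise ValueError("all hypotheses have zero joint probability")
    return {hypothesis: p / total for hypothesis, p in joint.items()}


# Hypothetical joint probabilities for three candidate location sets:
# nobody, one person at (3, 4), and two people at (3, 4) and (7, 1).
joint = {
    (): 0.02,
    ((3, 4),): 0.06,
    ((3, 4), (7, 1)): 0.12,
}
post = posterior(joint)
best = max(post, key=post.get)  # hypothesis with the highest posterior
```

Because the denominator is the same for every hypothesis, ranking hypotheses by the joint probability gives the same winner as ranking by the posterior, which is why the later search only needs the joint.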

3.1.1 Modelling People’s Locations

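In (Englebienne and Kröse, 2010) a “mask” vector $\mathbf{m}$ is defined, with the same dimensionality $D$ as the observed frame, which is a vector whose elements are one for the components of foreground pixels and zero otherwise. This allows us to express the probability of an image, given a mask, as

$$p(\mathbf{r} \mid \mathbf{m}) = \mathbf{m} \cdot p_{\mathrm{fg}}(\mathbf{r}) + (\mathbf{1} - \mathbf{m}) \cdot p_{\mathrm{bg}}(\mathbf{r}) \qquad (2)$$

where $p_{\mathrm{fg}}(\mathbf{r})$ and $p_{\mathrm{bg}}(\mathbf{r})$ are the pixelwise foreground and background probabilities, $\cdot$ indicates the elementwise product and $\mathbf{1}$ is a $D$-dimensional vector containing all ones.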
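The elementwise combination in equation (2) can be sketched in a few lines. The images are flattened to plain lists here, and the probability values are invented for illustration only.

```python
def likelihood_given_mask(mask, p_fg, p_bg):
    """Elementwise m * p_fg + (1 - m) * p_bg for flattened images (equation 2)."""
    if not (len(mask) == len(p_fg) == len(p_bg)):
        raise ValueError("mask and probability vectors must have the same length")
    return [m * f + (1 - m) * b for m, f, b in zip(mask, p_fg, p_bg)]


# Hypothetical four-pixel image: the two middle pixels look like foreground.
mask = [0, 1, 1, 0]
p_fg = [0.1, 0.9, 0.8, 0.2]
p_bg = [0.9, 0.1, 0.2, 0.8]
per_pixel = likelihood_given_mask(mask, p_fg, p_bg)
```

A mask that agrees with the background model selects the larger of the two probabilities at every pixel, so a well-placed template raises the per-pixel values across the image.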

We can project the 3D template of a person in any location on the ground plane. This is illustrated in figure 3. We can compute which pixels would belong to the object, resulting in a mask vector $\mathbf{m}$. Because of our approximations, the mask $\mathbf{m}_l$ depends only on the position, $l = (x, y)$, of the person on the ground plane.

When more than one person is considered, we take the union of the masks created for each person. The resulting mask contains all pixels that fall inside the contours of templates located at the set of locations $\mathcal{L} = \{l_1, \ldots, l_n\}$, and is denoted as $\mathbf{m}_{\mathcal{L}}$.

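We can now express the complete likelihood, i.e. the joint probability of an observed image $\mathbf{r}$ and that a number of people should be present in locations $\mathcal{L}$:

$$p(\mathbf{r}, \mathcal{L}) = p(\mathcal{L})\, p(\mathbf{r} \mid \mathcal{L}) = p(\mathcal{L}) \left[ \mathbf{m}_{\mathcal{L}} \cdot p_{\mathrm{fg}}(\mathbf{r}) + (\mathbf{1} - \mathbf{m}_{\mathcal{L}}) \cdot p_{\mathrm{bg}}(\mathbf{r}) \right] \qquad (3)$$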

Here, $p(\mathcal{L})$ indicates the prior probability that there should be $n$ people in the set, and that these people are located in the locations $\mathcal{L}$. In (Englebienne and Kröse, 2010) it is explained that this probability can be factorized into components given the priors on the number of persons, $p(|\mathcal{L}|)$, the location, $p(l_i)$, and the distance between persons, $p(D(l_i, l_j))$. In our work the prior on location was set manually to represent the regions in which people can appear. In figure 3 this area $S$ is delineated in green. The prior probability is 0 for locations outside $S$ and uniform inside $S$.
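A minimal sketch of such a location prior follows, assuming the region S is stored as a set of discrete ground-plane cells. That representation and the name `location_prior` are assumptions made for illustration.

```python
def location_prior(location, region):
    """Uniform prior over the cells in `region`, zero everywhere else."""
    if not region:
        raise ValueError("the region S must contain at least one cell")
    return 1.0 / len(region) if location in region else 0.0


# Hypothetical walkable region S: a 4 x 3 block of ground-plane cells.
region = {(x, y) for x in range(2, 6) for y in range(0, 3)}
inside = location_prior((2, 0), region)
outside = location_prior((9, 9), region)
```

A zero prior removes a location from the search entirely, so restricting S is also a cheap way to shrink the set of candidate positions.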

3.1.2 Inferring the Position of Multiple People

We can now infer the probability of a set of person locations $\mathcal{L}$ using Bayes’ rule. However, even a low-resolution 320 × 240 pixel image yields $76800^m$ combinations of locations for $m$ people. In (Englebienne and Kröse, 2010) it was shown that the search space can be reduced and that a fast greedy solution can be applied, similar to the one suggested by (Williams and Titsias, 2004).

1. Compute $p(\mathcal{L} = \emptyset)$, i.e. no people.

2. Compute and store $p(l_i)$ for all $l_i \in S$.

3. Find the most likely position of the first person.

4. If this position improves the likelihood, add it to $\mathcal{L}$, otherwise exit the algorithm.

5. Find the next most likely position. Go to 4.

This algorithm results in the most likely number and

locations of the detected persons.
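The greedy search above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: pixels are treated as independent so that the log of equation (3) becomes a sum, the prior is reduced to a fixed log penalty per person (`log_prior_per_person`), and `masks` is a hypothetical precomputed table mapping each candidate location to the set of pixel indices its projected template covers.

```python
import math


def log_joint(locations, masks, p_fg, p_bg, log_prior_per_person):
    """Log joint score of image and locations, assuming independent pixels."""
    covered = set()
    for loc in locations:
        covered |= masks[loc]  # union of the per-person masks
    score = len(locations) * log_prior_per_person
    for i, (f, b) in enumerate(zip(p_fg, p_bg)):
        score += math.log(f if i in covered else b)
    return score


def greedy_detect(candidates, masks, p_fg, p_bg, log_prior_per_person=-1.0):
    """Add people one at a time for as long as the score keeps improving."""
    chosen = []
    best = log_joint(chosen, masks, p_fg, p_bg, log_prior_per_person)  # step 1
    remaining = list(candidates)
    while remaining:
        # Steps 3 and 5: best position for the next person. For clarity this
        # rescans every candidate instead of caching the scores of step 2.
        score, loc = max(
            (log_joint(chosen + [c], masks, p_fg, p_bg, log_prior_per_person), c)
            for c in remaining
        )
        if score <= best:  # step 4: no improvement, stop
            break
        chosen.append(loc)
        remaining.remove(loc)
        best = score
    return chosen


# Hypothetical six-pixel image with foreground evidence at both ends.
p_fg = [0.9, 0.9, 0.1, 0.1, 0.9, 0.9]
p_bg = [0.1, 0.1, 0.9, 0.9, 0.1, 0.1]
masks = {"a": {0, 1}, "b": {2, 3}, "c": {4, 5}}
found = greedy_detect(["a", "b", "c"], masks, p_fg, p_bg)
```

In this toy example the templates at "a" and "c" are accepted, while "b" is rejected because it would cover pixels that the background model already explains well.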

3.2 Method 2: Reconstruction of Range

Image

Again we consider a vector r containing range data.

We construct a range image h and maximize the sim-

ilarity between r and h. The reconstructed image h is

a function of L and the objective is to ﬁnd the L that

minimizes the distance D(r, h(L )).

(a) The depth image by the stereo camera. Red regions rep-

resent pixels for which no stereo value could be computed.

(b) Adding 2.5D templates to the background to minimise

the difference between the range image and the reconstruc-

tion.

Figure 4: Range image reconstruction.

3.2.1 Modelling People’s Locations

We construct an image h(L ) with range values for

each pixel. We use the same 3D model as in method

1, tessellate it into triangles and render it on a new

image h. Self-occlusion and other occlusion handling

is easy since the calculated values represent distances

to the camera. An example is shown in ﬁgure 4.

The method can search for any number of people by computing the union of templates. We can then compute the difference between $h$ and the range image $\mathbf{r}$ as such:

$$D(\mathbf{r}, h(\mathcal{L})) = \sum_{i=1}^{D} (r_i - h_i)^2 \qquad (4)$$

3.2.2 Inferring the Position of Multiple People

We use the same greedy approach as we used in method 1. To account for the fact that only one person can occupy a physical space, an interpersonal distance $s$ is defined that specifies how close people can be to one another, and this is used in the selection of locations $l$.

1. Compute $h(\emptyset)$, i.e. no people.

2. Compute and store $D(\mathbf{r}, h(\mathcal{L} + l))$ for each location $l \in S$.

3. Find the location $m$ that reduces the distance most: $\forall l \in S : D(\mathbf{r}, h(\mathcal{L} + m)) \le D(\mathbf{r}, h(\mathcal{L} + l))$

4. If $D(\mathbf{r}, h(\mathcal{L} + m)) < D(\mathbf{r}, h(\mathcal{L}))$, add $m$ to $\mathcal{L}$ and remove locations within the interpersonal distance $s$ from consideration, otherwise stop.

5. Find the next best position $m$. Go to 4.

This algorithm results in the number and locations of

the persons such that the distance between the range

image and the reconstructed image is minimal.
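The reconstruction-based search can be sketched on a toy grid. Several simplifications here are assumptions and not the paper's implementation: the template is a small hypothetical height map rendered without perspective, occlusion is handled by keeping the highest surface per pixel, and the distance is taken to be a sum of squared differences.

```python
def render(locations, shape, template):
    """Reconstruct a range image: per pixel, the surface nearest the camera wins."""
    rows, cols = shape
    h = [[0.0] * cols for _ in range(rows)]  # empty floor, i.e. no people
    for cy, cx in locations:
        for (dy, dx), height in template.items():
            y, x = cy + dy, cx + dx
            if 0 <= y < rows and 0 <= x < cols:
                h[y][x] = max(h[y][x], height)
    return h


def distance(r, h):
    """Sum of squared differences between two range images (assumed metric)."""
    return sum((a - b) ** 2 for row_r, row_h in zip(r, h) for a, b in zip(row_r, row_h))


def greedy_reconstruct(r, candidates, template, s):
    """Add templates while the distance drops, keeping people at least s apart."""
    shape = (len(r), len(r[0]))
    chosen = []
    best = distance(r, render(chosen, shape, template))  # step 1
    remaining = list(candidates)
    while remaining:
        d, loc = min(  # steps 2, 3 and 5
            (distance(r, render(chosen + [c], shape, template)), c) for c in remaining
        )
        if d >= best:  # step 4: no reduction, stop
            break
        chosen.append(loc)
        best = d
        # Drop every candidate within the interpersonal distance s of the new person.
        remaining = [
            c for c in remaining
            if (c[0] - loc[0]) ** 2 + (c[1] - loc[1]) ** 2 > s ** 2
        ]
    return chosen


# Hypothetical plus-shaped template (heights in metres) and a scene with two people.
template = {(0, 0): 1.7, (0, 1): 1.4, (0, -1): 1.4, (1, 0): 1.4, (-1, 0): 1.4}
shape = (5, 9)
r = render([(2, 2), (2, 6)], shape, template)
candidates = [(y, x) for y in range(shape[0]) for x in range(shape[1])]
found = greedy_reconstruct(r, candidates, template, s=2)
```

Because the range values are kept rather than only a silhouette, a template placed between the two people would leave a large residual and is never selected in this example.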

3.3 Tracking

The output for each method is the number and loca-

tions of people present in a single video frame. To

establish whether people are entering or leaving the

area of interest these individual observations must be

linked over time. We use a basic tracker that predicts

people’s next locations and smooths the data using a

Kalman ﬁlter per person.
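To make the prediction and smoothing step concrete, here is a minimal constant-velocity Kalman filter for one coordinate of one person. The motion model, the noise variances and the class name `AxisKalman` are assumptions chosen for illustration, since the text does not specify the tracker's state model.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for a single coordinate (illustrative)."""

    def __init__(self, position, process_var=0.01, measurement_var=0.25):
        self.pos, self.vel = float(position), 0.0
        # 2 x 2 state covariance stored as four scalars.
        self.p00, self.p01, self.p10, self.p11 = 1.0, 0.0, 0.0, 1.0
        self.q, self.r = process_var, measurement_var

    def predict(self, dt=1.0):
        """Propagate state and covariance one time step; return predicted position."""
        self.pos += dt * self.vel
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.pos

    def update(self, measured):
        """Correct the state with a measured position; return smoothed position."""
        s = self.p00 + self.r  # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s  # Kalman gain
        residual = measured - self.pos
        self.pos += k0 * residual
        self.vel += k1 * residual
        self.p00, self.p01, self.p10, self.p11 = (
            (1 - k0) * self.p00,
            (1 - k0) * self.p01,
            self.p10 - k1 * self.p00,
            self.p11 - k1 * self.p01,
        )
        return self.pos


# A person walking one ground-plane unit per frame along one axis.
kf = AxisKalman(0.0)
for frame in range(20):
    kf.predict()
    kf.update(float(frame))
next_position = kf.predict()
```

Running one such filter per axis and per person gives both a smoothed track and a predicted next location that can be matched against the detections in the following frame.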

4 EXPERIMENTS

The methods were tested on an especially challenging dataset. The data was recorded in cooperation with a day-long outdoor music festival. Recording started when the area was opened to visitors and ended after the last visitor left. In total the video covered about 11 hours and contained over 498,000 frames. The proposed methods’ counts are compared against a manually created ground truth and additionally against a simpler method. Using the manual figures, a busy sequence (Fig. 1) and a quiet sequence were selected for comparison, in addition to a sequence in bad lighting conditions (Fig. 5). The details are described next.

4.1 Alternative Method

Figure 5: At the end of the sequence the lighting conditions have changed dramatically. Despite this the stereo matching algorithm is still able to generate decent range images.

The described methods are compared against a naïve method that works on the same range image. This method creates a foreground mask that depends directly on the values in the depth map as such:

$$m_i = \begin{cases} 1, & \text{if } g - r_i > c \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

with $g$ representing ground level and $c$ a cut-off value chosen at waist height. In this cross-section people will appear as islands. These are then detected using a connected component algorithm and tracked.
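The height threshold of equation (5) followed by connected-component detection can be sketched as below. The sketch assumes that r holds distances to the camera and g the camera-to-ground distance, so that g − r is the height above the floor. That reading, the 4-connectivity and the example values are assumptions.

```python
from collections import deque


def height_mask(r, g, c):
    """Equation (5): 1 where g - r_i exceeds the cut-off c, otherwise 0."""
    return [[1 if g - value > c else 0 for value in row] for row in r]


def count_islands(mask):
    """Count 4-connected components of ones with a breadth-first flood fill."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for y in range(rows):
        for x in range(cols):
            if mask[y][x] and not seen[y][x]:
                count += 1
                seen[y][x] = True
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Hypothetical scene: camera 3.0 m above the floor, two people side by side.
r = [
    [3.0, 3.0, 3.0, 3.0, 3.0],
    [3.0, 1.3, 3.0, 1.4, 3.0],
    [3.0, 1.5, 3.0, 1.6, 3.0],
    [3.0, 3.0, 3.0, 3.0, 3.0],
]
separate = count_islands(height_mask(r, g=3.0, c=1.0))

# The same scene when a shoulder or bag fills the gap above waist height.
r_touching = [row[:] for row in r]
r_touching[1][2] = 1.9
merged = count_islands(height_mask(r_touching, g=3.0, c=1.0))
```

The second case shows the weakness of this baseline: as soon as anything above the cut-off bridges two people, their islands fuse and they are counted as one.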

4.2 Performance

The naïve method ran in real time while recording, at the full capture rate, which was set at 25 fps. The generative and reconstructive methods used the recording and did stereo processing as well as their own computation. They ran single-threaded on an Intel Core 2 Duo 8400. The reconstruction method was slowest, achieving frame rates varying from 20 fps to 6 fps depending on the number of people visible.

4.3 Results

Results were collected for people entering and exit-

ing separately. People turning around in view were

discarded, as were tracks for which either the starting

or ending point could not be established. Table 1 lists

the number of people reported per method and per se-

quence. To better compare the individual methods ta-

ble 2 lists the counting error relative to the ground

truth.
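The relative counting error can be reproduced from the raw counts as (reported − manual) / manual. The short sketch below does this for the "Entire video" row of Table 1. The function name is illustrative.

```python
def relative_error(count, manual):
    """Counting error relative to the manual ground truth, in percent."""
    if manual == 0:
        raise ValueError("manual count must be non-zero")
    return 100.0 * (count - manual) / manual


# "Entire video" row of Table 1: (In, Out) counts per method.
manual_in, manual_out = 14276, 10934
reported = {
    "Height Threshold": (4871, 3688),
    "Generative model": (10046, 6741),
    "Reconstruction model": (11425, 8789),
}
for method, (count_in, count_out) in reported.items():
    error_in = relative_error(count_in, manual_in)
    error_out = relative_error(count_out, manual_out)
    print(f"{method}: In {error_in:+.1f}%, Out {error_out:+.1f}%")
```

Negative values indicate undercounting and positive values overcounting.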

5 DISCUSSION

A first striking result is that all methods seem to undercount. This is easily explained, since the methods impose a number of constraints that each track must satisfy to be counted at all. Overcounting seems harder to explain but is seen when people bring in baby carriages or when children carry helium-filled balloons.

Table 1: Numeric results per method.

              Manual                Height Threshold      Generative model      Reconstruction model
              In         Out        In         Out        In         Out        In         Out
Entire video  14276      10934      4871       3688       10046      6741       11425      8789
Busy          779        1124       109        230        331        461        557        719
Quiet         417        43         273        23         435        56         446        46
Night         22         597        29         237        15         482        45         549

Table 2: Error percentages per method.

              Height Threshold      Generative model      Reconstruction model
              In         Out        In         Out        In         Out
Entire video  -65.9%     -66.3%     -29.6%     -38.3%     -20.0%     -19.6%
Busy          -86.0%     -79.5%     -57.5%     -59.0%     -28.5%     -36.0%
Quiet         -34.5%     -46.5%     4.3%       30.2%      7.0%       7.0%
Night         31.8%      -60.3%     -31.8%     -19.3%     104.5%     -8.0%

Undercounting is most prominent in the naïve method. The reason is that people do not produce separate blobs when the cut-off is applied. Raising the cut-off height will not solve this problem until it is raised to above shoulder height, but then short people will be overlooked. Moreover, with foreground fattening (Scharstein and Szeliski, 2002) people’s blobs may merge even at head height.

Another remarkable result is the error of over 100% for people entering in the night sequence with the reconstruction method. The only people entering in that sequence are personnel bringing in objects such as trash bins, as illustrated in figure 5, which match the template closely enough.

6 CONCLUSIONS

We have shown two novel template-based people counting and localisation methods that work with range images from stereo cameras. The methods were tested on a dataset that featured many people in view at once, a changing background and big changes in lighting conditions. We found that as the methods make more use of the available information from the range image, the detection and tracking results improve. The methods run in real time, making them suitable for live deployment.

ACKNOWLEDGEMENTS

The research reported in this paper was supported

by the Foundation Innovation Alliance (SIA - Sticht-

ing Innovatie Alliantie) with funding from the Dutch

Ministry of Education, Culture and Science (OCW),

in the framework of the ‘Mens voor de Lens’ project.

REFERENCES

Bahadori, S., Iocchi, L., Leone, G., Nardi, D., and Scozzafava, L. (2007). Real-time people localization and tracking through fixed stereo vision. Applied Intelligence, 26(2):83–97.

Englebienne, G. and Kröse, B. (2010). Fast Bayesian people detection. In Proceedings of the 22nd Benelux AI Conference (BNAIC 2010).

Gavrila, D. (2000). Pedestrian detection from a moving vehicle. Computer Vision - ECCV 2000, pages 37–49.

Nguyen, H., Worring, M., and Van Den Boomgaard, R. (2001). Occlusion robust adaptive template tracking. Computer Vision, IEEE International Conference on, 1:678.

Scharstein, D. and Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1):7–42.

Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 1:511.

Williams, C. and Titsias, M. (2004). Greedy learning of multiple objects in images using robust statistics and factorial learning. Neural Computation, 16(5):1039–1062.

Zhao, T., Nevatia, R., and Wu, B. (2007). Segmentation and tracking of multiple humans in crowded environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1198–1211.
