proposed in (Albiol et al., 2009), in (Chan et al., 2008)
and in (Conte et al., 2010). All these methods were
submitted to the PETS 2009 and 2010 contests on people
counting and performed very well among the contest
participants. In particular, Albiol et al. propose the
use of corner points detected with the Harris algorithm
(Harris and Stephens, 1988). Static corner points,
which likely belong to the background, are removed by
computing motion vectors between adjacent frames.
Finally, the number of people is estimated from the
number of moving corner points, assuming a direct
proportionality relation between the two.
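As an illustration of the proportionality assumption, the following Python sketch calibrates a single factor k on a few annotated frames and applies it to a new frame. The function names, the sample values and the least-squares fit through the origin are assumptions made for this example; how the factor is actually obtained in (Albiol et al., 2009) is not stated here.

```python
def calibrate_factor(moving_points, people):
    """Fit people ~= k * moving_points by least squares through the origin.

    moving_points: number of moving corner points in each calibration frame
    people:        ground-truth number of people in the same frames
    (Hypothetical helper: the fitting rule is an assumption of this sketch.)
    """
    if len(moving_points) != len(people):
        raise ValueError("both sequences must have the same length")
    num = sum(p * n for p, n in zip(moving_points, people))
    den = sum(p * p for p in moving_points)
    if den == 0:
        raise ValueError("calibration frames contain no moving points")
    return num / den


def estimate_count(num_moving_points, k):
    """Estimate the number of people as k times the moving corner points."""
    return k * num_moving_points


# Illustrative calibration data (invented for the example).
k = calibrate_factor([40, 82, 118], [4, 8, 12])
print(round(estimate_count(60, k)))
```

A single global factor is what makes this scheme simple, and it is also the reason the estimate cannot adapt to distance or crowd density.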
Although Albiol’s method has proved to be considerably
more robust than its competitors, its accuracy is
limited because it takes into account neither
perspective effects nor the influence of people density
on the detection of corner points. Moreover, the Harris
corner detector is sometimes unstable for objects
moving towards the camera or away from it.
In (Conte et al., 2010), the authors propose a method
that provides a more accurate estimate of the number of
people by also considering the issues related to
perspective effects and occlusions. In particular, they
estimate the count with a trainable regressor (based on
the ε-SVR algorithm) suitably trained on the scene
under analysis. Tests performed on very crowded scenes
characterized by a large field depth showed substantial
performance improvements with respect to the method by
Albiol et al. However, this comes at the cost of
complex setup procedures for training the ε-SVR
regressor.
In this paper we describe a method that achieves
performance comparable to that of the method of Conte
et al., while retaining the overall simplicity of
Albiol’s approach.
2 SYSTEM ARCHITECTURE
The approach we propose in this paper is conceptually
similar to the one in (Albiol et al., 2009), but
introduces several changes to overcome some limitations
of that method and draws some ideas from the approach
in (Conte et al., 2010).
The first problem addressed is the stability of the
detected corner points. These depend strongly on the
perceived scale of the object considered: the same
object, even in the same pose, yields different
detected corners when its image is acquired from
different distances. This causes problems in at least
two situations. First, when the observed scene contains
groups of people at very different distances from the
camera, a simple proportionality law is not effective
for estimating the number of people, since the average
number of corner points per person differs between
close people and far ones. Second, when the observed
scene contains people walking in a direction that has a
significant component orthogonal to the image plane,
i.e. they are coming closer to the camera or moving
farther from it, the number of corner points for these
people changes even if the number of people remains
constant. To mitigate this problem, as
in (Conte et al., 2010), we adopt the SURF algorithm
proposed in (Bay et al., 2008). SURF is inspired by
the SIFT scale-invariant descriptor (Lowe, 2004), but
replaces the Gaussian-based filters of SIFT with
filters based on Haar wavelets, which are significantly
faster to compute. The interest points found by SURF
are much less dependent on scale (and hence on the
distance from the camera) than those provided by the
Harris detector. They are also rotation invariant,
which is important for the stability of the points
located on the arms and legs of the people in
the scene. The interest points associated with people
are obtained in two steps. First, we determine all the
SURF points within the frame under analysis. Then,
we prune the points not associated with persons by
taking into account their motion information. In
particular, for each detected point we estimate the
motion vector with respect to the previous frame using
a block-matching technique, and we prune the points
with a null motion vector.
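The pruning step can be sketched in Python as follows. This is a minimal illustration and not the implementation used in the paper: the sum-of-absolute-differences cost, the block half-size, the search radius, the border clamping and all function names are assumptions, and the interest points are taken as given (x, y) coordinates instead of being computed by a SURF detector.

```python
def pixel(frame, x, y):
    """Read a grey level, clamping coordinates to the frame borders."""
    h, w = len(frame), len(frame[0])
    return frame[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]


def block_sad(prev_frame, curr_frame, x, y, dx, dy, half):
    """Sum of absolute differences between the block centred at (x, y) in
    the current frame and the block centred at (x + dx, y + dy) in the
    previous frame."""
    total = 0
    for j in range(-half, half + 1):
        for i in range(-half, half + 1):
            total += abs(pixel(curr_frame, x + i, y + j)
                         - pixel(prev_frame, x + dx + i, y + dy + j))
    return total


def motion_vector(prev_frame, curr_frame, x, y, half=1, search=2):
    """Estimate the displacement of the point (x, y) from the previous frame
    to the current one by exhaustive block matching."""
    best = (0, 0)
    best_cost = block_sad(prev_frame, curr_frame, x, y, 0, 0, half)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cost = block_sad(prev_frame, curr_frame, x, y, dx, dy, half)
            # Strict comparison: ties keep the null displacement.
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    # 'best' locates the block in the previous frame; the motion towards
    # the current frame is the opposite displacement.
    return (-best[0], -best[1])


def prune_static(points, prev_frame, curr_frame, half=1, search=2):
    """Keep only the points whose estimated motion vector is not null."""
    return [(x, y) for (x, y) in points
            if motion_vector(prev_frame, curr_frame, x, y, half, search) != (0, 0)]


def blank(width, height):
    return [[0] * width for _ in range(height)]


# Toy frames: one static bright pixel and one pixel moving 2 columns right.
prev_frame = blank(10, 8)
curr_frame = blank(10, 8)
prev_frame[6][7] = curr_frame[6][7] = 200   # static background detail
prev_frame[2][2] = 100                      # moving detail, previous position
curr_frame[2][4] = 100                      # moving detail, current position

moving = prune_static([(4, 2), (7, 6)], prev_frame, curr_frame)
print(moving)
```

Resolving ties in favour of the null displacement keeps points on uniform, static background from being retained by accident.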
The second issue we address in this paper is the
perspective effect: the farther a person is from the
camera, the fewer interest points are detected. As a
consequence, a simple proportionality relation between
the number of detected interest points and the number
of persons in the scene provides acceptable results
only when the average distance of the persons is close
to the reference distance used to determine the
proportionality factor. Otherwise, this approach tends
to overestimate the number of people when they are
close to the camera and to underestimate it when they
are far from the camera.
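The direction of this bias can be made explicit with a short derivation. The notation is introduced here only for illustration and does not appear in the original text: \rho(d) denotes the average number of interest points detected per person at distance d (assumed to decrease with d), d_ref the reference distance, P the number of detected points and N the true number of persons, all assumed to be at roughly the same distance d.

```latex
% Proportionality factor calibrated at the reference distance
k = \frac{1}{\rho(d_{\mathrm{ref}})}, \qquad P \approx \rho(d)\, N
% Resulting estimate of the number of persons
\hat{N} = k\,P \approx \frac{\rho(d)}{\rho(d_{\mathrm{ref}})}\, N
% Since \rho is decreasing in d:
d < d_{\mathrm{ref}} \;\Rightarrow\; \hat{N} > N, \qquad
d > d_{\mathrm{ref}} \;\Rightarrow\; \hat{N} < N
```

Under these assumptions the estimate is unbiased only at d = d_ref, which matches the behaviour described above.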
The authors of (Conte et al., 2010) propose to segment
each single person or small group of persons at
similar distances from the camera by clustering the
detected interest points. The distance of each cluster
from the camera is derived from the position of the
bottom points of the cluster by applying an Inverse
Perspective Mapping (IPM), assuming that the bottom
points of the cluster lie on the ground plane. Then,
the number of persons in each cluster is determined
using an ε-Support Vector Regressor that receives the
number of points of a cluster, the distance and the
VISAPP 2011 - International Conference on Computer Vision Theory and Applications