limitations, including directly intrinsic modifications
of the person model or using information from addi-
tional sources.
Most of the methods in the existing literature are
based on the appearance of the object of interest, i.e.,
a person. Appearance-based approaches can be classified
according to the complexity of the model. Simple
person models define the person as a region or shape,
which can be described by means of a holistic model
(Dalal and Triggs, 2005; Dollár et al., 2012a). Complex
models define the person as a collection of multiple
regions or shapes, i.e., part-based models, which
can be combined in order to be more flexible regard-
ing different poses and to support partial occlusions
(Felzenszwalb et al., 2010; Leibe et al., 2005). How-
ever, even such approaches have difficulties in dense
environments. The Implicit Shape Model (ISM) of
(Leibe et al., 2005) is improved in (Seemann et al.,
2007) by using a probabilistic formulation in order to
generate a model that is scalable from a general ob-
ject–class detector into a specific object–instance de-
tector, thus making the detection more reliable. The
detector in (Felzenszwalb et al., 2010) is improved
in (Girshick et al., 2011) by using a grammar model
which includes an additional “body part” simulating
possible occlusions. Also based on (Felzenszwalb
et al., 2010), in (Tang et al., 2014) a joint model is
proposed, which is trained to detect single people as
well as pairs of people under varying degrees of oc-
clusion.
Other approaches use information external to the
person model in order to increase the detection
performance in crowded scenarios. The most typical
sources include tracking (Garcia-
Martin and Martinez, 2012), motion (Patzold et al.,
2010), depth or 3D information, etc. The use of per-
son density estimation to improve person localization
and tracking performance in crowded scenes is pro-
posed in (Rodriguez et al., 2011). In (Milan et al.,
2014) a continuous energy minimization framework
for multi-target tracking, which includes explicit occlusion
reasoning and appearance modeling, is presented.
In this paper, however, we disregard any additional
improvement that could be achieved by using information
external to the person model and concentrate instead
on the person model itself.
Most closely related to our work is the approach in
(Girshick et al., 2011), which demonstrates the advantages
of accounting in the person model for the
possibility of failure or occlusion of some body parts.
In our case, we do not specifically train the model to
capture specific occlusion patterns. We define a more
generic scheme in which the absence of any partic-
ular body part can be modelled by defining multiple
configurations of the part-based models learned during
the training phase. We can therefore deal with
occlusions by automatically selecting which of the
possible person model configurations best fits each
occlusion. In particular, we solve the problem that
crowded scenarios pose to the approach in (Girshick et al.,
2011): the range of possible occlusions is much larger and,
therefore, the complexity of the grammar model and its
training increases exponentially.
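As a rough illustration of what "multiple configurations" can mean, the sketch below enumerates candidate configurations as subsets of body parts. The part names and the rule that every non-empty subset is a valid configuration are assumptions made for this example only; they are not the configurations defined in this paper.

```python
from itertools import combinations

# Hypothetical part names; the real parts are learned by the detector.
PARTS = ("head", "left_shoulder", "right_shoulder", "torso", "legs")

def enumerate_configurations(parts, min_parts=1):
    """Return every subset of `parts` with at least `min_parts` elements."""
    configs = []
    for k in range(min_parts, len(parts) + 1):
        configs.extend(combinations(parts, k))
    return configs

configs = enumerate_configurations(PARTS)
print(len(configs))  # 2**5 - 1 = 31 non-empty subsets
```

Enumerating subsets in this way requires no retraining, since every configuration reuses parts that were already learned. A head-only or head-and-shoulders model appears here simply as one subset among many.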
There are also other approaches that make use of
person models based only on some parts of the body,
such as the head (Ali and Dailey, 2012) or the head and
shoulders (Zeng and Ma, 2010), since these are the most
visible parts in crowded scenarios. Our solution can
be considered a generalization in which any possible
body part configuration is evaluated, in order to take
advantage not only of these specific simplified models
but also of any other useful configuration.
3 APPROACH
Our proposed approach is based on the detector presented
in (Felzenszwalb et al., 2010) but, instead of
using the confidence provided by each of the individual
body-part detectors for every person candidate, we
define several body-part detector configurations in
order to cope robustly with partial occlusions, which
appear frequently in crowded scenarios.
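A minimal sketch of how such configurations might be evaluated for one person candidate is given below. The scoring rule (mean part confidence over the parts in a configuration), the part names, and the confidence values are all assumptions chosen for illustration; they are not taken from this paper.

```python
def best_configuration(part_scores, configurations):
    """Pick the configuration with the highest mean part confidence.

    part_scores: dict mapping part name -> detector confidence.
    configurations: iterable of tuples of part names.
    """
    def mean_score(config):
        return sum(part_scores[p] for p in config) / len(config)

    best = max(configurations, key=mean_score)
    return best, mean_score(best)

# Hypothetical confidences for a person whose lower body is occluded.
scores = {"head": 0.9, "torso": 0.7, "legs": -0.8}
configs = [("head", "torso", "legs"), ("head", "torso"), ("head",)]
config, score = best_configuration(scores, configs)
print(config, round(score, 2))  # ('head',) 0.9
```

A plain mean tends to favor small configurations, as the example shows, so a practical scheme would probably weight or penalize configurations that rely on very few parts.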
3.1 Base Algorithm
The detector in (Felzenszwalb et al., 2010) is a part-based
person model. It consists of mixtures of multi-scale
deformable part models in a star structure defined
by a root model, where the root and each of
the deformable body parts are modeled by HOG
features, as first proposed in (Dalal and Triggs, 2005).
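For readers unfamiliar with HOG, the sketch below computes the basic building block of the descriptor: a magnitude-weighted histogram of unsigned gradient orientations over one cell. It is a simplified illustration under stated assumptions (central differences, hard binning into 9 bins, border pixels skipped, no bin interpolation and no block normalization), not the exact feature pipeline of (Dalal and Triggs, 2005) or (Felzenszwalb et al., 2010).

```python
import math

def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    cell: list of rows of pixel intensities. Gradients use central
    differences on interior pixels; border pixels are skipped.
    """
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle // bin_width), n_bins - 1)] += magnitude
    return hist

# An 8x8 cell with a horizontal intensity ramp: every gradient points along x,
# so all the weight falls into the first orientation bin.
cell = [[float(x) for x in range(8)] for _ in range(8)]
hist = cell_orientation_histogram(cell)
```

In a full descriptor, histograms like this one are computed over a grid of cells, normalized, and concatenated into the feature vector on which the root and part filters operate.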
The detector proposed in (Felzenszwalb et al.,
2010) defines N body parts positioned around the root
filter (n = 0), which models the appearance of the
whole body. The N body parts are computed at twice
the resolution in relation to the root filter in order
to refine the detection based only on the root information.
Each of the detectors, including the root
($n = 0, \ldots, N$), is modeled by a 3-tuple $(F_n, v_{n,0}, d_n)$,
where $F_n$ is the HOG filter response (detection confidence)
for part $n$; $v_{n,0}$ is a two-dimensional vector
defining the relative position of part $n$ with respect
to the anchor position $(x_0, y_0)$ of the root; and $d_n$ is
a four-dimensional vector specifying coefficients of a
quadratic function defining the cost for each possible
SIGMAP 2014 - International Conference on Signal Processing and Multimedia Applications