manan, 2006; Eichner and Ferrari, 2009; Andriluka
et al., 2009) build upon the PS framework and use
standard-sized, template-based part detectors to
approximately locate parts in the image. These part
detectors are trained separately for each part from
the training dataset. Parts, particularly the lower and
upper arms, have a cylindrical shape and can take on
many apparent shapes depending on their configuration.
When a person appears in a fronto-parallel plane,
standard-sized part detectors are sufficient for correct
localization. However, when certain parts, such as the
lower arms, move out of the plane, they appear
foreshortened and the standard detectors produce
erroneous detections.
One can search over foreshortening during part
detection, but the state space of each part (the number
of different configurations) then grows so much that
computing the pairwise constraints becomes impractical.
An example image is shown in Figure 1, where the lower
left arm is estimated incorrectly due to foreshortening.
In our approach, we introduce a few levels of
foreshortening during part detection, and we propose an
effective method to prune the state space of each part.
Our method localizes parts better than the standard-sized
template-based methods and thus gives better results on
challenging images.
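As a minimal sketch of the two ideas in this paragraph, the snippet below generates part-template lengths for a few foreshortening levels and then prunes a part's candidate states by detector score. The level values (1.0, 0.8, 0.6), the top-k pruning rule, and the function names are assumptions chosen for illustration only; they are not the settings or the pruning method of this paper.

```python
import numpy as np

def foreshortened_lengths(full_length, levels=(1.0, 0.8, 0.6)):
    """Template lengths (in pixels) for a few hypothetical foreshortening levels."""
    return [max(1, int(round(full_length * s))) for s in levels]

def prune_candidates(scores, keep=3):
    """Indices of the `keep` highest-scoring candidate states, best first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # descending by score
    return order[:keep].tolist()

# Each extra level multiplies the number of states per part, which is
# why some form of pruning is needed before evaluating pairwise terms.
lengths = foreshortened_lengths(40)
best = prune_candidates([0.1, 0.9, 0.4, 0.7], keep=2)
```

With a fixed budget of `keep` states per part, the cost of evaluating a pairwise term between two parts stays bounded regardless of how many foreshortening levels are searched.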
Furthermore, in everyday images, we often observe
color similarity between different parts of the human
body, both with and without clothing. For instance, the
left and right upper arms have similar color irrespective
of the person's clothing and gender. We propose to
exploit these similarities by adding two color similarity
constraints, one between the left and right upper arms
and one between the left and right lower arms, and show
that these constraints improve pose estimation when
considered simultaneously with the kinematic constraints.
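One simple way to quantify such left-right color similarity is histogram intersection between the image patches of the two parts. The sketch below is illustrative only: the joint RGB histogram, the bin count, and the function names are assumptions, and the paper's actual color constraint may be defined differently.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Normalized joint RGB histogram of an (H, W, 3) patch with values in 0..255."""
    pixels = np.asarray(patch, dtype=float).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / max(hist.sum(), 1.0)

def color_similarity(patch_a, patch_b, bins=8):
    """Histogram intersection in [0, 1]; 1 means identical color distributions."""
    h_a = color_histogram(patch_a, bins)
    h_b = color_histogram(patch_b, bins)
    return float(np.minimum(h_a, h_b).sum())

# Two hypothetical arm patches: same color gives 1, disjoint colors give 0.
red = np.zeros((4, 4, 3), dtype=np.uint8)
red[..., 0] = 200
blue = np.zeros((4, 4, 3), dtype=np.uint8)
blue[..., 2] = 200
```

A score of this kind could serve as an additional pairwise term between the left and right arm candidates, alongside the kinematic terms.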
Our contributions are the following: (1) we compensate
for foreshortening in the parts, especially the lower
and upper arms; (2) we exploit color similarity between
the left and right lower arms and between the left and
right upper arms, and show better results than the simple
PS framework; (3) we present a simple and effective
method to reject part candidates that are unlikely to be
true parts; (4) we produce better results for the lower
arms and comparable results for the other parts on two
challenging datasets (Buffy V3.01 and PASCAL Stickmen
V1.1).
In the rest of this paper, we first describe the related
work in Section 2 and give a brief overview of the
pictorial structures framework and its limitations in
Section 3. We then give a detailed description of our
framework in Section 4, followed by the inference step in
Section 5. Finally, we present our experiments and results
in Section 6 and conclude in Section 7.
2 RELATED WORK
There has been a lot of research on human pose
estimation in the last four decades. We focus on the
methods that overlap with our approach. First, (Fischler
and Elschlager, 1973) proposes the pictorial structure
(PS) model, and (Felzenszwalb and Huttenlocher, 2005)
proposes an efficient inference method for tree-based
models that use Gaussian priors for the kinematic
constraints. (Andriluka et al., 2009) builds upon the PS
framework and uses discriminatively trained part detectors
for the unary potentials. (Ramanan and Sminchisescu, 2006)
proposes an advanced method of learning PS parameters that
maximizes the conditional likelihood of the parts and
captures more complex inter-part interactions than
Gaussian priors; we use this method to train our
kinematic constraints.
Along with the kinematic constraints, a few methods
use inter-part color similarity for better localization
of the parts. For instance, (Eichner and Ferrari, 2009)
uses location priors in the window output of a person
detector, along with the appearance information, to
initialize the unary potentials for the standard pictorial
structure model. (Sapp et al., 2010b) filters out less
probable part locations by using a cascade of pictorial
structures, and uses richer appearance models only at a
later stage on a much smaller set of locations. The
disadvantage of this approach is that the correct part
locations may be lost if only the kinematic constraints
are considered in the initial stages of the cascade. We,
on the other hand, directly include the color similarity
constraints in the graph and enforce them throughout the
inference stage.
There are a few other approaches that use different
methods to obtain precise part locations. For instance,
(Gupta et al., 2008) models self-occlusion,
(Karlinsky and Ullman, 2012) models the appearance of
the links that connect two parts, and (Yang and Ramanan,
2011) proposes a general flexible mixture model that
augments standard spring models and is able to capture
more complex configurations of parts.
3 PICTORIAL STRUCTURE (PS) REVIEW
In this section, we provide a brief overview of the PS
framework (Felzenszwalb and Huttenlocher, 2005)
VISAPP 2014 - International Conference on Computer Vision Theory and Applications