Revisiting Pose Estimation with Foreshortening Compensation and Color

Information

Achint Setia, Anoop R. Katti and Anurag Mittal

Department of Computer Science & Engineering, Indian Institute of Technology Madras, Chennai, India

Keywords:

Upper Body Pose Estimation, Foreshortening Compensation, Part Based Model, Loopy Belief Propagation,

Color Similarity.

Abstract:

This paper addresses the problem of upper body pose estimation. The task is to detect and estimate 2D

human conﬁguration in static images for six parts: head, torso, and left-right upper and lower arms. The

common approach to solve this has been the Pictorial Structure method (Felzenszwalb and Huttenlocher,

2005). We present this as a graphical model inference problem and use the loopy belief propagation algorithm

for inference. When a human appears in fronto-parallel plane, ﬁxed size part detectors are sufﬁcient and give

reliable detection. But when parts like lower and upper arms move out of the plane, we observe foreshortening

and the part detectors become erroneous. We propose an approach that compensates foreshortening in the

upper and lower arms, and effectively prunes the search state space of each part. Additionally, we introduce

two extra pairwise constraints to exploit the color similarity information between parts during inference to get

better localization of the upper and lower arms. Finally, we present experiments and results on two challenging

datasets (Buffy and ETHZ Pascal), showing improvements on the lower arms accuracy and comparable results

for other parts.

1 INTRODUCTION

This paper addresses the problem of upper body pose

estimation. The task is to detect and estimate 2D hu-

man conﬁguration in static images for six parts: head,

torso, left and right upper and lower arms. This is

a core problem in computer vision, and it is critical

for many applications such as human computer in-

teraction, image understanding, activity recognition,

etc. There are many representations for pose, among

which the stickman notation (the parts are labeled

with different line segments) is common. An exam-

ple of pose estimation task with stickman notation is

given in Figure 1.

The common approach to pose estimation, in the

last decade, has been the Pictorial Structures(PS)

model (Felzenszwalb and Huttenlocher, 2005) that

is based on local appearance of the parts and kine-

matic constraints (visualized as springs) on the pairs

of parts: parts are parameterized by pixel location

and orientation. The part appearance models are usu-

ally simple linear ﬁlters on edges, color and location

(Andriluka et al., 2009; Ramanan and Sminchisescu,

2006), and kinematic constraints are image indepen-

dent deformable costs that force two adjacent parts to

Figure 1: Stickman notation for upper body pose estimation

and the problem of foreshortening in the left lower arm (best

viewed in color).

be together. The framework is powerful and general,

yet it is a simple generative model that allows for ef-

ﬁcient and exact inference of the human pose conﬁg-

uration.

In our approach, we use a graphical model repre-

sentation of the upper body where vertices represent

the parts location and edges represent the pairwise

constraints between parts, and we perform inference

using the loopy belief propagation algorithm (Koller

and Friedman, 2009).

Many recent approaches (Sapp et al., 2010b; Ra-

Setia A., R. Katti A. and Mittal A..

Revisiting Pose Estimation with Foreshortening Compensation and Color Information.

DOI: 10.5220/0004669300310038

In Proceedings of the 9th International Conference on Computer Vision Theory and Applications (VISAPP-2014), pages 31-38

ISBN: 978-989-758-004-8

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

manan, 2006; Eichner and Ferrari, 2009; Andriluka

et al., 2009) build upon the PS framework, and use

standard sized template-based part-detectors to ap-

proximately locate parts in the image. These part

detectors are separately trained for each part from

the training dataset. We can observe that parts, par-

ticularly the lower and upper arms, have cylindrical

shape and can depict many shapes depending on their

conﬁgurations. When a person appears in fronto-

parallel plane, standard sized part detectors are sufﬁ-

cient for correct localization. But, when certain parts

like lower arms move out of the plane, we observe

foreshortening and the standard detectors produce er-

roneous detections.

One can search for the foreshortening during part

detection, but the state space of each part (number of

different conﬁgurations) increases in such a way that

it becomes impractical to compute the pairwise con-

straints. (An example image is shown in Figure 1,

we can observe the wrong estimation of the lower left

arm due to foreshortening.) In our approach, we intro-

duce few levels of foreshortening when we perform

parts detection, and we propose an effective method

to prune the state space of each part. Our method

shows better localization for parts than the standard

sized template-based methods and thus gives better

results on challenging images.

Furthermore, in day to day images, we often ob-

serve color similarity between different parts of hu-

man body in both the presence as well as the absence

of clothes. For instance, left and right upper arms

have similar color irrespective of person clothing and

gender. We propose to exploit these color similarities

by adding two color similarity constraints between the

upper left-right arms pair and the lower left-right arms

pair, and show that these constraints improve pose

estimation when considered simultaneously with the

kinematic constraints.

Our contributions are the following: (1) we com-

pensate foreshortening in the parts, especially lower

and upper arms; (2) we exploit color similarity be-

tween left-right lower and upper arms and show better

results than the simple PS framework; (3) we present

a simple and effective method to reject part candidates

that are unlikely to be true part candidates; (4) we pro-

duce better results for the lower arms and comparable

results for other parts on the two challenging datasets

(Buffy V3.01 and PASCAl Stickmen V1.1).

In the rest of this paper, we ﬁrst describe the re-

lated work in Section 2, and a brief overview of the

pictorial structures and its limitations in Section 3.

Then, we give detailed description of our framework

in Section 4, followed by the inference step in Sec-

tion 5. Finally, we show our experiments and results

in Section 6, and conclude in Section 7.

2 RELATED WORK

There has been a lot of research on human pose esti-

mation in the last four decades. We focus on the meth-

ods that overlap with our approach. First, (Fischler

and Elschlager, 1973) proposes the pictorial struc-

ture (PS) model, and (Felzenszwalb and Huttenlocher,

2005) proposes an efﬁcient inference method focus-

ing on tree-based models that use Gaussian priors for

the kinematic constraints. (Andriluka et al., 2009)

builds upon the PS framework and uses discrimina-

tively trained part detectors for unary potentials. (Ra-

manan and Sminchisescu, 2006) proposes an advance

method of learning PS parameters that maximizes the

conditional likelihood of the parts, and captures more

complex inter-part interactions than Gaussian priors,

which we also use to train our kinematic constraints.

Along with the kinematic constraints, there have

been a few methods that use inter-part color similar-

ity for better localization of the parts. For instance,

(Eichner and Ferrari, 2009) uses Location Priors in

the window output of a person detector along with

the appearance information to initialize the unary po-

tentials for standard pictorial structure model. (Sapp

et al., 2010b) ﬁlters out less probable part locations

by using a cascade of pictorial structures, and uses

richer appearance models only at a later stage on

much smaller set of locations. The disadvantage with

this approach is that one might lose the correct lo-

cations for parts if he considers only the kinematic

constraints in the initial stages of the cascade. We,

on the other hand, directly include the constraints in

the graph and enforce them throughout the inference

stage.

There are few other approaches that use different

methods to get precise location of the parts. For in-

stance, (Gupta et al., 2008) models self-occlusion to

get precise location of the parts, (Karlinsky and Ull-

man, 2012) models the appearance of links that con-

nect two parts, and (Yang and Ramanan, 2011) pro-

poses a general ﬂexible mixture model that augments

standard spring models and is able to capture more

complex conﬁgurations of parts.

3 PICTORIAL STRUCTURE (PS)

REVIEW

In this section, we provide a brief overview of the PS

framework (Felzenszwalb and Huttenlocher, 2005)

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

followed by shortcomings of the PS and related ap-

proaches.

The human body is treated as an articulated struc-

ture of parts and represented using a graphical model

G = (V,E). Each node in G represents a part location

and each edge represents the kinematic constraint be-

tween physically connected pair of parts. The loca-

tion for i

part is given by l

= [x

,θ

, f

], where

) is the position of the part in image, θ

, f

are

orientation, foreshortening, and scale of the i

part

respectively. The conﬁguration, with n parts and an

image I, is represented by L = {l

},i = [1 . ..n], and

the posterior probability is written as:

P(L|I,Θ) ∝ exp

∑

φ(I|l

,Θ) +

∑

(i, j∈E)

ψ(l

)

(1)

Here the unary potentials φ(I|l

,Θ) provide the lo-

cal image evidence for the i

part located at l

with

learned appearance model Θ, and the pairwise poten-

tials ψ(l

) provide priors on the relative position of

parts enforcing kinematic constraints between them

(e.g. the lower arm must be attached to the upper

arm). The graph G is a tree with only unary and kine-

matic potentials, and the exact inference of the Max-

imum a posteriori (MAP) estimate can be performed

using dynamic programming in O(n×h

) time, where

n is the number of parts and h is the number of states

(state space size) for each part. The time complex-

ity is further reduced to O(n × h) by using Gaussian

priors and distance transform for the kinematic con-

straints computation (Felzenszwalb and Huttenlocher,

2005).

Limitations of PS and Related Approaches. The

recent approaches (Sapp et al., 2010b; Ramanan,

2006; Eichner and Ferrari, 2009; Andriluka et al.,

2009; Karlinsky and Ullman, 2012) consider parts

as rigid rectangular templates, and neglect foreshort-

ening in the upper and lower arms. Next, the pair-

wise kinematic constraints are usually modeled as

unimodal Gaussians (Felzenszwalb and Huttenlocher,

2005; Andriluka et al., 2009), which cannot capture

the true multinomial nature of interactions between

parts. Finally, few approaches (Felzenszwalb and

Huttenlocher, 2005; Andriluka et al., 2009) do not

utilize obvious image cues such as color similarity be-

tween the left and right arms during pose estimation.

4 OUR FRAMEWORK

Considering the limitations mentioned in Section 3,

we propose that the foreshortening search and the

color similarity constraints in the upper and lower

Figure 2: Flowchart of our algorithm.

arms are crucial for better localization of parts in any

upper body pose estimation method.

We implement the following in our framework:

to overcome the foreshortening problem, we add a

foreshortening search parameter f

for the upper and

lower arms (details in Section 4.2); to capture more

complex distributions for the kinematic constraints,

we use non-parametric distribution similar to (Ra-

manan, 2006) (details in Section 4.4); and to utilize

image color similarity cues during inference, we add

two new pairwise constraints: (1) upper left and right

arms, (2) lower left and right arms (details in Section

4.5).

Please note that after we add two new color sim-

ilarity constraints, we introduce cycles in the graph

G, and we can not perform MAP estimation using

dynamic programming. Instead, we ﬁrst reduce the

search space by preprocessing the image (details in

Section 4.3), and then use the loopy belief propaga-

tion framework thus obtaining the approximate ﬁnal

marginals for each part (details in Section 5).

Now, we represent the color similarity constraints

by edges C, the new posterior probability can be writ-

ten as:

RevisitingPoseEstimationwithForeshorteningCompensationandColorInformation

P(L|I, Θ) ∝ exp

∑

φ(I|l

,Θ) +

∑

(i, j∈E)

ψ(l

)

∑

(i, j∈C)

ω(l

)

(2)

where ω(l

) is the color similarity measure between

the part patches at locations l

and l

respectively. We

present the graphs with and without color similarity

constraints in Figure 5.

We present the high level description of our

full upper body pose estimation algorithm through a

ﬂowchart in Figure 2, and we describe each step in the

following sections.

(a) (b)

Figure 3: Preprocessing on the test image (best viewed in

color). (a) Upper body detection with shoulder regions

drawn in green and blue rectangles. (b) Foreground high-

lighting.

4.1 Preprocessing

The starting step of our algorithm is preprocessing on

the test image. In this step, we perform two operations

similar to (Eichner and Ferrari, 2009): upper body

detection, and foreground highlighting.

We use the Calvin upper body detector (Eichner

and Ferrari, 2009) that is based on the Histogram of

Oriented Gradients (HoG) features (Dalal and Triggs,

2005), the part based deformable models (Felzen-

szwalb et al., 2008), and the Haar cascade based face

detector (Viola and Jones, 2001). Next, we perform

foreground highlighting with the help of upper body

detection box and Grabcut (Rother et al., 2004). We

refer the reader to (Eichner and Ferrari, 2009) for

further details on the upper body detection and fore-

ground highlighting.

The upper body detector plays a crucial role in our

algorithm: it ﬁnds the locations of upright people in

images, helps reducing search space of body parts,

and provides scale information to normalize the scale

of the test image. Foreground mask further helps in

rejecting part candidates that are unlikely to be body

Figure 4: Foreshortening compensation (best viewed in

color).(a) Correct lower right arm candidate that has suf-

fered from foreshortening. (b) Rotated image at −θ

the part detection stage. (c) Vertically stretched image by

1/ f

. (d) Enlarged positive HoG weights of detector for

lower right arm. (e) Enlarged image patch from (c) notice

now detector will score higher when run on this patch.

parts. We explain the details of part candidates rejec-

tion in Section 4.3, where we utilize both the upper

body detection and foreground information. As an ex-

ample, we show upper body detection and foreground

highlighting on a sample test image in Figure 3.

4.2 Part Detection and Foreshortening

Compensation

After we perform the preprocessing, we have the ap-

proximate scale of the upper body from the upper

body detection box. We resize the test image to the

standard size on which our part detectors are trained,

and ﬁx the scale parameter s

= 1 for all parts loca-

tions l

= [x

,θ

, f

] during detection.

Next, we run the trained part detectors separately

for all six parts on the test image. Part detection score

or unary potential at location l

for the i

part gives

the evidence of how good the match is between the

image patch at location l

and the i

part. Formally,

we compute part detection score at location l

as:

φ(I|l

,Θ) = D

· H(l

) (3)

where D

is the trained part detector for the i

part

that has width w

and height h

, and H(l

) is the

HoG feature vector of the image patch at the location

having the same dimensions as D

During detection, as we mentioned before, body

parts have cylindrical shape and they are likely to suf-

fer from foreshortening. Foreshortening is different

from scale as it only affects length of the object while

the width remains the same. (e.g. an arm pointing

towards camera will have shorter length but the same

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

width compared to an arm in frontal plane.) To solve

this, we introduce a foreshortening search parameter

∈ [0.6, 0.8, 1.0] during part detection, and we run

part detectors on the input test image at different ori-

entations and foreshortening levels.

Training separate arm detectors for each orienta-

tion and foreshortening level is a tedious task: one has

to search for the training samples of different orienta-

tions and foreshortening levels in the provided train-

ing set. To avoid this, we use part detectors of ﬁxed

size, but we rotate and stretch the image by −θ

and

1/ f

respectively and then apply the detector. In this

way, we only have to train a single part detector for

each part.

We present an example in Figure 4 to provide a

clear visualization. Let us assume that the desired

candidate is the right lower arm having orientation

and foreshortening level f

(marked by red rect-

angle in Figure 4a), and our lower arm detector has

dimensions equal to the red rectangle in Figure 4c.

When checking for θ

orientation and f

foreshorten-

ing level, we rotate and stretch the image vertically by

−θ

and 1/ f

respectively. So that the detector gives

an appropriate high score for the desired candidate.

4.3 Parts Candidate Rejection

After we add an additional foreshortening search pa-

rameter f

in the part detection step, the state space for

each part increases. For instance, for a typical image

of size 100 × 100, if we compute part detectors for 24

orientations and 4 foreshortening levels, then the state

space for each part is h = 24 × 4 × 10

≈ 10

. In the

later stage, since we are using non-parametric kine-

matic constraints, pairwise potentials computation re-

quire large number of computations O(h

) ≈ 10

Therefore, it is essential to prune the state space for

each part before one computes pairwise constraints.

We assume that the part detection scores at the

right locations are higher than their neighboring

scores (which generally holds), and we sample only

the local maxima points from the part detector re-

sponse over the location l

= [x

,θ

, f

] for all six

parts. In this way, the points are not rejected even if

they have low absolute detection scores, which might

occur due to effects like bad illumination, contrast,

blur etc., as long as they possess a higher value within

their neighbourhood. After this pruning, we generally

have thousands of part candidates for each part.

Then, we utilize the upper body detection box to

prune the state space for the head and torso only. We

reject all candidates that do not overlap with the upper

body detection box. We have only a few head and

torso candidates after this step.

For upper arms, it can be noted that if the per-

son is frontal upright in the image, then the shoulders

tend to be in constant regions of upper body detec-

tion box. We call these regions as shoulder regions,

and heuristically deﬁne their location relative to up-

per body detection box. If the upper body detection

box is deﬁned as UB = [x

,w,h], where (x

) is

the top-left point and w,h are its width and height re-

spectively, we deﬁne the left shoulder region (LSh)

and the right shoulder region (RSh) as:

LSh =[x

,(y

+ 0.4h), 0.4w,0.5h]

RSh =[(x

+ 0.6w), (y

+ 0.4h), 0.4w,0.5h]

(4)

We exploit these shoulder regions to reduce both the

right and left upper arms candidates: we reject all the

candidates that lie outside the respective shoulder re-

gions, and usually after this, we have only few hun-

dred valid upper arms candidates. As an example in

Figure 3(a), shoulder regions are marked in green and

blue rectangles.

Finally, we use foreground information from the

preprocessing step to reduce the number of candidates

of the lower and upper arms. We keep a part candidate

if it lies on the foreground or it has a score higher

than a threshold t

, and reject it otherwise. After this

ﬁnal step, we usually have less than one thousand part

candidates for the lower arms.

4.4 Kinematic Constraints

Once we have the selected part candidates, we are

ready to compute the kinematic constraints that force

pairs of connected parts to stay together. These can

also be visualized as spring-like connections. For ex-

ample, the upper arms are attached to the torso, the

head is attached to the torso and so on.

We use discrete binning of the relative arrange-

ment of parts for the kinematic constraints similar to

(Ramanan, 2006). These constraints are between two

part patches located at l

and l

, and have the form:

ψ(l

) = α

bin(l

− l

) (5)

where bin(.) is the vectorized count of spatial and

angular histogram bins, and α

is a model parame-

ter that favors certain relative spatial and angular bins

between part patches located at l

and l

. The reason

for using these over Gaussian priors is that they cap-

ture more complex distributions. We learn α

from the

training set with the method speciﬁed in (Ramanan

and Sminchisescu, 2006).

4.5 Color Similarity Constraints

While the kinematic constraints are independent of

the image and force pairs of parts to stay together,

RevisitingPoseEstimationwithForeshorteningCompensationandColorInformation

Figure 5: Human upper body pose estimation graph (best

viewed in color). (a) Graph G with only unary potentials

and kinematic constraints ψ(l

). (b) Graph G

with addi-

tional color similarity constraints ω(l

the color similarity constraints force the pair of parts

to have similar color. These constraints encourage

pairs of part candidates that have similar colors and

discourage others that have different colors. For ex-

ample, the true candidates of left and right upper arms

will have similar color and they will deﬁnitely differ

from a background candidate in color.

We calculate color similarity between a pair of

part patches located at l

and l

by taking the neg-

ative of the modiﬁed Chi-squared distance (χ

) be-

tween their color histograms.

ω(l

) =

∑

− h

)

(6)

where h

, h

, and H are the concatenated histograms

of normalized red and green channels over the part

patches located at l

, l

, and the entire image respec-

tively, and h

, h

, and H

are the k

bin value of the

corresponding histogram.

We use histograms of normalized red and green

channel because they provide illumination invariance

even if the patches are widely distant and have dif-

ferent illumination properties. And, we divide with

the global histogram because it gives higher weight to

sparsely observed color values than frequently occur-

ring color values.

5 INFERENCE USING LOOPY

BELIEF PROPAGATION

After we compute all the pairwise constraints (kine-

matic and color similarity), we advance to the ﬁnal

inference step. When there are only unary potentials

(part detection scores) and kinematic constraints, the

graph G is a tree, but as soon as we add two color

similarity constraints, we introduce cycles in G. We

show the graph G having only unary potentials and

Figure 6: PCP curves with 1 and 3 foreshortening levels

with kinematic and color similarity constraints.

kinematic constraints in Figure 5a, and G

with addi-

tional color similarity constraints in Figure 5b.

Now since the graph G

has cycles, we cannot per-

form MAP estimation using dynamic programming.

Instead, we run the loopy belief propagation algo-

rithm (Koller and Friedman, 2009) on the selected

part candidates with the pairwise constraints. The

loopy belief propagation algorithm optimizes for the

posterior probability marginals of each part: the parts

interact with each other via belief messages. The mes-

sage from a node s with variable X to a node t with

variable Y is given by:

(Y ) =

∑

φ(X) × ζ(X,Y ) × ¯m(X) (7)

where ¯m(X) are the incoming messages at node s ex-

cluding the message from node t, φ(X) is the unary

potentials at node s, and ζ(X,Y ) are the pairwise po-

tentials (kinematic or color similarity) between the

variables X and Y . The algorithm passes the mes-

sages until they converge (have same value in two

consecutive iterations), and in the end, we get ap-

proximate marginals for each part. To get the best

match, we choose the top candidates among the re-

sultant marginals of all six parts for ﬁnal evaluation.

6 EXPERIMENTS

We evaluate our approach on the Buffy Stickmen

v3.01 (Ferrari et al., 2008) and ETHZ PASCAL Stick-

men v1.1 (Eichner and Ferrari, 2009) datasets. We

provide our implementation details in the following

section.

6.1 Implementation Details

We use separately learned part detectors of (Sapp

et al., 2010a) for all six parts. These are Gentleboost

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

Table 1: Experiments on Buffy Stickmen V3.01 at PCP

0.5

(all results are in percentage). Using 3 foreshortening level produces

better results. Color similarity constraints improve the accuracy. See text for details.

Method Torso Head Upper Arms Lower Arms Total

1 foreshortening + kinematic 100 98.5 92 56.6 82.6

1 foreshortening + kinematic + color sim. 100 98.5 92.2 57.3 83.0

3 foreshortening + kinematic 100 98.5 91.3 67.8 86.1

3 foreshortening + kinematic + color sim. 100 98.5 91.9 68.1 86.4

Table 2: Comparison to other methods at PCP

0.5

(all results are in percentage). See text for details.

Results on Buffy Stickmen V3.01 Dataset

Method Torso Head Upper Arms Lower Arms Total

(Andriluka et al., 2009) 90.7 95.5 79.3 41.2 73.5

(Eichner and Ferrari, 2009) 98.7 97.9 82.8 59.8 80.1

(Karlinsky and Ullman, 2012) 99.6 99.6 93.2 60.6 84.5

(Sapp et al., 2010b) 100 96.2 95.3 63.0 85.5

(Sapp et al., 2010a) 100 100 91.1 65.7 85.9

Ours 100 98.5 91.9 68.1 86.4

Results on PASCAL Stickmen V1.1 Dataset

(Sapp et al., 2010b) 99.3 88.1 79.0 49.3 74.0

(Eichner and Ferrari, 2009) 97.2 88.6 73.8 41.5 69.3

(Karlinsky and Ullman, 2012) 98.8 97.3 81.6 47.0 75.5

Ours 96.9 84.5 81.0 45.0 72.1

classiﬁers (Friedman et al., 2000) on HoG based fea-

tures (Dalal and Triggs, 2005). The ﬁlter template

size for arms (upper and lower) is 72 × 36, for head it

is 45 × 45, and for torso it is 100 × 90.

In these datasets, all the images have front up-

right people, so we run the torso detector for only

the vertical orientation, head detector for the verti-

cal and the 2 nearby orientations (0 ± 15

◦

), arm de-

tectors (all 4 types) for 3 foreshortening levels f

∈

[0.6,0.8,1.0] and evenly divided 24 orientations θ

∈

[0,15,.. . ,360]

◦

. We choose threshold t

= 1.0 in the

part candidate rejection stage, and use k = 16 bins

for the histograms in the color similarity computa-

tion. We normalize all unary and pairwise potentials

between [0,1] before the inference stage.

6.2 Results

Evaluation Measure. The criterion for correct pose

estimation from (Ferrari et al., 2008), called the

Percentage of Correctly estimated body Parts (PCP)

is the following: an estimated body part is considered

correct if its segment endpoints lie within r% of

the length of the ground-truth segment from their

annotated location. Commonly r = 50% is chosen

for reporting the results on these datasets.

Results on Buffy Stickmen v3.01. This dataset is

quite challenging due to many uncontrolled condi-

tions such as very cluttered images, dark illumination,

and people wearing clothing of different kind and

color. It has 5 seasons among which images from

seasons 3 and 4 are used for training, and images

from seasons 2,5, and 6 are used for testing. There are

276 testing images, out of which only 259 (93.48%)

give the correct upper body detection. We report our

experiments on the Buffy Dataset in Table 1. We

can see in ﬁrst and third rows of Table 1 that using 3

foreshortening levels over 1 produces better results

on average: the results are better for the lower arms

since foreshortening is mostly present in them than

the upper arms. Also, we get slightly better results

with two additional color similarity constraints than

using just the kinematic constraints. We plot the two

PCP curves in Figure 6 comparing results with 1 and

3 foreshortening levels, and we compare our results

with others in Table 2(upper). As shown, we perform

comparably well with (Sapp et al., 2010a; Sapp et al.,

2010b), improving over the lower arms accuracy.

Results on ETHZ PASCAL Stickmen v1.1. This

dataset is even more challenging as it has real world

low quality images with different illumination. We

use upper body detections provided by the dataset

and report results (as others) on only 412 test images.

We compare our results with other methods in Table

2(lower). Please note that these results are given in

(Karlinsky and Ullman, 2012) with training on the

PASCAL dataset itself. We did not re-train our part

detectors on this data set, instead we used old our part

RevisitingPoseEstimationwithForeshorteningCompensationandColorInformation

Figure 7: Sample results (best viewed in color). Buffy Stickmen V3.01 (left) ETHZ PASCAL Stickmen V1.01 (right).

detectors from Buffy dataset that are trained on sea-

sons 3,4 only. We get detection rate of 71% with our

algorithm, and as we can see, we get comparable re-

sults for the torso and lower arms, and good results

for the upper arms. We show sample results in Figure

7 CONCLUSIONS

In this paper, we have presented a fully automated

upper body pose estimation algorithm. Our algo-

rithm works with uncontrolled images with the only

assumption that the person is upright in the image,

and can easily be extended to the full body pose es-

timation. We have proposed a method to compen-

sate foreshortening in highly variable parts such as

the upper and lower arms, and a method that effec-

tively prunes the search state space of all the parts.

Additionally, we have added pairwise color similarity

constraints between the upper left-right and the lower

left-right arms pairs along with kinematic constraints,

to utilize the image cues for better localization of the

parts, and we have used loopy belief propagation al-

gorithm for the inference. We have shown experimen-

tally that better results can be achieved with our pro-

posed foreshortening compensation and color infor-

mation utilization. We have presented results on the

two challenging datasets with improvements on the

lower arms and comparable results for other parts.

REFERENCES

Andriluka, M., Roth, S., and Schiele, B. (2009). Pictorial

structures revisited: People detection and articulated

pose estimation. In Proc. CVPR 2009. IEEE.

Dalal, N. and Triggs, B. (2005). Histograms of oriented

gradients for human detection. In Proc. CVPR 2005.

IEEE.

Eichner, M. and Ferrari, V. (2009). Better appearance mod-

els for pictorial structures. In Proc. BMVC 2009.

British Machine Vision Association.

Felzenszwalb, P., McAllester, D., and Ramanan, D. (2008).

A discriminatively trained, multiscale, deformable

part model. In Proc. CVPR 2008.

Felzenszwalb, P. F. and Huttenlocher, D. P. (2005). Pictorial

structures for object recognition. IJCV 2005.

Ferrari, V., Marin-Jimenez, M., and Zisserman, A. (2008).

Progressive search space reduction for human pose es-

timation. In Proc. CVPR 2008. IEEE.

Fischler, M. A. and Elschlager, R. A. (1973). The repre-

sentation and matching of pictorial structures. IEEE

Transactions on Computers 1973.

Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive

logistic regression: a statistical view of boosting (with

discussion and a rejoinder by the authors). The Annals

of Statistics.

Gupta, A., Mittal, A., and Davis, L. S. (2008). Constraint in-

tegration for efﬁcient multiview pose estimation with

self-occlusions. PAMI 2008.

Karlinsky, L. and Ullman, S. (2012). Using linking fea-

tures in learning non-parametric part models. In Proc.

ECCV 2012. Springer Berlin Heidelberg.

Koller, D. and Friedman, N. (2009). Probabilistic graphical

models : principles and techniques. MIT Press.

Ramanan, D. (2006). Learning to parse images of articu-

lated bodies. In Proc. NIPS 2006.

Ramanan, D. and Sminchisescu, C. (2006). Training de-

formable models for localization. In Proc. CVPR

2006. IEEE.

Rother, C., Kolmogorov, V., and Blake, A. (2004). ”grab-

cut”: interactive foreground extraction using iterated

graph cuts. ACM Trans. Graph.

Sapp, B., Jordan, C., and Taskar, B. (2010a). Adaptive pose

priors for pictorial structures. In Proc. CVPR 2010.

IEEE.

Sapp, B., Toshev, A., and Taskar, B. (2010b). Cascaded

models for articulated pose estimation. In Proc. ECCV

2010. Springer Berlin / Heidelberg.

Viola, P. and Jones, M. (2001). Rapid object detection using

a boosted cascade of simple features. In Proc. CVPR

2001. IEEE.

Yang, Y. and Ramanan, D. (2011). Articulated pose estima-

tion with ﬂexible mixtures-of-parts. In Proc. CVPR

2011. IEEE.

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications