Integrating Spatial Layout of Object Parts into Classification without Pairwise Terms

Application to Fast Body Parts Estimation from Depth Images

Mingyuan Jiu, Christian Wolf and Atilla Baskurt

Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, 69621, Villeurbanne, France

Keywords:

Object Detection, Pose Estimation, Spatial Layout, Unary Terms, Randomized Decision Forest, Kinect.

Abstract:

Object recognition or human pose estimation methods often resort to a decomposition into a collection of parts.

This local representation has signiﬁcant advantages, especially in case of occlusions and when the “object”

is non-rigid. Detection and recognition requires modeling the appearance of the different object parts as well

as their spatial layout. The latter can be complex and requires the minimization of complex energy functions,

which is prohibitive in most real world applications and therefore often omitted. However, ignoring the spatial

layout puts all the burden on the classiﬁer, whose only available information is local appearance. We propose

a new method to integrate the spatial layout into the parts classiﬁcation without costly pairwise terms. We

present an application to body parts classiﬁcation for human pose estimation. As a second contribution, we

introduce edge features from gray images as a complement to the well known depth features used for body

parts classiﬁcation from Kinect data.

1 INTRODUCTION

Object recognition is one of the fundamental prob-

lems in computer vision, as well as related problems

like face detection and recognition, person detection,

and associated pose estimation. Local representations

as collections of descriptors extracted from local im-

age patches are very popular. This representation al-

lows robustness against occlusions and permits non-

rigid matching of articulated objects, like humans and

animals.

For object recognition tasks, the known methods

in the literature vary in their degree of usage of spatial

relationships, between methods not using them at all,

for instance the bags of visual words model (Sivic and

Zisserman, 2003), and rigid matching methods using

all available information, e.g. based on RANSAC

(Fischler and Bolles, 1981). The former suffer from

low discriminative power, whereas the latter only

work for rigid transformations and cannot be used

for articulated objects. Methods for non-rigid matching exist. Graph matching and hypergraph matching, for instance, restrict the verification of spatial constraints to neighbors in the graph. However, nontrivial formulations require minimizing complex energy functions and are NP-complete (Torresani et al., 2008; Duchenne et al., 2009).

Pictorial structures, deformable parts based mod-

els, have been introduced as early as in 1973 (Fis-

chler and Elschlager, 1973). The more recent semi-

nal work creates a Bayesian parts based model of the

object and its parts, where the possible relative parts

locations are modeled as a tree structured Markov

random field (Felzenszwalb and Huttenlocher, 2005).

The absence of cycles makes minimization of the underlying energy function relatively fast, although still much slower than for a model without pairwise terms. In

(Felzenszwalb et al., 2010) the Bayesian model is re-

placed with a more powerful discriminative model,

where scale and relative position of each part are

treated as latent variables and searched by Latent

SVM.

A similar problem occurs in tasks where joint ob-

ject recognition and segmentation is required. Lay-

out CRFs and extensions model the object as a col-

lection of local parts (patches or even individual pix-

els), which are related through an energy function

(Winn and Shotton, 2006). However, unlike picto-

rial structures, the energy function here contains cy-

cles which makes minimization more complex, for

instance through graph cuts techniques. Furthermore,

the large number of labels makes the expansion move

algorithms inefﬁcient (Kolmogorov and Zabih, 2004).

In the original paper (Winn and Shotton, 2006), and


Jiu M., Wolf C. and Baskurt A..

Integrating Spatial Layout of Object Parts into Classiﬁcation without Pairwise Terms - Application to Fast Body Parts Estimation from Depth Images.

DOI: 10.5220/0004278206260631

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2013), pages 626-631

ISBN: 978-989-8565-47-1

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

as in our proposed work, the unary terms are based on

randomized decision forests. Another related applica-

tion which could beneﬁt from this contribution is full

scene labelling (Farabet et al., 2012).

Pose estimation methods are also often naturally

solved through a decomposition into body parts. A

preliminary pixel classiﬁcation step segments the ob-

ject into body parts, from which the joint positions

can be estimated in a second step. The well known

method used for the MS Kinect system completely

ignores the spatial relationships between the object parts and puts the entire classification burden on the pixelwise classifier (Shotton et al., 2011). The decision function to be learned by the classifier is complex and therefore requires a learning machine with a

complex architecture, which is difﬁcult to learn. The

good performance of the system has been obtained

with an extremely large training set of 2 · 10

training

vectors extracted from 1 million images and training

on a computation cluster with 1000 nodes.

In this paper we propose a method which seg-

ments an object into parts through pixelwise classi-

ﬁcation and which integrates the spatial layout of the

part labels. Like the methods ignoring the spatial lay-

out, it is extremely fast as no additional step needs

to be added to pixelwise classiﬁcation and no energy

minimization is necessary. The (slight) additional

computational load only concerns learning at an of-

ﬂine stage. The goal is not to compete with methods

based on energy minimization, which is impossible

through pixelwise classiﬁcation only. The objective

is to improve the performance of pixelwise classiﬁca-

tion by using all available information during learn-

ing.

Classical learning machines working on data em-

bedded in a vector space, like neural networks, SVM,

randomized decision trees, Adaboost etc., are in principle capable of learning arbitrarily complex decision functions, if the underlying prediction model (architecture) is complex enough. In reality the available

amount of training data and computational complex-

ity limit the complexity which can be learned. In most

cases only few data are available with respect to the

complexity of the problem. It is therefore often useful

to impose some structure on the model. In this work

we propose to use prior knowledge in the form of the

spatial layout of the labels to add structure to the de-

cision function learned by the learning machine.

The contributions of this paper are twofold:

• The integration of the spatial layout of part labels

into learning machines, in particular randomized

decision forests;

• We introduce features extracted from edges cal-

culated on the gray image and show that they can

provide valuable complementary information to

the traditional depth features.

The paper is organized as follows: section 2 presents

the learning procedure which integrates the spatial

layout into a randomized decision forest. Section 3

introduces edge comparison features which can com-

plement the classical depth features for pose estima-

tion. Section 4 explains the experiments on pose esti-

mation, and section 5 ﬁnally concludes.

2 LEARNING OBJECT PART CLASSIFIERS FROM SPATIAL LAYOUTS

We consider problems where the pixels i of an image are classified as belonging to one of L target labels by a learning machine whose alphabet is $\mathcal{L} = \{1 \ldots L\}$. To this end, descriptors $F_i$ are extracted on each pixel i and a local patch around it, and the learning machine takes a decision $l_i \in \mathcal{L}$ for each pixel. A powerful

prior can be deﬁned over the set of possible labellings

for a spatial object. Beyond the classical Potts model

known from image restoration (Geman and Geman,

1984), which favors equal labels for neighboring pix-

els over unequal labels, additional (soft) constraints

can be imposed. Labels of neighboring pixels can

be supposed to be equal, or at least compatible, i.e.

belonging to parts which are neighbors in the spatial

layout of the object. In computer vision, constraints of this kind are often modeled as soft constraints through the potentials of a global energy function:

$$E(l_1, \ldots, l_N) = \sum_i U(l_i, F_i) + \mu \sum_{i \sim j} D(l_i, l_j) \tag{1}$$

where the unary terms U (·) integrate conﬁdence of

a pixelwise employed learning machine and the pair-

wise terms D(·, ·) are over couples of neighbors i ∼ j.

In the case of certain simple models like the Potts

model, the energy function is submodular and the ex-

act solution can be calculated in polynomial time us-

ing graph cuts (Kolmogorov and Zabih, 2004). Tak-

ing the spatial layout of the object parts into account

results in non-submodular energy functions which are

difﬁcult to solve. Let’s note that even the complexity

of the submodular problem (quadratic on the number

of pixels in the worst case) is far beyond the complex-

ity of pixelwise classiﬁcation with unary terms only.
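To make Eq. (1) concrete, the following minimal sketch evaluates the energy of a given labelling on a 4-connected pixel grid. It is an illustration under stated assumptions, not the formulation used in this paper: the pairwise term is taken to be the Potts model (cost 1 for unequal neighboring labels, 0 otherwise), the unary cost is assumed to be the negative log of a classifier confidence, and the names `potts_energy`, `labels` and `confidences` are hypothetical.

```python
import math

def potts_energy(labels, confidences, mu=1.0):
    """Energy of Eq. (1) on a 4-connected grid with a Potts pairwise term.

    labels[r][c]      -- integer label of pixel (r, c)
    confidences[r][c] -- dict mapping label -> classifier confidence in (0, 1]
    Assumptions: U = -log(confidence); D = 1 if the two labels differ, else 0.
    """
    rows, cols = len(labels), len(labels[0])
    unary = 0.0
    pairwise = 0
    for r in range(rows):
        for c in range(cols):
            unary += -math.log(confidences[r][c][labels[r][c]])
            # Count every neighboring pair once (right and down neighbors).
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                pairwise += 1
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                pairwise += 1
    return unary + mu * pairwise
```

Evaluating the energy of one labelling is cheap; the cost discussed above comes from searching for the labelling that minimizes it.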

The goal of our work is to improve the learning

machine in the case where it is the only source of information, i.e. no pairwise terms are used for classification. Traditional learning algorithms in this context are supervised and use as their only input the training feature vectors $f_i$ and the training labels $l_i$, where

IntegratingSpatialLayoutofObjectPartsintoClassificationwithoutPairwiseTerms-ApplicationtoFastBodyParts

EstimationfromDepthImages

627

i is over the pixels of the training set. We propose

to provide the learning machine with additional infor-

mation, namely the spatial layout of the labels of the

alphabet L .

2.1 Randomized Decision Forests

In this paper we focus on randomized decision forests

(RDF) as learning machines, because they have been shown to outperform other learning machines in this

kind of problem and they have become very popu-

lar in computer vision lately (Shotton et al., 2011).

Decision trees, as simple tree structured classiﬁers

with decision and terminal nodes, suffer from over-

ﬁtting. Randomized forests, on the other hand, over-

come this drawback by integrating distributions over

several trees.

The classical learning algorithm for RDFs (Lepetit et al., 2004) trains each tree separately, layer

by layer. Each layer is also trained separately, which

allows the training of deep trees with a complex pre-

diction model. The drawback of this approach is the

absence of any gradient on the error during training.

Instead, training maximizes the gain in information

based on Shannon entropy. In the following we give

a short description of the classical training procedure.

We describe the version of the learning algorithm

from (Shotton et al., 2011) which jointly learns fea-

tures and the parameters of the tree, i.e. the thresh-

olds for each decision node. We denote by θ the set

of all learned parameters (features and thresholds) for

each decision node. For each tree, a subset of training

instances is randomly sampled with replacement.

1. Randomly sample a set of candidates θ.

2. Partition the set of input vectors into two sets, one for the left child and one for the right child, according to the threshold τ ∈ θ. Denote by Q the label distribution of the parent and by $Q_l(\theta)$ and $Q_r(\theta)$ the label distributions of the left and the right child node, respectively.

3. Choose θ with the largest gain in information:

$$\theta^* = \arg\max_\theta G(\theta) = \arg\max_\theta \; H(Q) - \sum_{s \in \{l,r\}} \frac{|Q_s(\theta)|}{|Q|} H(Q_s(\theta)) \tag{2}$$

where H(Q) is the Shannon entropy of the class distribution of set Q.

4. Recurse on the left and right children until the predefined depth or the largest gain is reached.
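The split selection in steps 1 to 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: candidates are simplified to hypothetical (feature index, threshold) pairs over precomputed feature vectors, entropy is measured in bits (the logarithm base is an assumption), and the function names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """G(theta) of Eq. (2) for one split of `parent` into `left` and `right`."""
    n = len(parent)
    children = sum(len(s) / n * entropy(s) for s in (left, right))
    return entropy(parent) - children

def best_split(samples, candidates):
    """Return the candidate (feature_index, threshold) with the largest gain.

    samples    -- list of (feature_vector, label) pairs reaching the node
    candidates -- iterable of (feature_index, threshold) pairs (step 1)
    """
    parent = [label for _, label in samples]

    def gain(theta):
        k, tau = theta
        left = [label for f, label in samples if f[k] < tau]    # step 2
        right = [label for f, label in samples if f[k] >= tau]
        return information_gain(parent, left, right)

    return max(candidates, key=gain)                             # step 3
```

A full tree would apply `best_split` recursively to the two resulting subsets (step 4).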

Figure 1: An example of three parts: (a) part layout; (b) a parent distribution and its two child distributions for a given θ (classical: G(θ) = 0.37; spatial: $G^S(\theta)$ = 0.22); (c) a second, more favorable case (classical: G(θ) = 0.37; spatial: $G^S(\theta)$ = 0.31). The entropy gains for the spatial learning cases are given with λ = 0.3.

2.2 Spatial Learning for Randomized Decision Forests

In what follows we will integrate the additional infor-

mation on the spatial layout of the object parts into the

training algorithm, which will be done without using

any information on the training error. Let us first recall that the target alphabet of the learning machine is $\mathcal{L} = \{1 \ldots L\}$, and then imagine that we create groups of pairs of two labels, giving rise to a new alphabet $\mathcal{L}' = \{11, 12, \ldots 1L, 21, 22, \ldots 2L, \ldots, LL\}$. Each of the new labels is a combination of two original labels. Assuming independent and identically distributed (i.i.d.) original labels, the probability of a new label ij consisting of the pair of original labels i and j is the product of the original probabilities, i.e. p(ij) = p(i)p(j). The Shannon entropy of a distribution $Q'$ over the new alphabet is therefore

$$H(Q') = \sum_k -p(k) \log p(k) \tag{3}$$

where k is over the new alphabet. This can be ex-

pressed in terms of the original distribution over the

original alphabet:

$$H(Q') = \sum_{i,j} -p(i)\,p(j) \log[p(i)\,p(j)] \tag{4}$$
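A quick numerical check of Eq. (4) may help here. Under the i.i.d. assumption, the entropy of the pairwise distribution is exactly twice the entropy of the original one, so the pairwise alphabet only becomes informative once it is split into subsets. The label distribution `p` below is a hypothetical example, not data from the paper.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def pair_distribution(p):
    """Distribution over ordered label pairs (i, j), with p(ij) = p(i) p(j)."""
    return [pi * pj for pi in p for pj in p]

p = [0.5, 0.3, 0.2]            # hypothetical distribution over three labels
pairs = pair_distribution(p)   # nine pairwise labels 11, 12, ..., 33

# The entropy of the product distribution is the sum of the two marginal entropies.
print(entropy(pairs), 2 * entropy(p))
```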

We can now separate the new pairwise labels into two different subsets, the set of neighboring labels $\mathcal{L}'_1$, and the set of non-neighboring labels $\mathcal{L}'_2$, with $\mathcal{L}' = \mathcal{L}'_1 \cup \mathcal{L}'_2$. We suppose that each original label is a neighbor of itself. In the same way, a distribution $Q'$ over the new alphabet can be split into two different distributions $Q'_1$ and $Q'_2$ from these two subsets.

Then a learning criterion can be defined using the gain in information obtained by parameters θ as a sum over two parts of the histogram $Q'$, each part being calculated over one subset of the labels:

$$G^S(\theta) = \lambda\, G_1(\theta) + (1 - \lambda)\, G_2(\theta) \tag{5}$$

VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications

628

where, for $k \in \{1, 2\}$,

$$G_k(\theta) = H(Q'_k) - \sum_{s \in \{l,r\}} \frac{|Q'_{k,s}(\theta)|}{|Q'_k|} H(Q'_{k,s}(\theta)) \tag{6}$$

Here, λ is a weight, and λ < 0.5 in order to give the separation of non-neighboring labels a higher priority.

Let’s consider a simple parts based model with

three parts numbered from 1 to 3 shown in ﬁgure 1.

We suppose that part 1 is a neighbor of 2, that 2 is a

neighbor of 3, but that 1 is not a neighbor of 3. Let’s

also consider two cases where a set of tree parameters θ = {u, v, τ} splits a label distribution Q into two distributions, the left distribution $Q_l(\theta)$ and the right one $Q_r(\theta)$.

The distributions for the two different cases are given in Figures 1b and 1c, respectively. We do not compare the individual values of the Shannon entropy gain between the classical measure and the spatial measure, as the former is calculated from the unary distribution and the latter from a pairwise distribution. However, the difference between the two cases can be observed using the same measure. It can be seen that the classical measure is equal for both cases: the entropy gains are both 0.37. If we take into account the spatial layout of the different parts, we can see that the entropy gain is actually higher in the second case ($G^S(\theta)$ = 0.31) than in the first case ($G^S(\theta)$ = 0.22) when setting λ = 0.3. In the first case, the highest gain in information is obtained for parts 2 and 3, which are neighbors. In the second case, however, the highest gain comes from parts 1 and 3, which are not neighbors. This is consistent with the higher priority given above to separating non-neighboring labels.
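The behavior described in this example can be reproduced qualitatively with a small sketch of the spatial criterion of Eqs. (5) and (6). Several details are assumptions made for illustration only: each sub-histogram is renormalized before its entropy is taken, the child weights are the fractions of samples sent to each child, and the label counts used in the usage note are hypothetical (they are not the distributions of Figure 1, so the values 0.22 and 0.31 are not reproduced).

```python
import math
from itertools import product

def _entropy(weights):
    """Shannon entropy of a non-negative weight list after renormalization."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum(w / total * math.log(w / total) for w in weights if w > 0)

def _pair_entropies(counts, is_neighbor):
    """Entropies of the neighboring and non-neighboring parts of the pair histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0, 0.0
    p = {k: c / n for k, c in counts.items()}
    near = [p[i] * p[j] for i, j in product(p, p) if is_neighbor(i, j)]
    far = [p[i] * p[j] for i, j in product(p, p) if not is_neighbor(i, j)]
    return _entropy(near), _entropy(far)

def spatial_gain(parent, left, right, neighbors, lam=0.3):
    """Sketch of Eq. (5): lam * neighbor gain + (1 - lam) * non-neighbor gain.

    parent, left, right -- dict label -> sample count at the node and its children
    neighbors           -- set of unordered label pairs, e.g. {(1, 2), (2, 3)}
    """
    def is_neighbor(i, j):
        # Each label is considered a neighbor of itself.
        return i == j or (i, j) in neighbors or (j, i) in neighbors

    n = sum(parent.values())
    gains = list(_pair_entropies(parent, is_neighbor))
    for child in (left, right):
        weight = sum(child.values()) / n
        h_child = _pair_entropies(child, is_neighbor)
        gains = [g - weight * h for g, h in zip(gains, h_child)]
    return lam * gains[0] + (1 - lam) * gains[1]
```

With the three-part layout above (`neighbors = {(1, 2), (2, 3)}`) and λ = 0.3, a split that separates parts 1 and 3 perfectly scores higher under this sketch than a split that separates parts 1 and 2 perfectly, although both have the same classical gain.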

3 DEPTH AND EDGE FEATURES

In (Shotton et al., 2011), depth features have been

proposed for pose estimation from Kinect depth im-

ages. One of their main advantages is their simplicity

and their computational efﬁciency. Brieﬂy, at a given

pixel x, the depth difference between two offsets cen-

tered at x is computed:

$$f_\theta(I, x) = d_I\!\left(x + \frac{u}{d_I(x)}\right) - d_I\!\left(x + \frac{v}{d_I(x)}\right) \tag{7}$$

where $d_I(x)$ is the depth at pixel x in image I, and the parameters θ = (u, v) are two offsets, which are normalized by the current depth for depth invariance. A single

feature vector contains several differences, each com-

parison value being calculated from a different pair of

offsets u and v. These offsets are learned during train-

ing together with the prediction model, as described

in section 2.
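A single depth comparison of Eq. (7) can be sketched as below. The handling of probes that fall outside the image (a large constant `background` value), the rounding of scaled offsets to the nearest pixel, and the function name are assumptions made for this illustration; the depth at x is assumed to be strictly positive.

```python
def depth_feature(depth, x, u, v, background=1e6):
    """Depth comparison feature of Eq. (7) at pixel x = (row, col).

    depth      -- 2D list of depth values (rows of floats, all > 0)
    u, v       -- offsets (d_row, d_col), scaled by 1 / depth at x
    background -- assumed value returned for probes outside the image
    """
    d_x = depth[x[0]][x[1]]

    def probe(offset):
        # Normalizing by the depth at x makes the offset depth invariant.
        r = x[0] + int(round(offset[0] / d_x))
        c = x[1] + int(round(offset[1] / d_x))
        if 0 <= r < len(depth) and 0 <= c < len(depth[0]):
            return depth[r][c]
        return background

    return probe(u) - probe(v)
```

A feature vector is then a list of such values, one per sampled pair of offsets (u, v).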

3.1 Edge Features

Psychophysical studies show that we can recognize an object from its contour alone. In (Shotton et al., 2008), contour is defined as the outline (silhouette) together with the internal edges of the object, which make it possible to represent the spatial structure of the object. Here we

extend this concept further by introducing edge com-

parison features extracted from the grayscale image.

We propose two different types of features based on

edges, the ﬁrst using edge magnitude, and the second

using edge orientation.

In our settings, we need features whose positions

can be sampled by the training algorithm. How-

ever, contours are usually sparsely distributed, which

means that comparison features can not directly be

applied to edge images. Our solution to this problem

is inspired by chamfer distance matching, which is

a classical method to measure the similarity between

contours (Barrow et al., 1977). We compute a dis-

tance transform on the edge image, where the value

of each pixel is the distance to its nearest edge. Given

a grayscale image I and its binary edge image E, the distance transform $DT_E$ is computed as:

$$DT_E(x) = \min_{x' : E(x') = 1} \|x - x'\| \tag{8}$$

The distance transform can be calculated in linear

time using a two-pass algorithm.
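As an illustration of a two-pass scheme, the sketch below computes a distance transform with one forward and one backward raster scan. It uses the city-block (L1) metric, for which two passes are exact; this choice of metric is an assumption made for simplicity, and the paper does not state which norm or which two-pass variant is used.

```python
def distance_transform(edges):
    """Two-pass city-block distance to the nearest edge pixel.

    edges -- 2D list of 0/1 values (1 marks an edge pixel)
    """
    rows, cols = len(edges), len(edges[0])
    inf = rows + cols  # larger than any city-block distance inside the image
    dt = [[0 if edges[r][c] else inf for c in range(cols)] for r in range(rows)]
    # Forward pass: propagate distances from the top and left neighbors.
    for r in range(rows):
        for c in range(cols):
            if r > 0:
                dt[r][c] = min(dt[r][c], dt[r - 1][c] + 1)
            if c > 0:
                dt[r][c] = min(dt[r][c], dt[r][c - 1] + 1)
    # Backward pass: propagate distances from the bottom and right neighbors.
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            if r + 1 < rows:
                dt[r][c] = min(dt[r][c], dt[r + 1][c] + 1)
            if c + 1 < cols:
                dt[r][c] = min(dt[r][c], dt[r][c + 1] + 1)
    return dt
```

Each pixel is visited twice and compared with two neighbors per visit, so the cost is linear in the number of pixels.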

The edge magnitude feature is defined as:

$$f^{EM}_\theta(x) = DT_E(x + u) + DT_E(x + v) \tag{9}$$

where u and v are the same offsets as in (7). This feature indicates the presence of edges near the two offsets.

Edge orientation features can be computed in a similar way. While computing the distance transform, we also obtain an orientation image $O_E$, in which the value of each pixel is the orientation of its nearest edge:

$$O_E(x) = \mathrm{Orientation}\!\left(\arg\min_{x' : E(x') = 1} \|x - x'\|\right) \tag{10}$$

The feature is computed as the difference in orientation for two offsets:

$$f^{EO}_\theta(x) = O_E(x + u) - O_E(x + v) \tag{11}$$

where the minus operator takes into account the circular nature of angles. We discretize the orientation to alleviate the effect of noise.
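The circular minus operator and the discretization can be sketched as follows. The period of π (edge orientations treated as undirected), the wrapping interval and the number of bins are assumptions chosen for this example; the paper does not specify them.

```python
import math

def orientation_difference(a, b, period=math.pi):
    """Signed circular difference a - b, wrapped to [-period/2, period/2)."""
    half = period / 2.0
    return (a - b + half) % period - half

def quantize_orientation(angle, bins=8, period=math.pi):
    """Index of the orientation bin containing `angle` (bins cover [0, period))."""
    return int((angle % period) / period * bins) % bins
```

For instance, with orientations expressed in degrees and a period of 180, the difference between 170 and 10 wraps to -20 rather than 160, because the two orientations are only 20 degrees apart on the circle.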

The objective of both features is to capture the

edge distribution at speciﬁc locations in the image,

which will be learned by the RDF.

IntegratingSpatialLayoutofObjectPartsintoClassificationwithoutPairwiseTerms-ApplicationtoFastBodyParts

EstimationfromDepthImages

629

4 EXPERIMENTS

We performed experiments for an application of pose

estimation. We would like to point out that the learn-

ing method can be applied to any parts based model

which integrates pixelwise classiﬁcation with random

forests, for instance methods for joint object recogni-

tion and segmentation.

The proposed algorithm has been evaluated on the

CDC4CV Poselets dataset (Holt et al., 2011). Our

goal is not to beat the state of the art in pose esti-

mation, but to show that spatial learning is able to

improve pixelwise classiﬁcation of parts based mod-

els. The dataset contains upper body poses taken

with Kinect and consists of 345 training and 347

test depth images. The authors also supplied cor-

responding annotation ﬁles which contain the loca-

tions of 10 articulated parts: head(H), neck(N), left

shoulder(LS), right shoulder(RS), left elbow(LE), left

hand(LHA), right elbow(RE), right hand(RHA), left

hip(LH), right hip(RH). We created groundtruth seg-

mentations through nearest neighbor labeling. In our

experiments, the left/right elbow (LE,RE) and hand

(LHA,RHA) parts were extended to left/right upper

arm (LUA, RUA) and forearm (LFA, RFA) parts. We also defined the part below the waist as other (the black area in Figure 2b).

Unless otherwise speciﬁed, the following param-

eters have been used for RDF learning: 3 trees each

with a depth of 9; 2000 randomly selected pixels per

image, roughly distributed across the body; 4000 can-

didate pairs of offsets; 22 candidate thresholds; off-

sets and thresholds have been learned separately for

each node in the forest. For spatial learning, 28 pairs

of neighbors have been identiﬁed between the 10 parts

based on a pose where the subject stretches his arms.

The parameter λ was set to 0.4.

We evaluate our method at two levels: pixelwise

classiﬁcation and pairs of parts recognition. Pixel-

wise decisions are directly provided by the random

forest. Part localizations are obtained from the pixel-

wise results through pixel pooling. We create a poste-

rior probability map for each part from the results on

RDF. After non-maximum suppression and low pass

ﬁltering, the location with largest response is used

as an estimate of the part, and then the positions of

pairs of detected parts are calculated, which approxi-

mately correspond to joints and serve as the interme-

diate pose indicator. In the following, we denote them

by the pair of neighboring parts.
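The localization of a single part from its posterior map can be sketched as below. This is a simplified stand-in for the pooling step: a 3x3 box filter plays the role of the low pass filter, non-maximum suppression is reduced to taking the global maximum, and the function name and window size are assumptions made for the example.

```python
def localize_part(posterior):
    """Return (row, col) of the strongest response after a 3x3 box filter.

    posterior -- 2D list with the per-pixel posterior probability of one part
    """
    rows, cols = len(posterior), len(posterior[0])
    best_score, best_pos = -1.0, None
    for r in range(rows):
        for c in range(cols):
            # Average over the 3x3 neighborhood, clipped at the image border.
            window = [posterior[rr][cc]
                      for rr in range(max(0, r - 1), min(rows, r + 2))
                      for cc in range(max(0, c - 1), min(cols, c + 2))]
            score = sum(window) / len(window)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos
```

Running this once per part gives one location per part; pairs of neighboring part locations then serve as the intermediate pose indicator.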

Table 1 shows the classification accuracies of the three settings. A baseline has been created with classical RDF learning and depth features. Results for spatial learning with depth features only and with depth and edge features together are also shown in Table 1.

Table 1: Results on body part classification at the pixelwise level: D=depth features; E=edge features.

Setting                   Accuracy
Classical RDF with D      60.30%
Spatial D, λ = 0.4        61.05%
Spatial D+E, λ = 0.4      67.66%

We can see that spatial learning can obtain a performance gain, although the layout is used only in the prediction model and no pairwise terms have been used. Figure 2 shows some classification examples, which demonstrate that spatial learning makes the randomized forest more discriminative. The segmentation output is cleaner, especially at the borders.

At part level, we report our results of pairs of parts

according to the estimation metric by (Ferrari et al.,

2008): a pair of groundtruth parts is matched to a de-

tected pair if and only if the endpoints of the detected

pairs lie within a circle of radius r=50% of the length

of the groundtruth pair and centered on it. Table 2

shows our results on part level using several settings.

It demonstrates that spatial learning improves recognition performance for most parts.
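The matching criterion can be sketched as below. The wording above leaves some room for interpretation; this sketch assumes that each detected endpoint must lie within a circle centered on the corresponding groundtruth endpoint, with radius r times the length of the groundtruth pair. The function name and the tuple layout are hypothetical.

```python
import math

def pair_is_matched(gt_pair, det_pair, ratio=0.5):
    """True if both detected endpoints fall inside their groundtruth circles.

    gt_pair, det_pair -- ((x1, y1), (x2, y2)) endpoints of a pair of parts
    ratio             -- circle radius as a fraction of the groundtruth length
    """
    (g1, g2), (d1, d2) = gt_pair, det_pair
    radius = ratio * math.dist(g1, g2)
    return math.dist(g1, d1) <= radius and math.dist(g2, d2) <= radius
```

The correct rate for one pair of parts is then the fraction of test images for which this predicate holds.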

The experiments at both pixelwise and part level

demonstrate that spatial learning makes randomized

forest more discriminative by integrating the spatial

layout into its prediction model. This proposition is

very simple and fast to implement, as the standard

pipeline can still be used. Only the learning method has been changed; the testing code is unchanged.

There is no additional computational burden whatso-

ever during testing; a slight increase in computational

complexity can be observed for learning. No complex

discrete optimization problems need to be solved.

5 CONCLUSIONS

In this paper, we proposed a novel learning algorithm for randomized decision forests which integrates information on the spatial layout of target labels. The classification algorithm has exactly the same computational complexity; a slightly higher computational burden is put on the learning stage. We applied our algorithm to body part classification, although any other application requiring the segmentation of an object into parts may benefit from the contribution. Another contribution extends the well known depth comparison features to edge comparison features obtained from grayscale images. Results show that RDFs indeed benefit from the integration of the spatial layout of parts and the edge features.

VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications

630

Figure 2: Examples of the pixelwise classification. Each row is an example and each column is one kind of classification result: (a) test depth image; (b) part segmentation; (c) classical classification; (d) spatial learning with depth features; (e) spatial learning with depth features and edge features.

Table 2: Correct rate (%) on pairs of parts for different feature settings at the part level: D=depth features; E=edge features.

Setting                H-N    LS-RS  LS-LUA  LUA-LFA  RS-RUA  RUA-RFA  LS-LH  RS-RH  LH-RH  Ave.
Classical E            46.69  0.29   20.46   2.02     0       21.90    77.81  1.15   19.88  21.13
Spatial E, λ = 0.4     52.45  0      25.65   1.44     0       14.41    82.13  0.86   22.48  22.16
Classical D            85.30  0.29   39.77   1.15     0.29    17.58    49.86  0.29   16.14  23.41
Spatial D, λ = 0.4     88.47  1.15   34.58   0.28     0.28    10.66    67.72  5.18   28.24  26.29
Spatial D+E, λ = 0.4   89.05  0      53.31   0.86     0       25.65    72.91  0      13.54  28.37

REFERENCES

Barrow, H., Tenenbaum, J., Bolles, R., and Wolf, H. (1977).

Parametric correspondence and chamfer matching:

two new techniques for image matching. In Joint con-

ference on artiﬁcial intelligence, pages 659–663.

Duchenne, O., Bach, F. R., Kweon, I.-S., and Ponce, J.

(2009). A tensor-based algorithm for high-order graph

matching. In CVPR, pages 1980–1987.

Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2012).

Scene parsing with multiscale feature learning, purity

trees, and optimal covers. In ICML.

Felzenszwalb, P., Girshick, R., McAllester, D., and Ra-

manan, D. (2010). Object detection with discrimina-

tively trained part based models. IEEE Transactions

on PAMI, 32(9):1627–1645.

Felzenszwalb, P. F. and Huttenlocher, D. P. (2005). Pictorial

Structures for Object Recognition. IJCV, 61(1):55–

79.

Ferrari, V., Marin-Jimenez, M., and Zisserman, A. (2008).

Progressive search space reduction for human pose es-

timation. In CVPR, pages 1–8.

Fischler, M. and Bolles, R. (1981). Random sample consen-

sus: A paradigm for model ﬁtting with applications

to image analysis and automated cartography. Communications of the ACM, 24(6):381–395.

Fischler, M. and Elschlager, R. (1973). The representation

and matching of pictorial structures. IEEE Transactions on Computers, 22(1):67–92.

Geman, S. and Geman, D. (1984). Stochastic Relaxation,

Gibbs Distributions, and the Bayesian Restoration of

Images. IEEE Transactions on PAMI, 6(6):721–741.

Holt, B., Ong, E.-J., Cooper, H., and Bowden, R. (2011).

Putting the pieces together: Connected poselets for

human pose estimation. In ICCV Workshop on Con-

sumer Depth Cameras for Computer Vision.

Kolmogorov, V. and Zabih, R. (2004). What energy func-

tions can be minimized via graph cuts? IEEE Trans-

actions on PAMI, 26(2):147–159.

Lepetit, V., Pilet, J., and Fua, P. (2004). Point matching as a

classiﬁcation problem for fast and robust object pose

estimation. In CVPR, pages 244–250.

Shotton, J., Blake, A., and Cipolla, R. (2008). Mul-

tiscale categorical object recognition using contour

fragments. IEEE Transactions on PAMI, 30(7):1270–1281.

Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio,

M., Moore, R., Kipman, A., and Blake, A. (2011).

Real-time human pose recognition in parts from single

depth images. In CVPR.

Sivic, J. and Zisserman, A. (2003). Video google: A text

retrieval approach to object matching in videos. In

ICCV, volume 2, pages 1470–1477.

Torresani, L., Kolmogorov, V., and Rother, C. (2008). Fea-

ture correspondence via graph matching: Models and

global optimization. In ECCV, volume 2, pages 596–

609.

Winn, J. and Shotton, J. (2006). The layout consistent ran-

dom ﬁeld for recognizing and segmenting partially oc-

cluded objects. In CVPR, volume 1, pages 37–44.

IntegratingSpatialLayoutofObjectPartsintoClassificationwithoutPairwiseTerms-ApplicationtoFastBodyParts

EstimationfromDepthImages

631