LABELING HUMAN MOTION SEQUENCES USING GRAPHICAL

MODELS

Jos´e I. G´omez

Dpt. Computer Science, University of Jaen, Campus Las Lagunillas, Jaen, Spain

Manuel J. Mar´ın-Jim´enez, Nicolas P´erez de la Blanca

Dpt. Computing Science and Artiﬁcial Intelligence, University of Granada, ETSI Informatica y Telecomunicacion

Granada, Spain

Keywords:

Data labeling, Pose estimation, Graphical models, Human motion.

Abstract:

Graphical models have proved to be very efﬁcient models for labeling image data. In particular, they have

been used to label data samples from human body images. In this paper, the use of graphical models is studied

for human-body landmark localization. Here a new algorithm based on the Branch&Bound methodology,

improving the state of the art, is presented. The initialization stage is deﬁned as a local optimum labeling of the

sample data. An iterative improvement is given on the labeling space in order to reach new graphs with a lower

cost than the current best one. Two branch prune strategies are suggested under a B&B approach in order to

speed up the search: a) the use of heuristics; and b) the use of a node dominance criterion. Experimental results

on human motion databases show that our proposed algorithm behaves better than the classical Dynamic

Programming based approach.

1 INTRODUCTION

In this paper, graphical models based on Decompos-

able Triangulated Graphs (DTG) are used to approach

the human-body landmark localization and tracking

problems. Our approach is based on maximizing the

likelihood of the probabilistic graphical model from a

set of image measurements. Assuming that the object

is present in the image, the localization problem rela-

bels the sample data (image features) as part of the ob-

ject or background. Different approaches have been

proposed to label or match image data using graphi-

cal models (Gold and Rangarajan, 1996), (Grimson,

1990), (Haralick and Shapiro, 1992), (Rangarajan

et al., 1997), (Ullman, 1996). Some of these tech-

niques are very general in the sense that they are ap-

plied on any graphical model. However, these ap-

proaches are difﬁcult to apply on data with a substan-

tial amount of clutter. In (Caelli and Caetano, 2005),

graphical models are favorably compared with tradi-

tional relaxation approaches. Figure 1 broadly shows

the main steps of our approach to label new images

from the learned DTG models. In the case of human

motion labeling, the probability distribution deﬁned

by the learned graphical model, can be used as a con-

straint on the set of allowed graph conﬁgurations.

The new contributions of this paper are: i) we in-

troduce a new graph labeling algorithm for the DTG

class that improves the standard dynamic program-

ming strategy; ii) we show how the DTG class is able

to capture the human-body pose from different cam-

eras and to identify the point of view.

1.1 Related Work

The DTG class (Amit and Kong, 1996) is introduced

as a simpler graph class to register deformable pat-

terns on static images. More recently, the same graph-

ical model is used to label human body templates in

tracking problems (Song et al., 2003). In both ap-

proaches, a Dynamic Programming algorithm (DP)

was used as the search strategy to ﬁnd the best label-

ing from a learned model (Fischler and Elschlager,

1973). In a DTG, each node represents a landmark of

the object with an associated vector of measurements.

Although dynamic programming is a very convenient

technique to learn DTG models from data and also to

label new sample data, this strategy presents three im-

portant drawbacks that are analyzed in depth in sec-

tion 2.2. We propose a new algorithm that overcomes

488

I. Gómez J., J. Marín-Jiménez M. and Pérez de la Blanca N. (2009).

LABELING HUMAN MOTION SEQUENCES USING GRAPHICAL MODELS.

In Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, pages 488-495

DOI: 10.5220/0001795704880495

 SciTePress

fedcb

Figure 1: Aim of this work. (a) Learned model from data samples ’Gestures’. (b) Input video frame. (c) Detected sample

points around the person. (d) Input to labeling algorithm. (e) Best graph conﬁguration found for input points and model

’Gestures’. Blue circles represent the points selected by our labeling algorithm. (f) Superimposed graph on input video

frame.

those drawbacks, based on a Branch&Bound (B& B)

strategy.

The search for the best labeling is addressed by the

minimum cost path on the tree of all possible label-

ings. However, the best labeling search problem is a

NP-hard problem. Therefore, we approach the search

by using a B& B strategy with an efﬁcient pruning

criterion on the labeling tree. This labeling space is

explored using a ﬁrst-in-depth strategy, delaying the

ﬁnal labeling assignment until a minimum cost so-

lution has been found. Two branch prune strategies

have been suggested under a B&B approach in or-

der to speed up the search: a) the use of heuristics;

and b) the use of a node dominance criterion (Fu-

jino and Fujiwara, 1994), (Ibaraki, 1977), (Yu and

Wah, 1988). Here we use a node dominance criterion

because it best suits our problem. In order to obtain

a worst case efﬁciency of O(N

), the same one us-

ing DP on this problem, we design our algorithm to

get the best trade-off between search time and solu-

tion quality, according to the criterion established in

(Hansen et al., 1997). In the experimental section 4,

we show that our algorithm runs in O(N

) in most of

our experiments.

Section 2 summarizes the probability model and

discusses the labeling problem using a DTG model.

Section 3 introduces our proposed labeling algorithm.

Section 4 shows the experimental results using differ-

ent human-body models. Section 5 presents our con-

clusions.

2 LABELING USING A DTG

2.1 The Probability Model

A Decomposed Triangulated Graph (DTG) is a trian-

gle collection where there is a vertex elimination or-

der so that the elimination of each node only affects

one triangle (Figure 2). A greedy algorithm (Song

et al., 2000), (Song et al., 2003) is proposed to es-

timate the probability model associated with a DTG

from image data. Here we follow the notation intro-

duced in (Song et al., 2003). Let S = {S

,...,S

}

be a set of nodes, and let X

, 1 ≤ i ≤ M be the mea-

sure associated to each node. The probability model

of a DTG describes the conditional dependences and

independences of triplets of features associated with

the triangles of the DTG. In this case, the probability

density of the set of nodes is

P(X

,...,X

) = P

∏

t=1

(1)

where S = {A

,...,A

};

),(A

),...,(A

) are the

triangles, and (A

,...,A

) is the vertex elim-

ination order. Let χ = { X

,...,X

} be a set

of samples where X

= {X

,...,X

}, 1 ≤ n ≤ N

represent the labeled data. We have to ﬁnd the DTG

G that maximizes the likelihood of the samples,

P(χ|G), where

logP(χ|G) ≃ −N·h(X

)−N·

∑

t=1

h(X

)

(2)

where h(·) is differential entropy or conditional dif-

ferential entropy (Cover and Thomas, 1991). In

other words, we want to ﬁnd the best sequence

),(A

),··· ,(A

) that mini-

mizes logP(χ|G).

2.2 Searching for the Best Labeling

The search for the best labeling on a DTG using DP

(Amit and Kong, 1996), (Song et al., 2000) has vari-

ous drawbacks:

A) Using DP the lowest cost node is always selected

as the best candidate in the triangle building process.

The search processis carried out in the vertex elimina-

tion order ﬁxed by the DTG. The label process starts

by searching for the pair of sampled points with the

least entropy and iterates adding new nodes. On each

LABELING HUMAN MOTION SEQUENCES USING GRAPHICAL MODELS

489

Figure 2: Example of DTG model on which the problem of vertex repetition is present. Between brackets the vertex elim-

ination order for the model. (a) The ﬁrst two triangles are obtained. We do not know yet whether these triangles are going

to be in the same DTG; (b) The problem appears when triangle 3 links triangles 1 and 2 in the same DTG; (c) Wrong DTG

obtained.

new iteration the sample point deﬁning the least en-

tropy triangle is labeled. Obviously, this early node

assignment only allows to explore a very small subset

of the labeling space skipping better optimum solu-

tions.

B) The same vertex (label node) can be associated

with different sample points in the labeling process.

In Figure 2 we show a simple example of a DTG with

six vertexes. Let us assume an image with n candidate

points to be labeled as part of the model, where n ≥ 6.

Let us apply a labeling process using DP (Song et al.,

2000), Figure 2 (a) shows two triangles built in the

ﬁrst step of the process. It is important to remark that

in the building process of both triangles the label (D)

has been used. Figure 2 (b) shows the time in which

the DP algorithm decides to join both triangles. Only

at this time it is possible to detect the problem. This

imposes a high cost in time since we must go back-

wards and forwards in the labeling process in order to

solve this problem (2 (c)).

C) The DP efﬁciency lower bound is T·O(N

) on

DTG (Amit and Kong, 1996), with N being the num-

ber of sample points and T being the number of tri-

angles. Nevertheless, the DP solution is not the best

possible one for this computational efﬁciency. In Sec-

tion (4) we show that for most of our experiments it

is possible to improve the DP labeling to O(N

). In

addition, on more complex types of graph the lower

bound of DP efﬁciency decreases linearly with the

size of the greatest clique shown in the graph. This

makes its use on complex graphs very inefﬁcient.

3 ALGORITHM FEP

3.1 General Description

In this section we propose a new algorithm that over-

comes the three drawbacks stated above. Let us as-

sume a DTG model with T triangles and let trian-

gle number T be the base triangle of the model.

The triangle number 1 is then associated to the

ﬁrst vertex in the elimination order of the model.

In order to deﬁne a cost for each base edge on

each triangle and for each full triangle, the entropy

measure associated to the vertexes in each case is

used. In other words, for each triangle t with 1 ≤

t ≤ T, Ψ

) = −h

) deﬁnes the cost

of the base edge (X

), and Ψ

) =

−h

) + h

) deﬁnes the cost of

adding the vertex X

to the edge (X

) in order

to build the triangle (X

). Our goal is to

ﬁnd the labeling minimizing the value of logP(χ|G)

as given by equation.2. On each iteration, we use ϒ to

denote the cost of the best DTG so far and Λ the cost

of building the new DTG. Λ < ϒ should of course be

veriﬁed in order to proceed with the building process.

Let us assume N sample data (points), our ap-

proach is deﬁned by two basic stages: a) to look for

a ﬁrst solution; and b) improve the current solution

until efﬁciency reaches O(N

3.1.1 Looking for the First Solution

We start by building an ordered list of base edges, us-

ing all the possible pairs of nodes, by increasing value

of their associated cost Ψ

). Then we expand

each edge associating to it the ordered list of candi-

date nodes also by increasing value of their cost. It-

erating this process recursively and using a heuristic

take-the-ﬁrst on each expanding list, an initial best so-

lution is obtained. The cost of this initial solution is

the ϒ initial value. The improving stage is deﬁned

from a prune strategy on the full labeling tree.

3.1.2 Iterative Improvement

For each list of candidate nodes to a triangle t we de-

ﬁne a dynamic acceptance threshold, DAT, as a pro-

portional value to the difference in entropy to the cur-

rent best solution, ϒ − Λ, divided by the remaining

number of triangles to ﬁt T − t + 1,

DAT =

ϒ− Λ

T − t + 1

> 0 (3)

VISAPP 2009 - International Conference on Computer Vision Theory and Applications

490

Figure 3: Algorithm FEP: (a) The initial best solution; (b) Searching for a better solution: if a worse DTG is found the

branch is bounded; (c) When a better solution is reached, the best solution is updated. This process is repeated until no

more candidate solutions left or O(N

) is reached.

This implies that as the iterative process pro-

gresses, the acceptance threshold for a new node de-

creases, thereby eliminating high cost candidates. Us-

ing the ﬁtting cost and DAT value of the initial solu-

tion we initialize the best ﬁtting cost ϒ and the upper

bound for the DAT value.

The iterative improvement progress from the ini-

tial solution as follows: i) we move two levels up in

the labeling tree (Figure 3 (b)) from the leaf node of

the initial solution; ii) from this new location we try to

build a new solution branching down the second can-

didate of the associated node list (Figure 3(b)); iii) if

the DAT value of the current node is lower than the

upper bound for DAT, we move down ﬁrst-in-depth

adding the current node to the labeling; iv) we iter-

ate again on the new node, using the same decision

criteria until we reach a leaf node or the DAT value

in the new node has a higher value than the DAT up-

per bound. On each leaf node we calculate ϒ− Λ, the

ﬁtting cost difference between the new solution and

the best solution so far. If negative, we accept the

new solution as the best current solution and we up-

date the Λ and DAT upper bound value. Otherwise,

we go up in the labeling tree and branch in breadth in

order to evaluate the next node. If we reach an O(N

)

efﬁciency during the search, we stop the process ac-

cepting as ﬁnal solution the best current one.

3.2 Computational Complexity

We apply the technique proposed in (Hansen et al.,

1997) to convert any heuristic search algorithm into

an anytime algorithm offering a trade-off between

search time and solution quality. In the ﬁtting of

a DTG using DP the computational complexity is

always O(T · N

) because the number of evaluated

nodes is NMAX

= T · N · (N − 1) · (N − 2) (see

(Amit and Kong, 1996), (Song et al., 2003)). Here

we assume, without loss of generality, that the num-

ber of triangles of the DTG is T > 2, and the num-

ber of points on each frame is N > 6 . In this con-

H(13)

N(14)

LS(11)

RS(1)

LE(10)

RE(7)

LW(9)

RW(6)

LH(8) RH(12)

LK(4)

RK(5)

LA(2)

RA(3)

H(3)

N(11)

LS(13)

RS(1)

LE(14)

RE(8)

LW(12)

RW(7)

LH(10)

RH(9)

LK(5)

RK(6)

LA(2)

RA(4)

H(4)

N(13)

LS(14)

RS(1)

LE(12)

RE(8)

LW(11)

RW(7)

LH(9)

RH(10)

LK(5)

RK(6)

LA(2)

RA(3)

(c)

(b)

(a)

H(12)

N(10)

LS(14)

RS(11)

LE(13)

RE(3)

LW(1)

RW(2)

LH(4)

RH(9)

LK(6)

RK(8)

LA(5)

RA(7)

H(11)

N(10)

LS(8)

RS(9)

LE(3)

RE(4)

LW(1)

RW(2)

LH(5)

RH(12)

LK(13)

RK(7)

LA(14)

RA(6)

H(8)

N(5)

LS(6)

RS(7)

LE(4)

RE(3)

LW(2)

RW(1)

LH(13)

RH(12)

LK(14)

RK(10)

LA(11)

RA(9)

(e)

(f)

(d)

Figure 4: Caltech database (top row): Human DTG models

for several points of view: (a) 0, (b) π/2, (c) π/4. Hu-

manEva database (bottom row): frontal models for differ-

ent actions: (d) Boxing, (e) Gestures and (f) ThrowCatch.

In brackets, vertex elimination order for each DTG. In all

cases N = 14 and T = 12.

text, N

< NMAX

. In FEP we set NMAX

as the

maximum number of nodes to evaluate, stopping the

algorithm when this value is reached. We take into

account the number of evaluated nodes in order to ﬁx

the actual efﬁciency achieved by FEP on each exper-

iment. Our experiments show that a best case com-

plexity of Ω(N

) can be reached.

4 EXPERIMENTS ON HUMAN

LANDMARK LABELING

We have tested the performance of the FEP algo-

rithm by carrying out different types of experiments.

Firstly, we have learned human-body models using

static and motion information in order to evaluate the

ﬁtted models. Secondly, we have tested the FEP per-

LABELING HUMAN MOTION SEQUENCES USING GRAPHICAL MODELS

491

formancein order to discriminate among several mod-

els the best one representing a given set of sample

points. In all the cases, we assume that the measure

vector of the nodes follows a multivariate Gaussian

distribution, which substantially simpliﬁes the evalu-

ation of the score measure given by equation (2) (see

(Song et al., 2003)).

4.1 Database Description

We have used two different databases. The Caltech‘s

Database (courtesy of C. Fanti), provides 3D infor-

mation on a set of human-body landmarks in mo-

tion (Fanti et al., 2003). This database contains 3500

samples containing 3D information (position and ve-

locity) of 14 ﬁxed landmarks on a walking human

body: head (H), neck (N), shoulders (LS,RS), el-

bows (LE,RE), wrists (LW,RW), hips (LH,RH), knees

(LK,RK) and ankles (LA,RA). Experiments from dif-

ferent points of view, 0, π/4, π/2, 3π/4 and π radi-

ans, have been conducted using the 3D information.

In order to carry out experiments with more complex

motions, we have also used some actions from the

HumanEva database (Sigal and Black, 2006). The

HumanEva database contains 4 actors performing a

set of 6 actions each one in 3 separate trials. Here

we focus on three of these actions: Box, Gesture and

ThrowCatch. This database provides images from

seven points of view: frontal view (camera C1), lat-

eral views (C2 and C3), and 4 diagonal views (BW1,

BW2, BW3 and BW4).

4.2 Labeling Experiments

The DTG models used in our experiments have been

learned using the algorithm proposed in (Song et al.,

2001). The labeling of a sample is considered cor-

rect if its cost is lower than the true cost assumed as

known (only for the test samples). This criterion un-

fortunately does not guarantee that the ﬁtted labeling

is equal to the true one. The ambiguity deﬁned by

the relative location of the feet in a walking person

seen from the side is impossible to solve by using this

model, and this conﬁguration will have an equal or

lower cost than the true labeling. Also background

points can be selected as part of the best labeling.

4.2.1 Caltech‘s Database

We have used 2500 image samples for learning the

DTG models and 1000 for the labeling experiments.

We learn two different types of models: a) static mod-

els, using only the projections of the 3D positions;

b) motion models, using both the position and veloc-

ity projections. Some of the ﬁtted models from this

(b)

(c)

(d)

(e)

(f)

(a)

Figure 5: FEP working over two samples with 34 points

(14+20): the ﬁrst sample (a, b, c) shows a π/2 radians point

of view where selected and expected points are coincident;

the second sample (d, e, f) shows a 0 radians point of view

where the points of both legs are exchanged; the remain-

ing body points are ﬁtted correctly. All the points in each

sample are shown in (a) and (d); the points selected by FEP

are shown in (b) and (e); ﬁnally, (c) and (f) show the ﬁtted

DTG. Green points: expected and selected labels are coinci-

dents; red diamonds: selected labels are not coincident with

expected labels; blue squares: expected labels that are not

coincident with selected labels.

database are shown in Figure 4 (a)-(c). Once the mod-

els have been learned we use them to label the remain-

ing 1000 samples.

To compare the FEP and DP robustness under

added noise, we run experiments with 14, 20 and 40

random added points over the original 14 points using

the learned static and motion models. In Figure 6 (a)

and (b), the results of both experiments are shown. It

can be observed that in general FEP outperforms DP.

Moreover, the percentage of samples with a labeling

cost equal or lower than the original model is almost

100% for the FEP algorithm in all cases. Only for the

samples in 3π/4 angle do we have 97.8%. Moreover,

in most of the experiments, FEP has O(N

) efﬁciency;

only for angles π/4 and 3π/4 does it reaches O(N

)

in certain tests with added noisy points (see Figure 7).

In ﬁgure 6 (c) and (d) a comparison on the timing efﬁ-

ciency of both techniques, FEP and DP, using the two

models is shown. In all the cases, FEP outperforms

DP.

It is not an easy task to establish a relationship

between the set of graphs evaluated by FEP and DP

respectively. The main difﬁculty is in the way the

graph is built. DP starts from the ﬁrst vertex accord-

ing to vertex elimination order, while FEP starts from

the base triangle, so that the ﬁrst vertex in DP is the

last added vertex in FEP. In order to assess the qual-

ity of the ﬁtted graph given by DP and FEP, we run

experiments on the Caltech’s database counting the

selected number of correct nodes by each algorithm.

Figure 8 shows the obtained results using the mo-

tion model where both techniques provide equivalent

performance. Results are very similar for the static

model, so only motion model is shown in Figure 8.

VISAPP 2009 - International Conference on Computer Vision Theory and Applications

492

FEPvsDP FEPvsDP

0 10 20 30 40

0.2

0.4

0.6

0.8

Noisy Points

Correct assignment rate

Motion Model

DP 0

DP pi

DP pi/2

DP 3*pi/4

DP pi/4

FEP 0

FEP pi

FEP pi/2

FEP 3*pi/4

FEP pi/4

0 10 20 30 40

0.2

0.4

0.6

0.8

Noisy Points

Correct assignment rate

Static Model

DP 0

DP pi

DP pi/2

DP 3*pi/4

DP pi/4

FEP 0

FEP pi

FEP pi/2

FEP 3*pi/4

FEP pi/4

(a) (b)

0 10 20 30 40

0.5

1.5

2.5

3.5

x 10

Noisy Points

Time per frame (ms)

Motion Model

DP 0

DP pi

DP pi/2

DP 3*pi/4

DP pi/4

FEP 0

FEP pi

FEP pi/2

FEP 3*pi/4

FEP pi/4

0 10 20 30 40

0.5

1.5

2.5

3.5

4.5

x 10

Noisy Points

Time per frame (ms)

Static Model

DP 0

DP pi

DP pi/2

DP 3*pi/4

DP pi/4

FEP 0

FEP pi

FEP pi/2

FEP 3*pi/4

FEP pi/4

Figure 6: FEP versus DP: (a) and (b) performance measured in ﬁtting cost, (c) and (d) timing per frame of FEP and DP

respectively for two models, several angles and different number of noisy points added to the 14 labels used in our models

(Figure 4).

0 14 20 40

O(N)

O(N^2)

O(N^3)

Noisy Points

Efficiency

Motion Model

pi/4

pi/2

3pi/4

4pi/4

(a)

0 14 20 40

O(N)

O(N^2)

O(N^3)

Noisy Points

Efficiency

Static Model

pi/4

pi/2

3pi/4

4pi/4

(b)

Figure 7: FEP computational complexity: (a) using motion

information; (b) without using motion information. Mea-

sured on tests shown in Figure 6.

4.2.2 HumanEva

For this database the same fourteen landmarks have

been used. Between 6000 and 8000 image samples

of the trial-3 of each action have been used for learn-

ing the DTG models. The only information used dur-

ing the learning process is the projection of the true

3D coordinates of the selected landmark points on

each actor. The 3D coordinates of the landmarks have

been calculated from the information provided with

0 10 20 30 40

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Noisy Points

Correct assignment rate

Motion Model

DP 0

DP pi

DP pi/2

DP 3*pi/4

DP pi/4

FEP 0

FEP pi

FEP pi/2

FEP 3*pi/4

FEP pi/4

Figure 8: FEPvsDP: Ratio of correct labels using motion

information.

the database. Models for the points of view associated

to cameras C1, C2 and C3 have been learned. Figure

4(d)-(f) shows the learned frontal models for the cam-

era C1. For the testing stage, about 1000 new image

samples from trial-1, for each one of the actions and

cameras, have been used. In this case the data sam-

ples have been generated projecting the full set of 3D

landmark points provided by HumanEva(around 190)

and selecting those projections inside a bounding box

centered on each landmark projection. In average we

obtained between 70 and 100 noisy points per sam-

ple, always including the fourteen landmarks of our

model. Figure 9 (a) shows labeling examples on data

LABELING HUMAN MOTION SEQUENCES USING GRAPHICAL MODELS

493

(b)

(c)

(a)

Figure 9: FEP working on HumanEva samples: (a) Box; (b) Gesture; (c) ThrowCatch. In all cases 14 landmarks and about

70-100 noisy points are used. See Figure 5 for color explanation.

1 2 3 4 5 6 7 8 9 10

0.2

0.4

0.6

0.8

Accepted distance error in percentage

Correct assignment rate

Ratio of correct assignment labels

2 4 6 8 10 12

0.4

0.5

0.6

0.7

0.8

0.9

Block size

Correct assignment rate

Selection of Right Camera

Box_C1

Box_C2

Box_C3

Gestures_C1

Gestures_C2

Gestures_C3

ThrowCatch_C1

ThrowCatch_C2

ThrowCatch_C3

2 4 6 8 10 12

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Block size

Correct assignment rate

Selection of Right Action

(a) (b) (c)

Figure 10: Results on HumanEva database: 9 different models combining actions and points of view are evaluated. In all

cases the actor-1 was selected.(a) Ratio of correct assigned labels according to the distance to its ground-truth landmark.(b)

Ratio of correct detected camera. Each sequence is evaluated with the 3 models associated to its own point of view against

the remaining 6 models. Consecutive windows from 1 to 12 frames are classiﬁed on each decision. (c) Ratio of correct

detected actions: the evaluation is performed using the 3 models associated to each action against the remaining 6 models.

from HumanEva using FEP.

Figure 10 (a) shows the FEP performance in se-

lecting the ground-truth labels on each frame. In this

experiment a labeled point is considered correct if its

distance to the ground-truth landmark is less than a

threshold in the range [1-10]% of the actor height.

The best ratio is reached for the Gestures action.

4.3 Point of View and Action

Discrimination

Regarding the application of DTG model in human-

body tracking or motion capture applications, an im-

portant point is to select on each instant the best

model according to the camera point of view. Experi-

ments on HumanEva sequences have been conducted

to evaluate the FEP capability on this task. Figure 10

(b) shows the FEP performance recognizing the right

point of view model. It is remarkable that using a

so simple scheme our results are very encouraging.

In the worst case, the ratio of successful detection is

greater than 80%. We have also conducted experi-

ments on action recognition. Figure 10 (c) shows the

capability of FEP to discriminate one action among

several ones by choosing the model with the best ﬁt-

ting score. Obviously this problem is much more dif-

ﬁcult that the previous one due to the high motion

variability present on each action.In this experiment

our aim is to recognize the action which is performed

in blocks of different number of frames (ranging from

1 to 12 frames). We evaluate different DTG models

on each frame in the target block. A majority vote

scheme is ﬁnally used to assign an action class to

each block. In this task, most of the test are above

70% of success and only for Gestures we have a score

of 30%. This is well explained because the Gesture

movements are so general that most of them are in-

cluded as part of the other actions.

5 CONCLUSIONS

We have proposed a new algorithm (FEP), based on

DTG models, which is able to ﬁnd an optimal label-

ing on human motion data. Experiments have been

carried out using two different databases: Caltech and

HumanEva. It has been shown that on the Caltech‘s

database the FEP algorithm is superior to the tradi-

tional DP algorithm both in terms of efﬁciency and

quality of the found solution, in spite of its simplic-

ity. The proposed approach could be generalized to

deal with graphical models of higher complexity. This

VISAPP 2009 - International Conference on Computer Vision Theory and Applications

494

point is the subject of forthcoming research.

ACKNOWLEDGEMENTS

This work was supported by the Spanish Ministry

of Science and Innovation under projects TIN2005-

01665 and Consolider Ingenio-2010 (MIPRCV).

REFERENCES

Amit, Y. and Kong, A. (1996). Graphical templates for

model registration. IEEE PAMI, 18(3):225–236.

Caelli, T. and Caetano, T. S. (2005). Graphical models

for graph matching: Approximate models and optimal

algorithms. Pattern Recognition Letters, 26(3):339–

346.

Cover, T. and Thomas, J. (1991). Elements of Information

Theory. John Wiley and Sons.

Fanti, C., Polito, M., and Perona, P. (2003). An improved

scheme for detection and labelling johanson displays.

In NIPS.

Fischler, M. A. and Elschlager, R. A. (1973). The repre-

sentation and matching of pictorial structures. IEEE

Trans. Comput., 22(1):67–92.

Fujino, T. and Fujiwara, H. (1994). A method of search

space pruning based on search state dominance. Jour-

nal of Circuits, Systems and Computers, 25(4):1–12.

Gold, S. and Rangarajan, A. (1996). A graduated assign-

ment algorithm for graph matching. IEEE Trans. Pat-

tern Anal. Mach. Intell., 18(4):377–388.

Grimson, W. E. L. (1990). Object recognition by computer:

the role of geometric constraints. MIT Press, Cam-

bridge, MA, USA.

Hansen, E. A., Zilberstein, S., and Danilchenko, V. A.

(1997). Anytime heuristic search: First results. Tech-

nical Report UM-CS-1997-050.

Haralick, R. M. and Shapiro, L. G. (1992). Computer and

Robot Vision. Addison-Wesley Longman Publishing

Co., Inc., Boston, MA, USA.

Ibaraki, T. (1977). The power of dominance relations

in branch-and-bound algorithms. Journal of ACM,

24(2):264–279.

Rangarajan, A., Chui, H., and Bookstein, F. L. (1997). The

softassign procrustes matching algorithm. In IPMI

’97: Proceedings of the 15th International Conference

on Information Processing in Medical Imaging, pages

29–42, London, UK. Springer-Verlag.

Sigal, L. and Black, M. (2006). Humaneva: Synchronized

video and motion capture dataset for evaluation of ar-

ticulated human motion. Technical report, Brown Uni-

versity, Department of Computer Science.

Song, Y., Feng, X., and Perona, P. (2000). Towards detec-

tion of human motion. IEEE Conference on Computer

Vision and Pattern Recognition (CVPR), 1:810–817.

Song, Y., Goncalves, L., and Perona, P. (2001). Learn-

ing probabilistic structure for human motion detec-

tion. IEEE Conference on Computer Vision and Pat-

tern Recognition (CVPR), 2:771–777.

Song, Y., Goncalves, L., and Perona, P. (2003). Unsuper-

vised learning of human motion. IEEE Trans. Patt.

Anal. and Mach. Intell., 25(7):1–14.

Ullman, S. (1996). High-level Vision. MIT Press.

Yu, C. and Wah, B. (1988). Learning dominance relations

in combined search problems. IEEE Transactions on

Software Engineering, 14(8):1155–1175.

LABELING HUMAN MOTION SEQUENCES USING GRAPHICAL MODELS

495