[Figure 6 plot: classification error (y-axis) vs. number of neurons in the hidden layer (x-axis).]
Figure 6: Classification error of the NN classifier for various parameter configurations, during validation stage.
[Figure 7 plot: classification error (y-axis) vs. dimension count (x-axis), for linear, polynomial, radial, and sigmoid kernels.]
Figure 7: Classification error of the SVM classifier for various parameter configurations, during validation stage.
nodes. Although it is impossible to assess the role of each neuron, and thus to provide a solid explanation for the correlation between the number of neurons and the performance of the network, one can argue that the optimal size of the hidden layer is influenced by the number of relevant features in the data, similarly to the minimum number of dimensions that yields reasonably good results. Should that be the case, the activation of each neuron would be more heavily influenced by one of these implicit relevant features.
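To illustrate this validation step, the following minimal sketch sweeps the hidden layer size over the range shown in Figure 6 and records the validation error. It assumes scikit-learn's MLPClassifier and hypothetical arrays X_train, y_train, X_val, y_val holding HoG features and labels; the paper does not specify its actual implementation.

# Sketch: sweep the hidden layer size and record the validation error.
# X_train, y_train, X_val, y_val are hypothetical HoG feature arrays
# and label vectors, not taken from the paper.
from sklearn.neural_network import MLPClassifier

def validate_hidden_sizes(X_train, y_train, X_val, y_val,
                          sizes=range(10, 101, 10)):
    errors = {}
    for n in sizes:
        clf = MLPClassifier(hidden_layer_sizes=(n,),
                            max_iter=500, random_state=0)
        clf.fit(X_train, y_train)
        # Classification error = 1 - accuracy on the validation set.
        errors[n] = 1.0 - clf.score(X_val, y_val)
    return errors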
SVM Validation. For the SVM classifier, our initial
intention was to also employ dimensionality reduc-
tion on the features, to obtain faster training times.
However, after assessing the performance for various
dimensions, as shown in Figure 7, and considering
manageable training durations, we decided to use all
2268 HoG dimensions for the SVM classifier.
The plot from Figure 7 shows the evolution of the SVM classification error for various dimensions and several kernel functions; the best performing kernel proved to be the linear one.
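The sketch below reproduces the shape of this experiment under the same assumptions as above: PCA reduces the 2268-dimensional HoG features to each candidate dimension, and an SVM is validated with each kernel ('rbf' standing in for the radial kernel). scikit-learn is assumed; the original implementation is not specified.

# Sketch: validation error of the SVM for several kernels and
# PCA-reduced dimensions of the 2268-dimensional HoG features.
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def validate_svm(X_train, y_train, X_val, y_val,
                 dims=(100, 500, 1000, 1500, 2000, 2268),
                 kernels=('linear', 'poly', 'rbf', 'sigmoid')):
    errors = {}
    for d in dims:
        if d < X_train.shape[1]:
            pca = PCA(n_components=d).fit(X_train)
            Xt, Xv = pca.transform(X_train), pca.transform(X_val)
        else:
            Xt, Xv = X_train, X_val  # keep all HoG dimensions
        for k in kernels:
            clf = SVC(kernel=k).fit(Xt, y_train)
            errors[(d, k)] = 1.0 - clf.score(Xv, y_val)
    return errors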
4.1 Dataset Description
During the training of the classifiers we used several datasets, in order to cover a greater variety of appearances. This, in turn, helps the classifiers generalize beyond the training data and better exploit the existing patterns. Some characteristics of the datasets used during training are given in Table 1.
For testing the proposed method, we used video sequences from the Collective Activity dataset (Choi et al., 2011), which depict multiple human targets moving freely in an urban environment. Ground truth annotations are available once every 10 frames.
4.2 Results and Discussion
The results of the experiments to evaluate the perfor-
mance of our method are presented in Table 2.
Overall, the performances of the individual classifiers vary to some extent. This variation is what allows a combination of classifiers to yield better results. A certain dependence on the video sequence can also be observed, as all the classifiers obtained better results on Seq 42 than on Seq 15. Since these classifiers take into consideration only the visual appearance of the targets, modeled by the HoG descriptors, the most plausible explanation for this behaviour is that the targets from Seq 42 more closely resemble the targets used for training the classifiers. This visual resemblance can in turn be explained by a closer similarity of the camera angle at which the images were captured, as well as a similarity of the image resolution.
The error obtained by combining the responses of multiple classifiers proved to be lower than that of the individual responses. Thus, in the case of Seq 42, all the combined responses yielded better results than the individual ones. As expected, the combinations including the more robust classifiers, such as GMM+NN, outperform those including the lower performing ones, such as GMM+SVM. In the case of Seq 15, the markedly poorer result of the SVM classifier has a detrimental impact on the combined responses. Thus, only the GMM+NN combination performs better than any of its components, all the others being roughly similar to or even worse than the individual components.
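One plausible fusion rule for such combinations, averaging the normalized per-class scores of the individual classifiers, is sketched below; the paper's exact combination scheme is not restated here, and all names are hypothetical.

import numpy as np

# Sketch: fuse per-class scores from several classifiers by averaging.
# `scores` maps a classifier name to an array of shape
# (n_samples, n_classes) holding normalized class scores.
def combine_responses(scores):
    fused = np.mean(np.stack(list(scores.values())), axis=0)
    return np.argmax(fused, axis=1)  # predicted class per sample

# Hypothetical usage for the GMM+NN combination:
# labels = combine_responses({'gmm': gmm_scores, 'nn': nn_scores})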
The second goal of these experiments was to assess the impact of the individual cues considered. The performance of the method when only the velocity cue is used proves to be better than the response of any of the individual or combined HoG-based classifiers on the considered video sequences, Seq 15 and Seq 42, thus highlighting the importance of this additional cue. However, one might expect that for video sequences in which the targets are mostly stationary, the velocity cue would provide less information and thus yield poorer results. The next configuration tested was the combination of the response of