Depth-based approaches, however, rely only on
depth data to estimate the 3D face pose. Weise et
al. (Weise et al., 2011) use Iterative Closest Point
(ICP) with point-plane constraints and a temporal fil-
ter to track the head pose in each frame. In (Fanelli
et al., 2011), Fanelli et al. present a system for es-
timating the orientation of the head from depth data
only. This approach is based on discriminative ran-
dom regression forests, given their capability to han-
dle large training datasets. Another approach is pre-
sented in (Breitenstein et al., 2008) where a shape sig-
nature is first used to identify the nose tip in depth im-
ages. Then, several face pose hypotheses (pinned to
the localized nose tip) are generated and evaluated to
choose the best pose among them. These methods are
very sensitive to noisy depth data, since it is difficult to distinguish the face regions when the noise level is high.
Recently, (Cai et al., 2010) have used both depth
and appearance cues for tracking the head pose in-
cluding facial deformations. The idea behind combining 2D images and depth data is to overcome the poor signal-to-noise ratio of low-cost RGB-D cameras. Their approach relies on detecting and
tracking facial features in 2D images. Assuming that
the RGB-D camera is already calibrated, one can find
the corresponding 3D coordinates of these detected
features. Finally, a generic deformable 3D face is
fitted to the obtained 3D points. Nonetheless, this
method does not handle partial occlusions of the face.
Moreover, like feature-based methods, it is sensitive
to the accuracy of feature detection and tracking al-
gorithms. In the same vein, Seemann et al. (Seemann
et al., 2004) present a face pose estimation method
based on trained neural networks that compute the head pose from grayscale images and disparity maps. Like the method of (Cai et al., 2010), it does not handle partial occlusions of the face either.
This paper presents a new Appearance-Based
method for 3D face pose tracking in sequences of im-
age and depth data. To cope with noisy depth maps
provided by RGB-D cameras, we use both depth
and image data in the observation model of the parti-
cle filter. Unlike (Cai et al., 2010), our method does
not rely on tracking 2D features in the images to esti-
mate the face pose. Instead, we use the whole visible texture of the face. In this way, the method
is less sensitive to the quality of the feature detection
and tracking in images. Moreover, our method han-
dles partial occlusions of the face by introducing a visibility constraint.
3 3D FACE POSE TRACKING METHOD
Our tracking method is based on the particle filter formalism, which is a Bayesian sequential importance
sampling technique. It recursively approximates the
posterior distribution using a set of N weighted parti-
cles (samples). In the case of 3D face pose tracking,
particles stand for 3D pose hypotheses (i.e., 3D posi-
tion and orientation of the face in the scene). For a
given frame $t$, we denote $X_t = \{x_t^i\}_{i=1}^{N}$ the set of particles, where $x_t^i \in \mathbb{R}^6$ is the $i$-th generated particle, which involves the 6 degrees of freedom (i.e., 3 translations and 3 rotation angles) of a 3D rigid transformation.
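For concreteness, a minimal sketch of this particle representation is given below, assuming each pose hypothesis is stored as a 6-vector (3 translations and 3 rotation angles); the variable names, the number of particles and the initialization strategy are illustrative choices, not prescribed by the paper.

```python
import numpy as np

# A particle is a 6-DoF pose hypothesis: 3 translations (tx, ty, tz)
# and 3 rotation angles (e.g., roll, pitch, yaw).
N = 200  # number of particles; placeholder value

def init_particles(initial_pose, n=N):
    """Initialize all particles at a rough initial pose estimate."""
    particles = np.tile(np.asarray(initial_pose, dtype=float), (n, 1))  # shape (n, 6)
    weights = np.full(n, 1.0 / n)                                       # uniform weights
    return particles, weights
```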
The general framework of the particle filter is to draw a large number $N$ of particles to populate the parameter space, each one representing a 3D face pose. The observation model allows for computing a weight $w_t^i$ for each particle $x_t^i$ according to the similarity of the particle to the reference model, using the observed data $y_t$ of frame $t$ (i.e., the color and depth images). Thus, the posterior distribution $p(x_t \mid y_t)$ is approximated by the set of weighted particles $\{x_t^i, w_t^i\}$ with $i \in \{1, \ldots, N\}$.
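As an illustration, the sketch below shows one way to use this weighted-particle approximation: particles are weighted by a generic similarity (likelihood) function over the observed color and depth data and then normalized, and the pose estimate is taken as the weighted mean. The function names and the choice of a weighted-mean estimate are assumptions made for this sketch, not details fixed by the paper.

```python
import numpy as np

def weight_particles(particles, observation, likelihood):
    """Weight each pose hypothesis by its similarity to the observed
    color/depth data (the 'likelihood' callable), then normalize."""
    w = np.array([likelihood(x, observation) for x in particles])
    return w / w.sum()  # assumes at least one particle has non-zero weight

def estimate_pose(particles, weights):
    """Point estimate of the posterior: the weighted mean of the particles."""
    return weights @ particles  # 6-vector (3 translations, 3 rotations)
```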
The transition model allows for propagating particles
between two consecutive frames. Indeed, at the current frame $t$, the particles $\hat{x}_{t-1}$ are assumed to approximate the previous posterior. To approximate the current posterior, each particle is propagated using a transition model that approximates the process function:

$x_t = \hat{x}_{t-1} + u_t$,    (1)

where $u_t$ is a random vector drawn from a normal distribution $\mathcal{N}(0, \Sigma)$, and the covariance matrix $\Sigma$ is set experimentally according to prior knowledge of the application. For instance, in Human-Computer Interaction, the head displacements between two consecutive frames are small and limited. Therefore, the entries of $\Sigma$ may be small as well.
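A minimal sketch of this transition step is given below, assuming a diagonal covariance $\Sigma$; the standard deviations are placeholder values and would be tuned to the expected inter-frame head motion.

```python
import numpy as np

def propagate(particles, sigma=(5.0, 5.0, 5.0, 2.0, 2.0, 2.0)):
    """Transition model of Eq. (1): x_t = x_{t-1} + u_t, with u_t ~ N(0, Sigma).
    Sigma is diagonal here; sigma holds the per-dimension standard deviations
    (translations then rotations), chosen as placeholder values."""
    noise = np.random.normal(0.0, sigma, size=particles.shape)
    return particles + noise
```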
The general framework of our method has been presented above. In the next sections, the reference model and the observation model, which are the main contributions of this paper, are detailed.
3.1 Reference Model
To create the reference model $M_{ref}$, we use a modified version of the Candide 3D face model (see Figure 1(a)). The original Candide model (Ahlberg, 2001) is modified in order to remove as many of the non-rigid parts as possible (e.g., the mouth and chin parts). We keep only the parts that are least affected by animation and facial expressions (see Figure 1(b)). Our reference model is defined as a set of $K$ ($K = 93$) vertices $V$