Depth-based approaches, however, rely only on
depth data to estimate the 3D face pose. Weise et
al. (Weise et al., 2011) use Iterative Closest Point
(ICP) with point-plane constraints and a temporal fil-
ter to track the head pose in each frame. In (Fanelli
et al., 2011), Fanelli et al. present a system for es-
timating the orientation of the head from depth data
only. This approach is based on discriminative ran-
dom regression forests, given their capability to han-
dle large training datasets. Another approach is pre-
sented in (Breitenstein et al., 2008) where a shape sig-
nature is first used to identify the nose tip in depth im-
ages. Then, several face pose hypotheses (pinned to
the localized nose tip) are generated and evaluated to
choose the best pose among them. These methods are
very sensitive to noisy depth data, since it is difficult to distinguish the face regions when the noise level is high.
Recently, (Cai et al., 2010) have used both depth
and appearance cues for tracking the head pose in-
cluding facial deformations. The idea behind combining 2D images and depth data is to overcome the poor signal-to-noise ratio of low-cost RGB-D cameras. Their approach relies on detecting and
tracking facial features in 2D images. Assuming that
the RGB-D camera is already calibrated, one can find
the corresponding 3D coordinates of these detected
features. Finally, a generic deformable 3D face is
fitted to the obtained 3D points. Nonetheless, this
method does not handle partial occlusions of the face.
Moreover, like feature-based methods, it is sensitive
to the accuracy of feature detection and tracking al-
gorithms. In the same vein, Seemann et al. (Seemann
et al., 2004) present a face pose estimation method
based on trained neural networks that compute the head pose from grayscale images and disparity maps. Like the method of (Cai et al., 2010), it does not handle partial occlusions of the face either.
This paper presents a new Appearance-Based
method for 3D face pose tracking in sequences of im-
age and depth data. To cope with noisy depth maps
provided by RGB-D cameras, we use both depth
and image data in the observation model of the parti-
cle filter. Unlike (Cai et al., 2010), our method does
not rely on tracking 2D features in the images to esti-
mate the face pose. Instead, we use the whole visible texture of the face. In this way, the method
is less sensitive to the quality of the feature detection
and tracking in images. Moreover, our method han-
dles partial occlusions of the face by introducing a visibility constraint.
3 3D FACE POSE TRACKING METHOD
Our tracking method is based on the particle filter formalism, which is a Bayesian sequential importance
sampling technique. It recursively approximates the
posterior distribution using a set of N weighted parti-
cles (samples). In the case of 3D face pose tracking,
particles stand for 3D pose hypotheses (i.e., 3D posi-
tion and orientation of the face in the scene). For a
given frame $t$, we denote $X_t = \{x_t^i\}_{i=1}^{N}$ the set of particles, where $x_t^i \in \mathbb{R}^6$ is the $i$-th generated particle, which involves the 6 degrees of freedom (i.e., 3 translations and 3 rotation angles) of a 3D rigid transformation.
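For concreteness, a minimal sketch of this particle representation is given below, assuming each pose hypothesis is stored as a 6-vector (3 translations and 3 rotation angles); the variable names, the number of particles and the initialization strategy are illustrative choices, not prescribed by the paper.

```python
import numpy as np

# A particle is a 6-DoF pose hypothesis: 3 translations (tx, ty, tz)
# and 3 rotation angles (e.g., roll, pitch, yaw).
N = 200  # number of particles; placeholder value

def init_particles(initial_pose, n=N):
    """Initialize all particles at a rough initial pose estimate."""
    particles = np.tile(np.asarray(initial_pose, dtype=float), (n, 1))  # shape (n, 6)
    weights = np.full(n, 1.0 / n)                                       # uniform weights
    return particles, weights
```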
The general framework of the particle filter is to draw a large number $N$ of particles to populate the parameter space, each one representing a 3D face pose. The observation model allows for computing a weight $w_t^i$ for each particle $x_t^i$ according to the similarity of the particle to the reference model, using the observed data $y_t$ of frame $t$ (i.e., the color and depth images). Thus, the posterior distribution $p(x_t \mid y_t)$ is approximated by the set of weighted particles $\{x_t^i, w_t^i\}$ with $i \in \{1, \ldots, N\}$.
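As an illustration, the sketch below shows one way to use this weighted-particle approximation: particles are weighted by a generic similarity (likelihood) function over the observed color and depth data and then normalized, and the pose estimate is taken as the weighted mean. The function names and the choice of a weighted-mean estimate are assumptions made for this sketch, not details fixed by the paper.

```python
import numpy as np

def weight_particles(particles, observation, likelihood):
    """Weight each pose hypothesis by its similarity to the observed
    color/depth data (the 'likelihood' callable), then normalize."""
    w = np.array([likelihood(x, observation) for x in particles])
    return w / w.sum()  # assumes at least one particle has non-zero weight

def estimate_pose(particles, weights):
    """Point estimate of the posterior: the weighted mean of the particles."""
    return weights @ particles  # 6-vector (3 translations, 3 rotations)
```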
The transition model allows for propagating particles
between two consecutive frames. Indeed, at the current frame $t$, the particles $\hat{x}_{t-1}$ are assumed to approximate the previous posterior. To approximate the current posterior, each particle is propagated using a transition model that approximates the process function:

$x_t = \hat{x}_{t-1} + u_t$,    (1)

where $u_t$ is a random vector drawn from a normal distribution $\mathcal{N}(0, \Sigma)$, and the covariance matrix $\Sigma$ is set experimentally according to prior knowledge of the application. For instance, in Human-Computer Interaction, the head displacements between two consecutive frames are small and limited. Therefore, the entries of $\Sigma$ may be small as well.
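A minimal sketch of this transition step is given below, assuming a diagonal covariance $\Sigma$; the standard deviations are placeholder values and would be tuned to the expected inter-frame head motion.

```python
import numpy as np

def propagate(particles, sigma=(5.0, 5.0, 5.0, 2.0, 2.0, 2.0)):
    """Transition model of Eq. (1): x_t = x_{t-1} + u_t, with u_t ~ N(0, Sigma).
    Sigma is diagonal here; sigma holds the per-dimension standard deviations
    (translations then rotations), chosen as placeholder values."""
    noise = np.random.normal(0.0, sigma, size=particles.shape)
    return particles + noise
```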
The general framework of our method has been presented above. In the next sections, the reference model and the observation model, which are the main contributions of this paper, are detailed.
3.1 Reference Model
To create the reference model $M_{ref}$, we use a modified version of the Candide 3D face model (see Figure 1(a)). The original Candide model (Ahlberg, 2001) is modified in order to remove as many of the non-rigid parts as possible (e.g., the mouth and chin parts). We keep only the parts that are least affected by animation and facial expressions (see Figure 1(b)). Our reference model is defined as a set of $K$ ($K = 93$) vertices $V$