come this problem in appearance-based approaches for face recognition. The first is to use features which are invariant to these deformations, e.g. invariant to changes of the facial pose relative to the camera. Another strategy is to use or synthetically generate a large and representative training set. A third approach is to separate the factors which code the identity of a person from other sources of variation, such as pose changes. This third approach is the one addressed in this paper. The posture of the head in front of the camera is estimated from monocular images. The additional pose information may be utilized to register the facial images very precisely and thereby make it possible to perform face recognition using a pose-normalized representation of faces.
A survey of head pose estimation in computer vision was recently published by Murphy-Chutorian and Trivedi (2009). We use active appearance models (AAMs) to detect facial features in the images. Subsequently, the head pose is determined from a subset of the localized facial features using an analytical algorithm (DeMenthon and Davis, 1995; Martins and Batista, 2008). The algorithm can estimate the pose from a single image using four or more non-coplanar facial feature positions and their known relative geometry. Using three-dimensional model points from the generic Candide-3 face model (Ahlberg, 2001; Dornaika and Ahlberg, 2006) and their image correspondences estimated by the active appearance model, the posture of the head in front of the camera can be recovered.
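To make this step concrete, the following sketch recovers the head rotation and translation from four 2D-3D correspondences, assuming OpenCV and NumPy are available. The paper employs the analytical algorithm cited above; cv2.solvePnP with the EPnP solver is used here as a stand-in with the same minimal input (four or more non-coplanar points), and the model coordinates are illustrative placeholders rather than actual Candide-3 vertices.

```python
import numpy as np
import cv2

# 3D feature positions in a generic head model frame (placeholders,
# not actual Candide-3 vertices), in model units.
model_points = np.array([
    [  0.0,   0.0,   0.0],   # nose tip
    [-30.0,  35.0, -30.0],   # left eye corner
    [ 30.0,  35.0, -30.0],   # right eye corner
    [  0.0, -45.0, -20.0],   # chin
])

# Corresponding 2D landmarks located by the AAM fit (pixels, illustrative).
image_points = np.array([
    [320.0, 240.0],
    [280.0, 200.0],
    [360.0, 200.0],
    [322.0, 300.0],
])

# Pinhole camera with an assumed focal length and principal point.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

ok, rvec, tvec = cv2.solvePnP(model_points, image_points, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)   # 3x3 head rotation relative to the camera
# R and tvec give the head pose; Euler angles follow from decomposing R.
```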
Active appearance models are a common approach to building parametric statistical models of facial appearance (Cootes et al., 2001; Stegmann et al., 2000). The desired field of application requires the algorithm to work with many different faces, including faces not seen during the training stage (Gross et al., 2005). We use simultaneous optimization of pose and texture parameters and formulate the fitting algorithm using a smooth warping function (Bookstein, 1989). The thin plate spline warping function is parametrized efficiently to gain computational advantages. A special focus is on the evaluation of the generalization performance of the model fitting algorithm.
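A minimal sketch of such a thin plate spline warp follows, assuming SciPy (>= 1.7) is available; the landmark coordinates are made up, and it shows only the generic warping function of Bookstein (1989), not the paper's efficient parametrization.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Source landmarks and their displaced target positions (illustrative).
src = np.array([[10., 10.], [90., 10.], [50., 50.], [10., 90.], [90., 90.]])
dst = np.array([[12., 14.], [88., 11.], [55., 48.], [ 9., 92.], [91., 87.]])

# Thin plate spline map taking source coordinates to target coordinates.
warp = RBFInterpolator(src, dst, kernel='thin_plate_spline')

# Evaluate the smooth deformation on a grid of pixel positions.
ys, xs = np.mgrid[0:100:10, 0:100:10]
grid = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
warped = warp(grid)   # (N, 2) warped coordinates
```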
In Section 2 we introduce statistical models of facial appearance. Section 3 describes the smooth warping function. The pose estimation algorithm is explained in Section 4. Experimental results and a discussion are given in Section 5; Section 6 concludes the paper.
2 STATISTICAL MODELS OF FACIAL APPEARANCE
We parametrize a dense representation of facial appearance using separate linear models for shape and texture (Matthews and Baker, 2004). The means and modes of variation of both models are statistically learned from a training set.
2.1 Facial Model
Shape information is represented by an ordered set of $l$ landmarks $x_i$, $i = 1, \ldots, l$. These landmarks describe the planar facial shape of an individual in a digital image. The landmarks are generally placed on the boundary of prominent face components (Figure 2a). The two-dimensional landmark coordinates are arranged in a shape matrix (Matthews et al., 2007)

$$s = \begin{bmatrix} x_1 & x_2 & \cdots & x_l \end{bmatrix}^\top, \qquad s \in \mathbb{R}^{l \times 2}. \tag{1}$$
Active appearance models express an instance $s_p$ of a particular shape as the mean shape $s_0$ plus a linear combination of $n$ eigenshapes $s_i$, i.e.

$$s_p = s_0 + \sum_{i=1}^{n} p_i s_i. \tag{2}$$
The coefficients $p_i$ constitute the shape parameter vector $p = (p_1, \ldots, p_n)^\top$. The mean shape $s_0$ and shape variations $s_i$ are statistically learned using a training set of annotated images (Figure 2a). Since reliable pupil positions are available (Zhao and Grigat, 2006), the training images can be aligned with respect to the pupils in a common coordinate system $I \subset \mathbb{R}^2$ using a similarity transformation. The images are rotated, scaled and translated using a two-dimensional similarity transform such that the pupils in all images fall on the same two positions (Figure 2b). The mean shape $s_0$ and the basis of shape variations $s_i$ are obtained by applying principal component analysis (PCA) to the shapes of the aligned training images (Cootes et al., 2001).
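As a concrete illustration, the following minimal sketch aligns training shapes to the pupils, learns $s_0$ and the eigenshapes $s_i$ by PCA, and synthesizes an instance via Eq. (2). It assumes NumPy; the function and variable names are ours, not the paper's.

```python
import numpy as np

def align_to_pupils(shape, left, right, ref_left, ref_right):
    """Similarity transform (rotation, scale, translation) mapping the
    pupil pair (left, right) onto the fixed reference pair."""
    src, dst = right - left, ref_right - ref_left
    scale = np.linalg.norm(dst) / np.linalg.norm(src)
    angle = np.arctan2(dst[1], dst[0]) - np.arctan2(src[1], src[0])
    c, s = np.cos(angle), np.sin(angle)
    R = scale * np.array([[c, -s], [s, c]])
    return (shape - left) @ R.T + ref_left

def learn_shape_model(aligned, n):
    """aligned: (N, l, 2) pupil-aligned training shapes. PCA via SVD
    yields the mean shape s_0 and the first n eigenshapes s_i."""
    N, l, _ = aligned.shape
    X = aligned.reshape(N, -1)            # flatten each shape
    s0 = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - s0, full_matrices=False)
    return s0.reshape(l, 2), Vt[:n].reshape(n, l, 2)

def shape_instance(s0, S, p):
    """Eq. (2): s_p = s_0 + sum_i p_i s_i, with p the (n,) parameter vector."""
    return s0 + np.tensordot(p, S, axes=1)
```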
The texture part of the appearance is also modeled using an affine linear model of variation. Texture is defined as the intensities of a face at a discrete set $A_0$ of positions $x$ in a shape-normalized space $A \subset \mathbb{R}^2$. The texture of a face is vectorized by raster-scanning its intensities into a vector. Similar to the shape, $\lambda = (\lambda_1, \ldots, \lambda_m)^\top$ denotes the vector of texture parameters describing a texture instance

$$a_\lambda = a_0 + \sum_{i=1}^{m} \lambda_i a_i. \tag{3}$$
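The texture instance of Eq. (3) has the same affine form as the shape model. A minimal sketch, assuming $a_0$ and the basis $a_i$ were obtained by PCA on raster-scanned, shape-normalized training textures (names are ours):

```python
import numpy as np

def texture_instance(a0, A, lam):
    """Eq. (3): a_lambda = a_0 + sum_i lambda_i a_i.
    a0: (d,) mean texture; A: (m, d) texture basis; lam: (m,) parameters."""
    return a0 + lam @ A

# Raster-scanning a sampled texture patch into a vector:
patch = np.zeros((32, 32))   # intensities at the positions in A_0 (illustrative)
a = patch.ravel()            # texture vector of length d = 1024
```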