Figure 2: Virtual reality screens showing different driving contexts: left) pedestrians crossing the street, and right) a bus stopped on the roadside.
detect and track the face and eyes of the driver during several driving scenarios, allowing for further processing of a driver's visual search pattern behavior. Figure 3 shows the input from the three cameras.
2 BACKGROUND
The techniques developed by Lienhart and Maydt (Lienhart and Maydt, 2002) extend a machine-learning approach originally proposed by Viola and Jones (Viola and Jones, 2001). The rapid object detector they propose consists of a cascade of boosted classifiers. Boosting is a machine-learning meta-algorithm for supervised learning that combines many weak classifiers into a single strong one. These boosted classifiers are trained on simple, rectangular Haar-like features chosen by a learning algorithm based on AdaBoost (Freund and Schapire, 1995). Viola and Jones have successfully applied their object detection method to faces (Viola and Jones, 2004), while Cristinacce and Cootes (Cristinacce and Cootes, 2003) have used the same method to detect facial features. Lienhart and Maydt extend the work of Viola and Jones with a new set of rotated Haar-like features that can also be computed very rapidly while reducing the detector's false-alarm rate.
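As an illustration of how such a boosted cascade is typically applied, the sketch below uses OpenCV's stock Haar cascades, which are trained in this Viola-Jones/Lienhart framework; the input file name is an assumption for the example, not part of the work discussed here.

```python
import cv2

# Stock OpenCV cascade trained with the boosted Haar-feature framework.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("driver_frame.png")          # hypothetical input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Haar-like features are evaluated in constant time from an integral image,
# and cheap early cascade stages reject most windows, so the expensive
# later boosted stages run only on promising regions.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```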
In the techniques proposed by Zhu and Ji (Zhu and Ji, 2006), a trained AdaBoost face detector is employed to locate a face in a given scene, and a trained AdaBoost eye detector is then applied to the resulting face region to find the eyes. A face mesh, representing the landmark point model, is resized and imposed onto the face region as a rough estimate. Zhu and Ji then refine the model by fast phase-based displacement estimation on the Gabor coefficient vectors associated with each facial feature.
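The phase-based refinement step can be sketched as follows: a Gabor jet is sampled at the model point and at the current image point, and the displacement between them is estimated from the phase differences in a weighted least-squares sense. This is a minimal reconstruction of the general technique, not Zhu and Ji's implementation; the kernel sizes, scales, and amplitude weighting are assumptions.

```python
import numpy as np
import cv2

def gabor_jet(gray, x, y, n_orient=8, n_scale=4):
    """Complex Gabor responses at pixel (x, y), plus their wave vectors."""
    jet, waves = [], []
    for s in range(n_scale):
        lam = 4.0 * 2 ** s                       # wavelength per scale
        for o in range(n_orient):
            theta = o * np.pi / n_orient
            # Two phase-shifted kernels give the real and imaginary parts.
            k_re = cv2.getGaborKernel((31, 31), lam / 2, theta, lam, 1.0, 0)
            k_im = cv2.getGaborKernel((31, 31), lam / 2, theta, lam, 1.0,
                                      np.pi / 2)
            # Filtering the full image per coefficient is wasteful but
            # keeps the sketch short.
            re = cv2.filter2D(gray, cv2.CV_64F, k_re)[y, x]
            im = cv2.filter2D(gray, cv2.CV_64F, k_im)[y, x]
            jet.append(complex(re, im))
            k = 2 * np.pi / lam
            waves.append([k * np.cos(theta), k * np.sin(theta)])
    return np.array(jet), np.array(waves)

def phase_displacement(jet_a, jet_b, waves):
    """Least-squares displacement d minimizing sum_j w_j (dphi_j - k_j.d)^2."""
    w = np.abs(jet_a) * np.abs(jet_b)            # amplitude weighting
    dphi = np.angle(jet_a * np.conj(jet_b))      # phase differences
    A = (waves.T * w) @ waves                    # 2x2 normal equations
    b = (waves.T * w) @ dphi
    return np.linalg.solve(A, b)                 # estimated (dx, dy)
```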
To cope with varying pose scenarios, Wang et al. (Wang et al., 2006) extend the original symmetric rectangular features described by Viola and Jones into asymmetric rectangular features that represent asymmetric gray-level patterns in profile facial images.
Shape modeling methods for facial feature extraction are common in computer vision systems due to their robustness (Medioni and Kang, 2005). Active Shape Models (ASM) (Cootes et al., 1995) and Active Appearance Models (AAM) (Cootes et al., 1998) possess a high capacity for facial feature registration and extraction. This effectiveness is attributed to the flexibility of these methods, which compensates for variations in the appearance of faces from one subject to another (Ghrabieh et al., 1998). However, both ASM and AAM require the shape model to be initially registered close to the fitted solution; otherwise, both methods are prone to local minima (Cristinacce and Cootes, 2004).
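The shape model underlying ASM is a point distribution model: training landmark configurations are aligned, and PCA captures their principal modes of variation. A minimal sketch, assuming a matrix of already Procrustes-aligned training shapes:

```python
import numpy as np

def train_shape_model(shapes, var_keep=0.98):
    """PCA point distribution model from aligned landmark shapes.

    shapes: (n_samples, 2 * n_points) array of Procrustes-aligned
    landmark coordinates (alignment is assumed to be done already).
    """
    mean = shapes.mean(axis=0)
    cov = np.cov(shapes - mean, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)           # eigenvalues ascending
    evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending
    t = int(np.searchsorted(np.cumsum(evals) / evals.sum(), var_keep)) + 1
    return mean, evecs[:, :t], evals[:t]

def synthesize(mean, modes, evals, b):
    """x = mean + P b, with b limited to +-3 sqrt(lambda_i) as in ASM."""
    b = np.clip(b, -3 * np.sqrt(evals), 3 * np.sqrt(evals))
    return mean + modes @ b
```

Fitting then alternates between suggesting landmark moves from local image evidence and projecting the result back into this constrained shape space, which is why a poor initialization can trap the search in a local minimum.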
Cristinacce and Cootes (Cristinacce and Cootes, 2006) use an appearance model similar to that used in AAM, but rather than approximating pixel intensities directly, the model generates feature templates via their proposed Constrained Local Model (CLM) approach. Kanaujia et al. (Kanaujia et al., 2006) employ a shape model based on Non-negative Matrix Factorization (NMF), as opposed to the Principal Component Analysis (PCA) traditionally used in ASM methods. NMF models larger variations of facial expression and improves the alignment of the model onto the corresponding facial features.
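For comparison, the same shape matrix can be factorized with NMF, sketched here with scikit-learn as an assumed tool: because image-space landmark coordinates are non-negative, the factorization yields additive, parts-based basis shapes rather than the signed modes of PCA.

```python
from sklearn.decomposition import NMF

# `shapes` is the non-negative (n_samples, 2 * n_points) landmark matrix
# from the sketch above; n_components is an illustrative choice.
nmf = NMF(n_components=10, init="nndsvd", max_iter=500)
weights = nmf.fit_transform(shapes)   # non-negative per-sample activations
basis = nmf.components_               # non-negative basis shapes
```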
Since large head rotations are difficult for PCA and NMF to model, Kanaujia et al. additionally use multi-class discriminative classifiers to detect head pose from local face descriptors based on the Scale Invariant Feature Transform (SIFT) (Lowe, 1999). SIFT is typically used for facial feature point extraction on a given face image: it extracts features that are invariant to scaling, rotation, and translation, and largely robust to illumination changes and affine transformations, the common difficulties of object recognition.
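A minimal example of extracting such descriptors with OpenCV (the file name is assumed):

```python
import cv2

gray = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE)   # hypothetical image
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
# Each 128-dimensional descriptor is a grid of local gradient-orientation
# histograms, computed at the keypoint's own scale and dominant
# orientation, which is what provides the invariance properties above.
```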
3 TECHNIQUE DESCRIPTION
Our approach makes use of several techniques for processing input sequences of drivers following given scenarios in the simulator. Such techniques have been used successfully on their own (Lienhart and Maydt, 2002; Lowe, 1999) and as part of a more extended framework (Kanaujia et al., 2006), producing acceptable face and facial feature detections at good success rates. Each technique used in our framework is treated as a module, and these modules are classified into two major groups: detectors and trackers. Detectors localize the facial regions automatically and lay down a base image to be used for tracking by other modules; one possible shape for this split is sketched below.
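The interfaces and names in this sketch are illustrative assumptions, not the implementation used in this work.

```python
from abc import ABC, abstractmethod

class Detector(ABC):
    """Localizes a facial region and lays down a base image for trackers."""
    @abstractmethod
    def detect(self, frame):
        ...

class Tracker(ABC):
    """Follows a facial region across frames, starting from a base image."""
    @abstractmethod
    def track(self, frame, base_image):
        ...
```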
A base image is a visual capture of a particular facial