Figure 2: Virtual reality screens showing different driving contexts: left) pedestrians crossing the street, and right) a bus stopped on the roadside.
detect and track the face and eyes of the driver during several driving scenarios, allowing for further processing of a driver's visual search pattern behavior. Figure 3 shows the input from the three cameras.
2 BACKGROUND
The techniques developed by Lienhart and Maydt (Lienhart and Maydt, 2002) extend a machine-learning approach originally proposed by Viola and Jones (Viola and Jones, 2001). The rapid object detector they propose consists of a cascade of boosted classifiers. Boosting is a machine-learning meta-algorithm for supervised learning that combines many weak classifiers into a single strong one. These boosted classifiers are trained on simple, rectangular Haar-like features chosen by a learning algorithm based on AdaBoost (Freund and Schapire, 1995). Viola and Jones have successfully applied their object detection method to faces (Viola and Jones, 2004), while Cristinacce and Cootes (Cristinacce and Cootes, 2003) have used the same method to detect facial features. Lienhart and Maydt extend the work of Viola and Jones with a new set of rotated Haar-like features that can also be computed very rapidly while reducing the detector's false-alarm rate.
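As an illustration of how such a boosted cascade is typically applied, the sketch below uses OpenCV's stock Haar cascades, which are trained in this Viola-Jones/Lienhart framework; the input file name is an assumption for the example, not part of the work discussed here.

```python
import cv2

# Stock OpenCV cascade trained with the boosted Haar-feature framework.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("driver_frame.png")          # hypothetical input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Haar-like features are evaluated in constant time from an integral image,
# and cheap early cascade stages reject most windows, so the expensive
# later boosted stages run only on promising regions.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```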
In the techniques proposed by Zhu and Ji (Zhu and Ji, 2006), a trained AdaBoost face detector is employed to locate a face in a given scene, and a trained AdaBoost eye detector is then applied to the resulting face region to find the eyes. A face mesh, representing the landmark point model, is resized and imposed onto the face region as a rough estimate. Zhu and Ji then refine the model by fast phase-based displacement estimation on the Gabor coefficient vectors associated with each facial feature.
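The phase-based refinement step can be sketched as follows: a Gabor jet is sampled at the model point and at the current image point, and the displacement between them is estimated from the phase differences in a weighted least-squares sense. This is a minimal reconstruction of the general technique, not Zhu and Ji's implementation; the kernel sizes, scales, and amplitude weighting are assumptions.

```python
import numpy as np
import cv2

def gabor_jet(gray, x, y, n_orient=8, n_scale=4):
    """Complex Gabor responses at pixel (x, y), plus their wave vectors."""
    jet, waves = [], []
    for s in range(n_scale):
        lam = 4.0 * 2 ** s                       # wavelength per scale
        for o in range(n_orient):
            theta = o * np.pi / n_orient
            # Two phase-shifted kernels give the real and imaginary parts.
            k_re = cv2.getGaborKernel((31, 31), lam / 2, theta, lam, 1.0, 0)
            k_im = cv2.getGaborKernel((31, 31), lam / 2, theta, lam, 1.0,
                                      np.pi / 2)
            # Filtering the full image per coefficient is wasteful but
            # keeps the sketch short.
            re = cv2.filter2D(gray, cv2.CV_64F, k_re)[y, x]
            im = cv2.filter2D(gray, cv2.CV_64F, k_im)[y, x]
            jet.append(complex(re, im))
            k = 2 * np.pi / lam
            waves.append([k * np.cos(theta), k * np.sin(theta)])
    return np.array(jet), np.array(waves)

def phase_displacement(jet_a, jet_b, waves):
    """Least-squares displacement d minimizing sum_j w_j (dphi_j - k_j.d)^2."""
    w = np.abs(jet_a) * np.abs(jet_b)            # amplitude weighting
    dphi = np.angle(jet_a * np.conj(jet_b))      # phase differences
    A = (waves.T * w) @ waves                    # 2x2 normal equations
    b = (waves.T * w) @ dphi
    return np.linalg.solve(A, b)                 # estimated (dx, dy)
```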
To cope with varying pose scenarios, Wang et al. (Wang et al., 2006) extend the original symmetric rectangular features described by Viola and Jones into asymmetric rectangular features that represent asymmetric gray-level patterns in profile facial images.
Shape modeling methods for facial feature extraction are common in computer vision systems due to their robustness (Medioni and Kang, 2005). Active Shape Models (ASM) (Cootes et al., 1995) and Active Appearance Models (AAM) (Cootes et al., 1998) possess a high capacity for facial feature registration and extraction. This effectiveness is attributed to the flexibility of these methods, which compensates for variations in the appearance of faces from one subject to another (Ghrabieh et al., 1998). However, both ASM and AAM require the shape model to be initially registered close to the fitted solution; otherwise, both methods are prone to local minima (Cristinacce and Cootes, 2004).
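The shape model underlying ASM is a point distribution model: training landmark configurations are aligned, and PCA captures their principal modes of variation. A minimal sketch, assuming a matrix of already Procrustes-aligned training shapes:

```python
import numpy as np

def train_shape_model(shapes, var_keep=0.98):
    """PCA point distribution model from aligned landmark shapes.

    shapes: (n_samples, 2 * n_points) array of Procrustes-aligned
    landmark coordinates (alignment is assumed to be done already).
    """
    mean = shapes.mean(axis=0)
    cov = np.cov(shapes - mean, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)           # eigenvalues ascending
    evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending
    t = int(np.searchsorted(np.cumsum(evals) / evals.sum(), var_keep)) + 1
    return mean, evecs[:, :t], evals[:t]

def synthesize(mean, modes, evals, b):
    """x = mean + P b, with b limited to +-3 sqrt(lambda_i) as in ASM."""
    b = np.clip(b, -3 * np.sqrt(evals), 3 * np.sqrt(evals))
    return mean + modes @ b
```

Fitting then alternates between suggesting landmark moves from local image evidence and projecting the result back into this constrained shape space, which is why a poor initialization can trap the search in a local minimum.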
Cristinacce and Cootes (Cristinacce and Cootes, 2006) use an appearance model similar to that used in AAM, but rather than approximating pixel intensities directly, the model generates feature templates via their proposed Constrained Local Model (CLM) approach. Kanaujia et al. (Kanaujia et al., 2006) employ a shape model based on Non-negative Matrix Factorization (NMF), as opposed to the Principal Component Analysis (PCA) traditionally used in ASM methods. NMF models larger variations of facial expression and improves the alignment of the model onto the corresponding facial features.
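For comparison, the same shape matrix can be factorized with NMF, sketched here with scikit-learn as an assumed tool: because image-space landmark coordinates are non-negative, the factorization yields additive, parts-based basis shapes rather than the signed modes of PCA.

```python
from sklearn.decomposition import NMF

# `shapes` is the non-negative (n_samples, 2 * n_points) landmark matrix
# from the sketch above; n_components is an illustrative choice.
nmf = NMF(n_components=10, init="nndsvd", max_iter=500)
weights = nmf.fit_transform(shapes)   # non-negative per-sample activations
basis = nmf.components_               # non-negative basis shapes
```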
Since large head rotations are difficult for PCA and NMF to model, Kanaujia et al. additionally use multi-class discriminative classifiers to detect head pose from local face descriptors based on the Scale Invariant Feature Transform (SIFT) (Lowe, 1999). SIFT is typically used for facial feature point extraction on a given face image: it extracts features that are invariant to scaling, rotation, and translation, and largely robust to illumination changes and affine transformations, the common difficulties of object recognition.
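A minimal example of extracting such descriptors with OpenCV (the file name is assumed):

```python
import cv2

gray = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE)   # hypothetical image
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
# Each 128-dimensional descriptor is a grid of local gradient-orientation
# histograms, computed at the keypoint's own scale and dominant
# orientation, which is what provides the invariance properties above.
```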
3 TECHNIQUE DESCRIPTION
Our approach makes use of several techniques for processing input sequences of drivers following given scenarios in the simulator. Such techniques have been used successfully on their own (Lienhart and Maydt, 2002; Lowe, 1999) and as part of a more extended framework (Kanaujia et al., 2006), producing acceptable face and facial feature detections at good success rates. Each technique used in our framework is treated as a module, and these modules are classified into two major groups: detectors and trackers. Detectors localize the facial regions automatically and lay down a base image to be used for tracking by other modules; one possible shape for this split is sketched below.
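The interfaces and names in this sketch are illustrative assumptions, not the implementation used in this work.

```python
from abc import ABC, abstractmethod

class Detector(ABC):
    """Localizes a facial region and lays down a base image for trackers."""
    @abstractmethod
    def detect(self, frame):
        ...

class Tracker(ABC):
    """Follows a facial region across frames, starting from a base image."""
    @abstractmethod
    def track(self, frame, base_image):
        ...
```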
A base image is a visual capture of a particular facial