ENHANCED PHASE–BASED DISPLACEMENT ESTIMATION

An Application to Facial Feature Extraction and Tracking

Mohamed Dahmane and Jean Meunier

DIRO, Université de Montréal, CP 6128, Succursale Centre-Ville

2920 Chemin de la tour, Montréal, Québec, Canada, H3C 3J7

Keywords:

Facial feature extraction, Facial analysis, Gabor wavelets, Tracking.

Abstract:

In this work, we develop a multi-scale approach for automatic facial feature detection and tracking. The method follows a coarse-to-fine paradigm to characterize a set of facial fiducial points using a bank of Gabor filters, which offer useful properties such as directionality, scalability and hierarchy. When the first face image is captured, a trained grid is used at the coarsest level to estimate a rough position for each facial feature. A refinement stage is then performed from the coarsest to the finest (original) image level to obtain accurate positions. These positions are then tracked over the subsequent frames using a modification of a fast phase–based technique, which includes a redefinition of the confidence measure and introduces a conditional disparity estimation procedure. Experimental results show that facial features can be localized with high accuracy and that tracking can be maintained over long periods of free head motion.

1 INTRODUCTION

The computer vision community is interested in developing techniques that capture the main elements of human facial communication, in particular for HCI applications or, with additional complexity, meeting video analysis. In both cases, automatic facial analysis is highly sensitive to face tracking performance, a task made difficult principally by environment changes and by the great variability of facial appearance under different head orientations; the non–rigidity of the face adds yet another degree of difficulty. To overcome these problems, a great number of techniques have been developed, which can be divided into four categories: knowledge–, feature–, template– and appearance–based (Yang, 2004).

Among these techniques, face analysis by feature point tracking is known to demonstrate high concurrent validity with manual FACS (Facial Action Coding System) coding (Cohen et al., 1999), which is promising for facial analysis (Cottrell et al., 2003). Moreover, when facial attributes are correctly extracted, geometric feature–based methods typically share some common advantages, such as explicit face structure, practical implementation and collaborative feature-wide error elimination (Hu et al., 2004). In this context, several concepts were developed.

The classical matching technique extracts features from two frames and tries to establish a correspondence, whereas correlation-based techniques compare windowed areas in two frames, and the maximum cross-correlation value provides the new relative position. More recent techniques, however, determine the correct relative position (disparity) without the search process required by conventional ones. In this category, phase–based approaches have attracted attention because of their biological motivation and robustness (Theimer and Mallot, 1994; Fleet and Jepson, 1993).

In the literature, one can find several attempts at designing non–holistic methods based on Gabor wavelets (Shen and Bai, 2006), owing to their interesting and desirable properties, including spatial locality, self-similar hierarchical representation, optimal joint uncertainty in space and frequency, as well as biological plausibility (Flaton and Toborg, 1989). However, most of them are based on the magnitude part of the filter response (Lades et al., 1993; Tian et al., 2002; Liu and Wechsler, 2003; Valstar and Pantic, 2006). In fact, under special consideration, particularly because of its shift–variant property, the Gabor phase can be a very discriminative information source (Zhang et al., 2007).

Note: we use the words "disparity" and "displacement" interchangeably.

Dahmane M. and Meunier J. (2008).

ENHANCED PHASE–BASED DISPLACEMENT ESTIMATION - An Application to Facial Feature Extraction and Tracking.

In Proceedings of the Third International Conference on Computer Vision Theory and Applications, pages 427-433

DOI: 10.5220/0001081804270433

SciTePress

In this paper, we use this property of the Gabor phase for facial feature tracking. In Section 2, we describe the Gabor-kernel family we are using. In Section 3, we introduce the adopted strategy for facial feature extraction. The tracking algorithm is given in Section 4, including technical details and a discussion of its derivation. Finally, in Section 5, we apply the approach to a facial expression database.

2 LOCAL FEATURE MODEL BASED ON GABOR WAVELETS

2.1 Gabor Wavelets

A Gabor jet $J(x)$ describes, via a set of filtering operations (eq. 1), the spatial frequency structure around the pixel $x$ as a set of complex coefficients.

$$J_j(x) = \int I(x')\,\Psi_j(x - x')\,dx' \qquad (1)$$

A Gabor wavelet is a complex plane wave modulated by a Gaussian envelope:

$$\Psi_j(x) = \eta_j\, e^{-\frac{\|k_j\|^2 \|x\|^2}{2\sigma^2}} \left[ e^{\imath k_j \cdot x} - e^{-\frac{\sigma^2}{2}} \right] \qquad (2)$$

where $\sigma = 2\pi$, and $k_j = (k_{jx}, k_{jy}) = (k_\nu \cos(\phi_\mu), k_\nu \sin(\phi_\mu))$ defines the wave vector, with $k_\nu = 2^{-\frac{\nu+2}{2}}\pi$ and $\phi_\mu = \mu\frac{\pi}{8}$.

Notice that the last term of equation 2 compensates for the non-null average value of the cosine component. We choose the term $\eta_j$ so that the energy of the wavelet $\Psi_j$ is unity (eq. 3).

$$\int |\Psi_j(x)|^2\,dx = 1 \qquad (3)$$

A jet $J(x) = \{a_j e^{\imath\phi_j} \,/\, j = \mu + 8\nu\}$ is commonly defined as a set of 40 complex coefficients constructed from different Gabor filters spanning different orientations ($\mu \in [0,7]$) under different scales ($\nu \in [0,4]$).
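As an illustration, the kernel family of equations 1 to 3 can be sketched in Python with NumPy. This is a minimal sketch and not the implementation used in the experiments: the function names (`wave_vector`, `gabor_kernel`, `jet`), the truncation of the kernel support at three envelope radii and the edge padding of the image are assumptions made for the example, and the unit-energy normalization is applied to the sampled kernel.

```python
import numpy as np

SIGMA = 2.0 * np.pi  # sigma = 2*pi, as stated for eq. 2


def wave_vector(nu, mu):
    """Wave vector k_j for scale nu in [0, 4] and orientation mu in [0, 7]."""
    k = 2.0 ** (-(nu + 2) / 2.0) * np.pi
    phi = mu * np.pi / 8.0
    return np.array([k * np.cos(phi), k * np.sin(phi)])


def gabor_kernel(nu, mu):
    """Sampled Gabor wavelet (eq. 2), normalized to unit discrete energy (eq. 3)."""
    kx, ky = wave_vector(nu, mu)
    k2 = kx * kx + ky * ky
    # Assumption: truncate the support at 3 envelope radii (radius = sigma / k).
    half = int(np.ceil(3.0 * SIGMA / np.sqrt(k2)))
    r = np.arange(-half, half + 1)
    X, Y = np.meshgrid(r, r)  # X varies along columns, Y along rows
    envelope = np.exp(-k2 * (X ** 2 + Y ** 2) / (2.0 * SIGMA ** 2))
    carrier = np.exp(1j * (kx * X + ky * Y)) - np.exp(-SIGMA ** 2 / 2.0)
    psi = envelope * carrier
    return psi / np.sqrt(np.sum(np.abs(psi) ** 2))


def jet(image, x, y):
    """40 complex coefficients at pixel (x, y), indexed by j = mu + 8 * nu."""
    coeffs = np.empty(40, dtype=complex)
    for nu in range(5):
        for mu in range(8):
            psi = gabor_kernel(nu, mu)
            half = psi.shape[0] // 2
            # Assumption: replicate border pixels so the window always fits.
            padded = np.pad(image, half, mode="edge")
            patch = padded[y:y + 2 * half + 1, x:x + 2 * half + 1]
            # Convolution evaluated at one point: sum of I(x') * Psi(x - x').
            coeffs[mu + 8 * nu] = np.sum(patch * psi[::-1, ::-1])
    return coeffs
```

The amplitudes $a_j$ and phases $\phi_j$ of a jet are then obtained with `np.abs` and `np.angle` applied to the returned array.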

3 AUTOMATIC VISUAL ATTRIBUTE DETECTION

3.1 Rough Face Localization

When the first face image is captured, a pyramidal image representation is created, where the coarsest level is used to find near-optimal starting points for the subsequent individual facial feature localization stage. Each trained grid (Fig. 1) from a set of pre-stored face grids is displaced as a rigid object over the image. The grid position that maximizes the weighted magnitude–based similarity function (eq. 4 and 5) provides the best-fitting node positions.

Figure 1: Facial nodes with their respective codes (1.1, 1.2, 2.1, 2.2, 2.3, 3.1, 3.2, 4.1, 4.2, 4.3, 5.1, 5.2, 6.1, 6.2, 6.3, 6.4).

$$\mathrm{Sim}(I,G) = \prod_{l}^{L} S(J_l, J'_l) \qquad (4)$$

$S(J,J')$ refers to the similarity between the jets of the corresponding nodes (eq. 5), and $L$ stands for the total number of nodes.

$$S(J,J') = \frac{\sum_j c_j\, a_j a'_j}{\sqrt{\sum_j a_j^2 \sum_j {a'_j}^2}} \quad \text{with} \quad c_j = \left(1 - \frac{|a_j - a'_j|}{a_j + a'_j}\right)^2 \qquad (5)$$

The role of the weighting factor $c_j$ is to model the amplitude distortion δ, as illustrated in figure 2.

Figure 2: Two different 3–dimensional jets. In the right subfigure, an unweighted amplitude–based similarity S(J,J') would have given an incorrect perfect match value of 1.
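The effect of the weighting factor can be checked on a small numerical example. The sketch below is illustrative only: the function name, the `eps` guard against division by zero and the squared form of the weighting factor are assumptions about the exact form of $c_j$, and the two 3–dimensional jets are made-up amplitudes, not data from the experiments.

```python
import numpy as np


def magnitude_similarity(a, a_prime, weighted=True, eps=1e-12):
    """Normalized amplitude similarity between two jets (amplitudes only)."""
    a = np.asarray(a, dtype=float)
    a_prime = np.asarray(a_prime, dtype=float)
    if weighted:
        # Assumed weighting: penalize filters whose amplitudes differ.
        c = (1.0 - np.abs(a - a_prime) / (a + a_prime + eps)) ** 2
    else:
        c = np.ones_like(a)
    num = np.sum(c * a * a_prime)
    den = np.sqrt(np.sum(a ** 2) * np.sum(a_prime ** 2))
    return num / den


# Two jets with proportional amplitudes (a' = 2a): the unweighted measure
# cannot tell them apart, whereas the weighted one is reduced by the distortion.
a = [1.0, 2.0, 3.0]
a_prime = [2.0, 4.0, 6.0]
print(magnitude_similarity(a, a_prime, weighted=False))
print(magnitude_similarity(a, a_prime, weighted=True))
```

In this example every filter has the same relative amplitude distortion, so the weighted similarity is simply the unweighted one scaled by the common factor $c_j$.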

VISAPP 2008 - International Conference on Computer Vision Theory and Applications


3.2 Local Facial Feature Position Refinement

The rough facial grid-node positions are then independently refined by estimating the displacement using a hierarchical selective search. The calculated displacements are propagated to the subsequent hierarchy level, and a refinement operation is performed again. The optimal displacements are finally given at the finest image level.

The selective local search can be described as a local 3 × 3 neighborhood search, which allows the grid to be distorted until the maximum similarity value is reached. The search is then refined by propagating, to the next finer level, the three positions giving the highest similarity values. For each propagated potential position P(x,y), the three adjacent neighboring positions P(x+1,y), P(x,y+1) and P(x+1,y+1) are also explored. The selective search continues downward until the finest level of the image pyramid is reached, where the optimal position is the one that maximizes the similarity (eq. 5).

This procedure reduces the inherent complexity of computing the convolution under an exhaustive search, first by reducing the search area (e.g. a 12 × 12 neighborhood at the finest level corresponds to only a 3 × 3 neighborhood at the coarsest one) (Fig. 3), and second by using smaller jets at coarser levels.

Figure 3: Hierarchical–selective search. The values on the left side of the figure (9/144, 9/36 and 9/9) denote the number of explored positions vs. the total number that would be explored in the case of an exhaustive search.

4 FACIAL ATTRIBUTES TRACKING

Facial feature tracking is performed by estimating a displacement $d$ via a disparity estimation technique (Theimer and Mallot, 1994) that exploits the strong variation of the phases of the complex filter response (Maurer and von der Malsburg, 1996).

Later adopted by (Zhu and Ji, 2006), this framework, investigated in (Maurer and von der Malsburg, 1996; Wiskott et al., 1997), is based on the maximization of a phase–based similarity function, which is nothing else than a modified way to minimize the squared error within each frequency scale ν, given two jets $J$ and $J'$ (eq. 6), as proposed in (Theimer and Mallot, 1994).

$$e_\nu^2 = \sum_{\mu} c_{\nu,\mu} \left(\Delta\phi_{\nu,\mu} - k_{\nu,\mu} \cdot d_\nu\right)^2 \qquad (6)$$

However, we assume that the merit of that framework is the use of a saliency term (eq. 7) as weighting factor $c_{\nu,\mu}$, privileging displacement estimation from filters with higher amplitude response. Also, for such responses the phase seems to be more stable (McKenna et al., 1997).

$$c_j = a_j\, a'_j \qquad (7)$$

In (Theimer and Mallot, 1994), the weighting factor $c_j$ represents a confidence value (eq. 8) that assesses the relevance of a single disparity estimate and tends to reduce the influence of erroneous filter responses.

$$c_j = 1 - \frac{|a_j - a'_j|}{a_j + a'_j} \qquad (8)$$

Both the saliency term and the normalized confidence ignore the phase of the filter response. In the present work, we try to penalize the response of the erroneous filters by using a new confidence measure that combines both amplitude and phase (eq. 9).

$$c_j = a_j^2 \left(1 - \frac{|a_j - a'_j|}{a_j + a'_j}\right)^2 \left(1 - \frac{|\lfloor\Delta\phi_j\rfloor_{2\pi}|}{\pi}\right) \qquad (9)$$

The first term in this formulation represents the saliency term, incorporated as the squared amplitude of the reference jet $J$ only, which, contrary to the probe jet $J'$, necessarily ensures high confidence. By reference jet we mean the jet calculated from the previous frame, or even a pre-stored one. The second, squared bracket term holds the normalized magnitude confidence. The last term, where $\lfloor\Delta\phi_j\rfloor_{2\pi}$ denotes the principal part of the phase difference within the interval $[-\pi, \pi)$, gives more weight to filters whose phase difference has a favorable convergence while, at the same time, limiting the influence of outlier filters.
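The three factors of the confidence measure can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names, the `eps` guard and the exact form of the phase factor, taken here as $1 - |\lfloor\Delta\phi_j\rfloor_{2\pi}|/\pi$, are assumptions made for the example rather than a definitive statement of eq. 9.

```python
import numpy as np


def principal(phase):
    """Principal part of a phase (difference), wrapped into [-pi, pi)."""
    return (np.asarray(phase, dtype=float) + np.pi) % (2.0 * np.pi) - np.pi


def confidence(a, a_prime, dphi, eps=1e-12):
    """Confidence combining amplitude and phase, per filter.

    a       : amplitudes of the reference jet J
    a_prime : amplitudes of the probe jet J'
    dphi    : phase differences between the two jets
    """
    a = np.asarray(a, dtype=float)
    a_prime = np.asarray(a_prime, dtype=float)
    saliency = a ** 2  # squared amplitude of the reference jet only
    magnitude = (1.0 - np.abs(a - a_prime) / (a + a_prime + eps)) ** 2
    # Assumed phase factor: 1 for identical phases, 0 for opposite phases.
    phase_term = 1.0 - np.abs(principal(dphi)) / np.pi
    return saliency * magnitude * phase_term


# Same amplitudes, increasing phase disagreement: the confidence decreases.
print(confidence([2.0, 2.0, 2.0], [2.0, 2.0, 2.0], [0.0, np.pi / 2, np.pi]))
```

With this form, a filter whose phase difference approaches ±π contributes almost nothing to the displacement estimate, whatever its amplitude.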


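The displacements can then be estimated with sufficient accuracy by minimizing (eq. 6), which leads to a set of linear equations for $d$ that can be solved directly (eq. 10).

$$d(J,J') = \begin{pmatrix} \sum_j c_j k_{jx}^2 & \sum_j c_j k_{jx}k_{jy} \\ \sum_j c_j k_{jx}k_{jy} & \sum_j c_j k_{jy}^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum_j c_j k_{jx} \lfloor\Delta\phi_j\rfloor_{2\pi} \\ \sum_j c_j k_{jy} \lfloor\Delta\phi_j\rfloor_{2\pi} \end{pmatrix} \qquad (10)$$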
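Minimizing the weighted squared error of eq. 6 with respect to $d$ gives the normal equations $\left(\sum_j c_j k_j k_j^T\right) d = \sum_j c_j \lfloor\Delta\phi_j\rfloor_{2\pi} k_j$, which is a 2 × 2 linear system. A minimal Python sketch of this step follows; the function names and the synthetic test values (a single scale, unit confidences, phase differences generated as $k_j \cdot d$) are assumptions for illustration, not the experimental setting.

```python
import numpy as np


def principal(phase):
    """Principal part of a phase difference, wrapped into [-pi, pi)."""
    return (np.asarray(phase, dtype=float) + np.pi) % (2.0 * np.pi) - np.pi


def estimate_displacement(k, dphi, c):
    """Weighted least-squares displacement from phase differences.

    k    : (n, 2) array of wave vectors k_j
    dphi : (n,) phase differences between reference and probe jets
    c    : (n,) confidence weights
    """
    k = np.asarray(k, dtype=float)
    c = np.asarray(c, dtype=float)
    w = principal(dphi)
    # Normal equations: M d = b, with M = sum_j c_j k_j k_j^T.
    M = (c[:, None, None] * k[:, :, None] * k[:, None, :]).sum(axis=0)
    b = ((c * w)[:, None] * k).sum(axis=0)
    return np.linalg.solve(M, b)


# Synthetic check at the lowest frequency (nu = 4), eight orientations.
k_nu = 2.0 ** (-(4 + 2) / 2.0) * np.pi
angles = np.arange(8) * np.pi / 8.0
k = np.stack([k_nu * np.cos(angles), k_nu * np.sin(angles)], axis=1)
d_true = np.array([3.0, -2.0])
d_est = estimate_displacement(k, k @ d_true, np.ones(8))
print(d_est)
```

The estimate is only meaningful when the phase differences do not wrap around, which is the motivation for the scale-dependent limit discussed in Section 4.1.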

4.1 Iterative Disparity Computation

In (Theimer and Mallot, 1994), to obtain the disparity within one scale, the feature displacement estimates for each orientation were combined into one displacement per scale ($d_\nu$) using the least squared error criterion (eq. 6). The optimal disparity is then calculated by combining these estimates as an average over all scales with appropriate weights (eq. 8). In other approaches, a least squares solution is obtained in one pass over all considered frequencies (Wiskott et al., 1997), while some propose to first use the subset of lower frequencies (e.g. ν ∈ [2,4]) and then to solve for the subset of higher frequencies (e.g. ν ∈ [0,2]).

These resolutions may carry an additional risk of unfavorable results: at each scale there exists a displacement value above which the estimation is not reliable, due to the lack of a large overlap of the Gabor kernels. Obviously, this value depends on the radius ($\sigma/k_\nu$) of the Gaussian envelope.

As the power spectrum of the Gabor signal (eq. 2) is concentrated in the interval $[-\sigma/(2k_\nu), \sigma/(2k_\nu)]$, we can compute the maximum disparity $d_\nu^{max}$ that can be estimated within one scale (eq. 11).

$$d_\nu^{max} = \frac{\sigma}{2k_\nu} = \frac{\pi}{k_\nu} \qquad (11)$$

If, for example, the true displacement is d = 7 pixels, then according to the Gabor–kernel family we used (section 2.1), only the lowest frequency band filter gives a reliable estimation of the disparity, since it is the only one whose critical displacement (8 pixels, Table 1) exceeds 7 pixels.

So, the trick consists in estimating the disparity iteratively, from the lowest frequency up to a highest critical frequency, depending on a stopping criterion involving the maximum allowed disparity value that can be effectively estimated. Some values are shown in Table 1 as a function of scale.
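The critical displacements of Table 1 can be reproduced numerically. The sketch below assumes that the maximum disparity takes the form $d_\nu^{max} = \sigma/(2k_\nu) = \pi/k_\nu$, which is consistent with the interval stated above and with the values of Table 1; the function name is hypothetical.

```python
import math

SIGMA = 2.0 * math.pi  # sigma = 2*pi (section 2.1)


def critical_displacement(nu):
    """Assumed critical displacement (pixels) at scale nu: sigma / (2 * k_nu)."""
    k_nu = 2.0 ** (-(nu + 2) / 2.0) * math.pi
    return SIGMA / (2.0 * k_nu)


for nu in range(5):
    print(nu, round(critical_displacement(nu), 2))
```

The values for ν = 1 and ν = 3 are not integers, which matches the approximate entries (≈ 3 and ≈ 6) of Table 1.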

Given $J(x) = \{a_j e^{\imath\phi_j}\}$ the reference jet and $J'(x+d) = \{a'_j e^{\imath\phi'_j}\}$ the probe jet, i.e. the jet calculated at the probe position $(x+d)$ using the $j$-th wavelet, an iterative disparity estimation algorithm (Fig. 4) gives the optimal displacement $d_{opt}$ that makes the two jets as similar as possible.

Table 1: Critical displacement for each frequency.

ν                 0     1     2     3     4
d^max (pixels)    2    ≈ 3    4    ≈ 6    8

Algorithm 1. ITERATIVEDISPARITYESTIMATION(x)

(1) Initially set $\nu_{critic}$ to the lowest frequency index;

(2) Calculate $J'_j(x)$ for the components that refer to $\nu_{critic}$ at different orientations;

(3) Estimate the disparity $\delta d$ using equation (10) by considering all the processed frequencies at different orientations;

(4) Compensate for the phase: $\phi_j \leftarrow \lfloor \phi_j - k_j \cdot \delta d \rfloor_{2\pi}$;

(5) Cumulate the disparity: $d = d + \delta d$;

(6) Perform the convergence test: if $\|\delta d\|$ is greater than a threshold, goto (3);

(7) If the stopping criterion is not met, i.e. the overall displacement $\|d\|$ is less than the critical displacement value $d^{max}$ (see Table 1), then put $\nu_{critic} = \nu_{critic} - 1$ (the next higher frequency, since larger ν denotes lower frequency in section 2.1) and goto (2).

Figure 4: Conditional iterative disparity estimation algorithm.
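The control flow of the conditional estimation can be sketched in Python on synthetic data. This is a simplified illustration and not the tracker itself: jets are replaced by synthetic phase differences $\lfloor k_j \cdot d\rfloor_{2\pi}$ with unit confidences, the compensation is applied directly to the phase differences of all scales, and the stopping test compares $\|d\|$ with the critical displacement $\pi/k_\nu$ of the next higher frequency. All names, default thresholds and these three choices are assumptions made for the example.

```python
import numpy as np

N_SCALES, N_ORIENT = 5, 8  # nu in [0, 4], mu in [0, 7]


def wave_vectors():
    """Wave vectors k[nu, mu] of the kernel family (section 2.1)."""
    ks = np.empty((N_SCALES, N_ORIENT, 2))
    for nu in range(N_SCALES):
        k = 2.0 ** (-(nu + 2) / 2.0) * np.pi
        for mu in range(N_ORIENT):
            phi = mu * np.pi / 8.0
            ks[nu, mu] = (k * np.cos(phi), k * np.sin(phi))
    return ks


def principal(phase):
    """Principal part of a phase difference, wrapped into [-pi, pi)."""
    return (np.asarray(phase, dtype=float) + np.pi) % (2.0 * np.pi) - np.pi


def conditional_disparity(dphi, conf, ks, tol=1e-3, max_inner=20):
    """Estimate d from the lowest frequency upward, stopping conditionally."""
    dphi = principal(dphi)  # working copy of the phase differences
    d = np.zeros(2)
    nu = N_SCALES - 1  # step 1: start at the lowest frequency
    while True:
        for _ in range(max_inner):
            # Step 3: least squares over all processed scales (nu .. N_SCALES-1).
            k = ks[nu:].reshape(-1, 2)
            c = conf[nu:].ravel()
            w = dphi[nu:].ravel()
            M = (c[:, None, None] * k[:, :, None] * k[:, None, :]).sum(axis=0)
            b = ((c * w)[:, None] * k).sum(axis=0)
            delta = np.linalg.solve(M, b)
            dphi = principal(dphi - ks @ delta)  # step 4: phase compensation
            d += delta                           # step 5: cumulate
            if np.linalg.norm(delta) <= tol:     # step 6: convergence test
                break
        if nu == 0:
            break
        # Step 7 (assumed form): move to the next higher frequency only if
        # the overall displacement is below its critical displacement.
        d_max_next = np.pi / np.linalg.norm(ks[nu - 1, 0])
        if np.linalg.norm(d) >= d_max_next:
            break
        nu -= 1
    return d


ks = wave_vectors()
d_true = np.array([7.0, 0.0])
d_est = conditional_disparity(principal(ks @ d_true),
                              np.ones((N_SCALES, N_ORIENT)), ks)
print(d_est)
```

For the 7-pixel example discussed in the text, this sketch stops after the lowest frequency band, because the accumulated displacement already exceeds the critical displacement of the next scale.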

The conditional iterative disparity estimation (Fig. 4) is then applied iteratively at the new position $x_{new} \leftarrow x + d_{opt}$ until a convergence criterion is achieved, i.e. $\|d_{opt}\|$ tends to 0 or the maximum number of iterations $max_{iter}$ is reached. Herein, $\nu_{critic}$ could keep its previous value, instead of starting, for each new position, with the coarsest scale (i.e. $\nu_{critic} = N_\nu - 1$, with $N_\nu$ the number of scales).

5 EXPERIMENTAL RESULTS

The Hammal–Caplier face database (Hammal et al., 2007) is used to test the proposed approach. In this database, each video contains about 120 frames, and each of the 15 distinct subjects acts out different facial expressions (neutral, surprise, disgust and joy) with some tolerance on rotation and tilting. We used 30 videos with a spatial resolution of 320 × 240.


Table 2: Percentage of used frames to handle local facial deformations.

Facial feature     1.1  1.2  2.1  2.2  2.3  3.1  3.2  4.1  4.2  4.3  5.1  5.2  6.1  6.2  6.3  6.4
Used frames (%)    2.5  1.8  3.9  4    3    2.3  3.4  4.2  3.7  2.7  1.5  2.4  3.8  8    2    9

A generic face grid (Fig. 1) is created using one frame from each subject (frontal view). In order to handle facial deformation and prevent drifting, facial feature bunches are generated. Table 2 shows each landmark and the percentage of the total number of frames required to create its representative facial bunch. As can be seen, the number increases with the degree of variability of the local deformation that can be observed for each facial feature. These percentages were set empirically.

To locate the face grid, a search is performed over the coarsest of the 3 image levels that we used. Then a hierarchical selective refinement is performed using a weighted magnitude–based similarity to obtain the optimal node positions. Figure 5 shows the results of the position refinement after rough node positioning.

Figure 5: Node position refinement (bottom) after rough positioning (top).

Figure 6 shows the magnitude profile corresponding to (µ,ν) = (0,0) for node 2.1 (right inner eye) from a video where the subject is performing a disgust expression. Figure 7 illustrates the phase profile of the same subject with and without the phase compensation ($\phi_j \leftarrow \lfloor\phi_j - k_j \cdot d\rfloor_{2\pi}$) of Algorithm 1. One can observe some large and sharp phase variations when no compensation is used, corresponding to tracking failure.

Figure 8 shows three shots of a video in which a subject performs a disgust expression; the top subfigure presents the last frame. In this figure, we can see that tracking has failed with a single jet (instead of a bunch). The drifting cannot be measured from the magnitude profile alone (middle row), because the magnitude changes smoothly with the position. This is not the case for the phase (bottom row), which is shift–variant. However, by using shift compensation and facial bunches as described in Algorithm 1, we can correctly track the facial landmarks (Fig. 9). In comparison with figure 8, the bottom graph shows a horizontal and correct phase profile (without node drifting). The reader can appreciate the impact of such a correction by looking in particular at nodes 2.1 (right inner eye) and 2.3 (right lower eyelid) in figures 8 and 9.

Figure 6: Amplitude profile over time of node 2.1 (right inner eye).

Figure 7: Phase profile: not corrected (left) vs. corrected (right) phase.

In Table 3, we summarize the tracking results for 16 facial features of 10 different persons with different expressions. The mean error of node positions using the proposed approach is given in pixels. From the last column, we can see that the use of facial bunches appreciably improves node positioning and consequently the tracking accuracy: averaged over the ten subjects of Table 3, the mean error drops from about 4.76 to about 1.86 pixels.


Figure 8: A drifting case : Magnitude vs. Phase proﬁle.

Table 3: Mean position error (pixels).

Subject    Without bunches    With bunches
#1         4.28               1.78
#2         3.98               1.37
#3         5.07               2.03
#4         4.44               1.9
#5         4.17               1.7
#6         4.05               1.63
#7         4.69               1.5
#8         4.1                1.75
#9         5.85               2.49
#10        6.93               2.47

Figure 9: Drift avoidance.

6 CONCLUSIONS

In this work, we present a modification of a phase–based displacement estimation technique using a new confidence measure and a conditional disparity estimation. The proposed tracking algorithm makes it possible to eliminate the accumulation of tracking errors and so avoid drifting, offering good facial landmark localization, which is a crucial task in a feature–based facial expression recognition system. Note that in these experiments, except for the first frame, no geometric constraints were used to enforce the facial shape configuration, especially for features that are difficult to track.

More training sessions could be needed to obtain pre-stored grids and feature bunches that are representative of the variability of human face appearance, for initialisation and tracking respectively. In this context, using available face databases, advanced statistical models of the data can be obtained with learning algorithms such as EM (Jiao et al., 2003).

To reinforce the refinement step, we are working on improving the local structure by providing an alternative appearance model that focuses more on the high-frequency domain without necessarily altering the relevant low-frequency texture information, instead of modeling the grey-level appearance (Zhang et al., 2003) or exploiting a global shape constraint (McKenna et al., 1997), which tends to smooth out important details.

As future work, we plan to use facial feature bunches to generate, for each facial expression and each facial attribute, what could constitute "Expression Bunches" for facial expression analysis.

ACKNOWLEDGEMENTS

This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

REFERENCES

Cohen, J., Zlochower, A., Lien, J., and Kanade, T. (1999). Face Analysis by Feature Point Tracking Has Concurrent Validity with Manual FACS Coding. Psychophysiology, 36(1):35–43.

Cottrell, G., Dailey, M., and Padgett, C. (2003). Is All Face Processing Holistic? The view from UCSD. In M. Wenger and J. Townsend (Eds), Computational, Geometric and Process Perspectives on Facial Cognition: Contexts and Challenges, Erlbaum.

Flaton, K. and Toborg, S. (1989). An approach to image recognition using sparse filter graphs. In International Joint Conference on Neural Networks, (1):313–320.

Fleet, D. and Jepson, A. (1993). Stability of phase information. In IEEE Trans. on PAMI, 15(12):1253–1268.

Hammal, Z., Couvreur, L., Caplier, A., and Rombaut, M. (2007). Facial expression classification: An approach based on the fusion of facial deformation using the transferable belief model. In Int. Jour. of Approximate Reasoning.

Hu, Y., Chen, L., Zhou, Y., and Zhang, H. (2004). Estimating face pose by facial asymmetry and geometry. In IEEE International Conference on Automatic Face and Gesture Recognition.

Jiao, F., Li, S., Shum, H.-Y., and Schuurmans, D. (2003). Face alignment using statistical models and wavelet features. In Computer Vision and Pattern Recognition (1), p. 321–327.

Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P., and Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. In IEEE Transactions on Computers, 42(3):300–311.

Liu, C. and Wechsler, H. (2003). Independent component analysis of gabor features for face recognition. In IEEE Trans. on Neural Networks, 14(4):919–928.

Maurer, T. and von der Malsburg, C. (1996). Tracking and learning graphs and pose on image sequences of faces. In 2nd International Conference on Automatic Face and Gesture Recognition, p. 76.

McKenna, S., Gong, S., Würtz, R., Tanner, J., and Banin, D. (1997). Tracking facial feature points with gabor wavelets and shape models. In Proceedings of the First International Conference on Audio– and Video–based Biometric Person Authentication, 1206(3):35–42. Springer Verlag.

Shen, L. and Bai, L. (2006). A review on gabor wavelets for face recognition. In Pattern Analysis and Applications, 9(2):273–292.

Theimer, W. and Mallot, H. (1994). Phase–based binocular vergence control and depth reconstruction using active vision. In CVGIP: Image Understanding, 60(3):343–358.

Tian, Y., Kanade, T., and Cohn, J. (2002). Evaluation of gabor wavelet–based facial action unit recognition in image sequences of increasing complexity. In Proc. of the 5th IEEE Int. Conf. on Automatic Face and Gesture Recognition.

Valstar, M. and Pantic, M. (2006). Fully automatic facial action unit detection and temporal analysis. In CVPRW, p. 149.

Wiskott, L., Fellous, J., Krüger, N., and von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):775–779.

Yang, M. (2004). Recent advances in face detection. In Tutorial of IEEE Conference on Pattern Recognition.

Zhang, B., Gao, W., Shan, S., and Wang, W. (2003). Constraint shape model using edge constraint and gabor wavelet based search. In AVBPA03, 52–61.

Zhang, B., Shan, S., Chen, X., and Gao, W. (2007). Histogram of gabor phase patterns (HGPP): A novel object representation approach for face recognition. In IEEE Trans. on Image Processing, 16(1):57–68.

Zhu, Z. and Ji, Q. (2006). Robust pose invariant facial feature detection and tracking in real-time. In ICPR, 1092–1095.
