2 PREVIOUS WORK
Substantial research effort has been devoted to facial expression analysis over the past few years. Most of this work focuses on tracking (Zhao and Chellappa, 2006) or on the recognition of facial expressions and action units (Lucey et al., 2006; Cohn et al., 2006). Little attention has been paid to the problem of temporal segmentation of facial gestures, which can greatly benefit the recognition process. An exception is the pioneering work of Mase and Pentland (Mase and Pentland, 1990), which showed that zeros of the velocity of the facial motion parameters are useful for temporal segmentation and its application to lip reading. More recently, Hoey (Hoey, 2001) presented a multilevel Bayesian network for learning the dynamics of facial expression. In related work, Zelnik-Manor and Irani (Zelnik-Manor and Irani, 2004) proposed a modification of factorization algorithms for structure from motion to provide temporal clustering of non-rigid motion.
Most previous work assumes an accurate registra-
tion process before the segmentation step. Accurate
registration of the non-rigid facial features is still an
open research problem (Zhao and Chellappa, 2006),
in particular decoupling rigid and non-rigid motion
from 2D. Unlike previous research, in this paper we
propose an algorithm that jointly performs registra-
tion and clustering as a first step toward temporal seg-
mentation of facial gestures. Moreover, we develop a
simple but effective way to group these clusters into
temporally coherent chunks.
3 MATRIX FORMULATION FOR
CLUSTERING
In this section we review the state of the art in clustering algorithms using a new matrix formulation that illuminates the connections between several clustering methods and suggests new optimization schemes for spectral clustering.
3.1 K-means
K-means (MacQueen, 1967; Jain, 1988) is one of the
simplest and most popular unsupervised learning al-
gorithms to solve the clustering problem. Clustering
refers to the partition of n data points into c disjoint
clusters. k-means clustering splits a set of n objects
into c groups by maximizing the between-cluster variation relative to the within-cluster variation. That is,
k-means clustering finds the partition of the data that
is a local optimum of the following energy function:
J(µ_1, ..., µ_c) = ∑_{i=1}^{c} ∑_{j ∈ C_i} ||d_j − µ_i||_2^2
where d_j (see notation 1) is a vector representing the j-th data point and µ_i is the geometric centroid of the data points for class i. The optimization criterion in the previous equation can be rewritten in matrix form as:
E_1(M, G) = ||D − MG^T||_F    (1)
subject to G1_c = 1_n and g_{ij} ∈ {0, 1}
where G ∈ ℜ^{n×c} and M ∈ ℜ^{d×c}. G is a dummy indicator matrix, such that ∑_j g_{ij} = 1, g_{ij} ∈ {0, 1}, and g_{ij} is 1 if d_i belongs to class C_j; c denotes the number of classes and n the number of samples. The columns of D ∈ ℜ^{d×n} contain the original data points, and d is the dimension of the data. Recall that the equivalence between the k-means error function and eq. 1 is only valid if G strictly satisfies the constraints.
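As a concrete illustration (a minimal sketch of ours, not the authors' code), the following Python snippet builds the indicator matrix G for a toy data set and checks that the matrix energy of eq. 1, under the paper's convention ||A||_F = tr(A^T A), coincides with the k-means objective J; the data and labels are assumptions made up for the example.

import numpy as np

d, n, c = 2, 6, 2                          # data dimension, samples, clusters
D = np.array([[0., 1., 0., 5., 6., 5.],
              [0., 0., 1., 5., 5., 6.]])   # columns of D are data points d_j
labels = np.array([0, 0, 0, 1, 1, 1])      # d_i belongs to class C_labels[i]

# Dummy indicator matrix G in {0,1}^{n x c}, one 1 per row (G 1_c = 1_n).
G = np.zeros((n, c))
G[np.arange(n), labels] = 1

M = D @ G @ np.linalg.inv(G.T @ G)         # class means, M = D G (G^T G)^{-1}
E1 = np.linalg.norm(D - M @ G.T, 'fro')**2 # eq. 1 with ||A||_F = tr(A^T A)

# The same energy as the k-means sum of squared distances to centroids.
J = sum(np.sum((D[:, labels == i] - M[:, [i]])**2) for i in range(c))
print(np.isclose(E1, J))                   # True: eq. 1 equals J(mu_1,...,mu_c)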
The k-means algorithm performs coordinate descent on E_1(M, G). Given the current value of the means M, the first step finds, for each data point d_j, the indicator vector g_j (the j-th row of G) with one entry equal to one and the rest zero that minimizes eq. 1. The second step optimizes over M = DG(G^T G)^{-1}, which is equivalent to computing the mean of each cluster. Although it can be proven that alternating these two steps always terminates, the k-means algorithm does not necessarily find the optimal configuration over all possible assignments; it is therefore typically run multiple times and the best solution is chosen. Despite these limitations, the algorithm is used fairly frequently as a result of its ease of implementation and effectiveness.
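A minimal sketch of these two alternating steps, under simplified assumptions (random initialization from the data, a fixed number of iterations, no restarts; kmeans_e1 is a hypothetical name, not from the paper):

import numpy as np

def kmeans_e1(D, c, n_iter=100, seed=0):
    """Coordinate descent on E_1(M, G); D is d x n with data points as columns."""
    d, n = D.shape
    rng = np.random.default_rng(seed)
    M = D[:, rng.choice(n, size=c, replace=False)]  # init means from the data
    for _ in range(n_iter):
        # Step 1: assign each d_j to its nearest mean (one 1 per row of G).
        dist = ((D[:, :, None] - M[:, None, :])**2).sum(axis=0)  # n x c
        labels = dist.argmin(axis=1)
        G = np.zeros((n, c))
        G[np.arange(n), labels] = 1
        # Step 2: M = D G (G^T G)^{-1}, i.e. the mean of each cluster.
        counts = np.maximum(G.sum(axis=0), 1)       # guard against empty clusters
        M = (D @ G) / counts
    return M, G

In practice the loop would stop as soon as the assignments no longer change, and the whole procedure would be restarted several times, keeping the solution with the lowest energy.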
Eliminating M, eq. 1 can be rewritten as:
E_2(G) = ||D − DG(G^T G)^{-1} G^T||_F = tr(D^T D) − tr((G^T G)^{-1} G^T D^T D G) ≥ ∑_{i=c+1}^{min(d,n)} λ_i    (2)
where the λ_i are the eigenvalues of D^T D, sorted in descending order. Minimizing eq. 2 is equivalent to maximizing tr((G^T G)^{-1} G^T D^T D G). Ignoring the special structure of G and considering the continuous domain, the optimal G for eq. 2 is given by the eigenvectors of the covariance matrix D^T D, and the error is E_2 = ∑_{i=c+1}^{min(d,n)} λ_i. A similar reasoning has been reported by (Ding and He, 2004; Zha et al., 2001).
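The bound in eq. 2 can be checked numerically. The sketch below (our illustration, assuming random Gaussian data and an arbitrary valid assignment) compares E_2 for an indicator matrix G against the sum of the smallest min(d, n) − c eigenvalues of D^T D:

import numpy as np

rng = np.random.default_rng(1)
d, n, c = 4, 30, 3
D = rng.standard_normal((d, n))            # random data, columns are points

# E_2 for a random (but valid) indicator matrix G.
labels = rng.integers(0, c, n)
G = np.zeros((n, c))
G[np.arange(n), labels] = 1
P = G @ np.linalg.inv(G.T @ G) @ G.T       # rank-c projection onto range(G)
E2 = np.trace(D.T @ D) - np.trace(P @ D.T @ D)

# Continuous optimum: eigenvalues of D^T D sorted in descending order,
# summed over i = c+1, ..., min(d, n).
lam = np.sort(np.linalg.eigvalsh(D.T @ D))[::-1]
bound = lam[c:min(d, n)].sum()
print(E2 >= bound - 1e-9)                  # True for any indicator matrix G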
1 Bold capital letters denote a matrix D, bold lower-case letters a column vector d. d_j represents the j-th column of the matrix D. d_{ij} denotes the scalar in row i and column j of the matrix D, and the i-th element of a column vector d_j. All non-bold letters represent scalar variables. diag is an operator that transforms a vector into a diagonal matrix, or takes the diagonal of a matrix into a vector. ◦ denotes the Hadamard or point-wise product. 1_k ∈ ℜ^{k×1} is a vector of ones. I_k ∈ ℜ^{k×k} is the identity matrix. tr(A) = ∑_i a_{ii} is the trace of the matrix A and |A| denotes the determinant. ||A||_F = tr(A^T A) = tr(AA^T) designates the Frobenius norm of a matrix.