2 PREVIOUS WORK
Substantial research effort has been devoted to facial expression analysis over the past few years. Most of this work focuses on tracking (Zhao and Chellappa, 2006) or on the recognition of facial expressions and action units (Lucey et al., 2006; Cohn et al., 2006). Little attention has been paid to the problem of temporal segmentation of facial gestures, which can greatly benefit the recognition process. An exception is the pioneering work of Mase and Pentland (Mase and Pentland, 1990), which showed that zeros of the velocity of the facial motion parameters are useful for temporal segmentation and its application to lip reading. More recently, Hoey (Hoey, 2001) presented a multilevel Bayesian network for learning the dynamics of facial expression. In related work, Zelnik-Manor and Irani (Zelnik-Manor and Irani, 2004) proposed a modification of factorization algorithms for structure from motion to provide temporal clustering of non-rigid motion.
Most previous work assumes an accurate registra-
tion process before the segmentation step. Accurate
registration of the non-rigid facial features is still an
open research problem (Zhao and Chellappa, 2006),
in particular decoupling rigid and non-rigid motion
from 2D. Unlike previous research, in this paper we
propose an algorithm that jointly performs registra-
tion and clustering as a first step toward temporal seg-
mentation of facial gestures. Moreover, we develop a
simple but effective way to group these clusters into
temporally coherent chunks.
3 MATRIX FORMULATION FOR
CLUSTERING
In this section we review the state of the art in clustering algorithms using a new matrix formulation that illuminates the connections between several clustering methods and suggests new optimization schemes for spectral clustering.
3.1 K-means
K-means (MacQueen, 1967; Jain, 1988) is one of the
simplest and most popular unsupervised learning al-
gorithms to solve the clustering problem. Clustering
refers to the partition of n data points into c disjoint
clusters. k-means clustering splits a set of n objects
into c groups by maximizing the between-cluster variation relative to the within-cluster variation. That is,
k-means clustering finds the partition of the data that
is a local optimum of the following energy function:
J(µ_1, ..., µ_c) = ∑_{i=1}^{c} ∑_{j ∈ C_i} ||d_j − µ_i||_2^2
where d_j (see notation 1) is a vector representing the j-th data point and µ_i is the geometric centroid of the data points for class i. The optimization criterion in the previous equation can be rewritten in matrix form as:
E_1(M, G) = ||D − MG^T||_F    (1)
subject to G1_c = 1_n and g_{ij} ∈ {0, 1}
where G ∈ ℜ^{n×c} and M ∈ ℜ^{d×c}. G is a dummy indicator matrix, such that ∑_j g_{ij} = 1, g_{ij} ∈ {0, 1}, and g_{ij} is 1 if d_i belongs to class C_j; c denotes the number of classes and n the number of samples. The columns of D ∈ ℜ^{d×n} contain the original data points, and d is the dimension of the data. Recall that the equivalence between the k-means error function and eq. 1 is only valid if G strictly satisfies the constraints.
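As a concrete illustration (a minimal sketch of ours, not the authors' code), the following Python snippet builds the indicator matrix G for a toy data set and checks that the matrix energy of eq. 1, under the paper's convention ||A||_F = tr(A^T A), coincides with the k-means objective J; the data and labels are assumptions made up for the example.

import numpy as np

d, n, c = 2, 6, 2                          # data dimension, samples, clusters
D = np.array([[0., 1., 0., 5., 6., 5.],
              [0., 0., 1., 5., 5., 6.]])   # columns of D are data points d_j
labels = np.array([0, 0, 0, 1, 1, 1])      # d_i belongs to class C_labels[i]

# Dummy indicator matrix G in {0,1}^{n x c}, one 1 per row (G 1_c = 1_n).
G = np.zeros((n, c))
G[np.arange(n), labels] = 1

M = D @ G @ np.linalg.inv(G.T @ G)         # class means, M = D G (G^T G)^{-1}
E1 = np.linalg.norm(D - M @ G.T, 'fro')**2 # eq. 1 with ||A||_F = tr(A^T A)

# The same energy as the k-means sum of squared distances to centroids.
J = sum(np.sum((D[:, labels == i] - M[:, [i]])**2) for i in range(c))
print(np.isclose(E1, J))                   # True: eq. 1 equals J(mu_1,...,mu_c)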
The k-means algorithm performs coordinate descent on E_1(M, G). Given the current value of the means M, the first step finds, for each data point d_j, the indicator vector g_j (the j-th row of G) with one entry equal to one and the rest zero that minimizes eq. 1. The second step optimizes over M = DG(G^T G)^{-1}, which is equivalent to computing the mean of each cluster. Although it can be proven that alternating these two steps always terminates, the k-means algorithm does not necessarily find the optimal configuration over all possible assignments; it is therefore typically run multiple times and the best solution is chosen. Despite these limitations, the algorithm is used fairly frequently as a result of its ease of implementation and effectiveness.
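A minimal sketch of these two alternating steps, under simplified assumptions (random initialization from the data, a fixed number of iterations, no restarts; kmeans_e1 is a hypothetical name, not from the paper):

import numpy as np

def kmeans_e1(D, c, n_iter=100, seed=0):
    """Coordinate descent on E_1(M, G); D is d x n with data points as columns."""
    d, n = D.shape
    rng = np.random.default_rng(seed)
    M = D[:, rng.choice(n, size=c, replace=False)]  # init means from the data
    for _ in range(n_iter):
        # Step 1: assign each d_j to its nearest mean (one 1 per row of G).
        dist = ((D[:, :, None] - M[:, None, :])**2).sum(axis=0)  # n x c
        labels = dist.argmin(axis=1)
        G = np.zeros((n, c))
        G[np.arange(n), labels] = 1
        # Step 2: M = D G (G^T G)^{-1}, i.e. the mean of each cluster.
        counts = np.maximum(G.sum(axis=0), 1)       # guard against empty clusters
        M = (D @ G) / counts
    return M, G

In practice the loop would stop as soon as the assignments no longer change, and the whole procedure would be restarted several times, keeping the solution with the lowest energy.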
Eliminating M, eq. 1 can be rewritten as:
E_2(G) = ||D − DG(G^T G)^{-1} G^T||_F = tr(D^T D) − tr((G^T G)^{-1} G^T D^T D G) ≥ ∑_{i=c+1}^{min(d,n)} λ_i    (2)
where the λ_i are the eigenvalues of D^T D, sorted in descending order. Minimizing eq. 2 is equivalent to maximizing tr((G^T G)^{-1} G^T D^T D G). Ignoring the special structure of G and considering the continuous domain, the optimal G for eq. 2 is given by the eigenvectors of the covariance matrix D^T D, and the error is E_2 = ∑_{i=c+1}^{min(d,n)} λ_i. A similar reasoning has been reported by (Ding and He, 2004; Zha et al., 2001).
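The bound in eq. 2 can be checked numerically. The sketch below (our illustration, assuming random Gaussian data and an arbitrary valid assignment) compares E_2 for an indicator matrix G against the sum of the smallest min(d, n) − c eigenvalues of D^T D:

import numpy as np

rng = np.random.default_rng(1)
d, n, c = 4, 30, 3
D = rng.standard_normal((d, n))            # random data, columns are points

# E_2 for a random (but valid) indicator matrix G.
labels = rng.integers(0, c, n)
G = np.zeros((n, c))
G[np.arange(n), labels] = 1
P = G @ np.linalg.inv(G.T @ G) @ G.T       # rank-c projection onto range(G)
E2 = np.trace(D.T @ D) - np.trace(P @ D.T @ D)

# Continuous optimum: eigenvalues of D^T D sorted in descending order,
# summed over i = c+1, ..., min(d, n).
lam = np.sort(np.linalg.eigvalsh(D.T @ D))[::-1]
bound = lam[c:min(d, n)].sum()
print(E2 >= bound - 1e-9)                  # True for any indicator matrix G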
1 Bold capital letters denote a matrix D, bold lower-case letters a column vector d. d_j represents the j-th column of the matrix D. d_{ij} denotes the scalar in row i and column j of the matrix D, and the i-th element of a column vector d_j. All non-bold letters represent scalar variables. diag is an operator that transforms a vector into a diagonal matrix, or takes the diagonal of a matrix into a vector. ◦ denotes the Hadamard or point-wise product. 1_k ∈ ℜ^{k×1} is a vector of ones. I_k ∈ ℜ^{k×k} is the identity matrix. tr(A) = ∑_i a_{ii} is the trace of the matrix A and |A| denotes the determinant. ||A||_F = tr(A^T A) = tr(AA^T) designates the Frobenius norm of a matrix.