Figure 2: (a) Facial expression of a real person; marks associated with FAP3 are encircled in red. (b) Synthesized facial expression.
and cheek movements). Similarly, several subsets of marks can be associated with the different FAPs.
6 EXPERIMENTAL RESULTS
For the audio-visual training, videos of a talking person with reference marks on the region around the person's mouth were recorded at a rate of 30 frames per second, with a resolution of 320×240 pixels. The audio was recorded at 11025 Hz, synchronized with the video. The videos consist of sequences of the Spanish utterances corresponding to the digits zero to nine in random order. For the re-training of the audio part of the AV-HMM, an audio-only database was collected, consisting of recordings of the utterances corresponding to the digits zero to nine by 25 speakers (a balanced proportion of males and females).
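As an illustration of such a front-end, frame-synchronous audio features (the first eleven non-DC Mel-cepstral coefficients, as described below) could be extracted along the following lines. This is a minimal sketch assuming librosa and a hypothetical file name, not the authors' actual processing chain:

```python
import librosa  # assumed library; the paper does not specify its tooling

# Load the audio track at the recording rate of 11025 Hz.
y, sr = librosa.load("talker.wav", sr=11025)  # hypothetical file name

# One feature vector per video frame: a hop of ~sr/30 samples keeps the
# audio analysis roughly aligned with the 30 fps video (367.5 -> 368).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, hop_length=368)

# Drop the 0th (DC/energy) coefficient, keeping the eleven non-DC
# Mel-cepstral coefficients that form the audio feature vector a_t.
a = mfcc[1:].T  # shape (T, 11)
```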
Experiments were performed with AV-HMMs with full and diagonal covariance matrices, different numbers of states and mixtures in the ranges [3, 20] and [2, 19], respectively, and different values of the co-articulation parameter $t_c$ in the range [2, 5]. In the experiments, the audio feature vector $a_t$ is composed of the first eleven non-DC Mel-cepstral coefficients, while the visual feature vector $o_v$ is of dimension two ($K = 2$ in equation (10)). The performance of the different models was compared by computing the Average Mean Square Error (AMSE) $\varepsilon$ and the Average Correlation Coefficient (ACC) $\rho$ between the true and estimated visual parameters, defined as
$$
\varepsilon = \frac{1}{TK}\sum_{k=1}^{K}\frac{1}{\sigma^{2}_{v_k}}\sum_{t=1}^{T}\left(o'^{\,k}_{v_t} - o^{k}_{v_t}\right)^{2} \qquad (11)
$$
$$
\rho = \frac{1}{TK}\sum_{t=1}^{T}\sum_{k=1}^{K}\frac{\left(o^{k}_{v_t} - \mu_{v_k}\right)\left(o'^{\,k}_{v_t} - \mu'_{v_k}\right)}{\sigma_{v_k}\,\sigma'_{v_k}} \qquad (12)
$$
respectively, where $\mu_{v_k}$ and $\sigma^{2}_{v_k}$ denote the mean and variance of the true visual observations, and $\mu'_{v_k}$ and $\sigma'^{2}_{v_k}$ denote the mean and variance of the estimated visual parameters.
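For reference, a direct NumPy transcription of equations (11) and (12) could look as follows; this is a sketch, and the array names and the (T, K) layout are assumptions:

```python
import numpy as np

def amse(o_true, o_est):
    """Average Mean Square Error, equation (11); inputs of shape (T, K)."""
    T, K = o_true.shape
    var_true = o_true.var(axis=0)                 # sigma^2_{v_k}
    sq_err = ((o_est - o_true) ** 2).sum(axis=0)  # inner sum over t
    return (sq_err / var_true).sum() / (T * K)

def acc(o_true, o_est):
    """Average Correlation Coefficient, equation (12); inputs of shape (T, K)."""
    T, K = o_true.shape
    cross = ((o_true - o_true.mean(axis=0)) *
             (o_est - o_est.mean(axis=0))).sum(axis=0)
    return (cross / (o_true.std(axis=0) * o_est.std(axis=0))).sum() / (T * K)
```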
For the quantification of the visual estimation accuracy, a separate audio-visual dataset, different from the training dataset, was employed. The following results correspond to a co-articulation parameter $t_c = 5$, which proved to be the optimal value in the given range. Fig. 3(a) and Fig. 3(b) show the AMSE and the ACC as a function of the number of states and the number of mixtures for an AV-HMM with a full covariance matrix. In this case, equation (7) applies for the estimation of the visual observations $o'_{v_t}$. As can be observed, the AMSE increases and the ACC decreases as the number of states and mixtures grows, indicating that the accuracy of the estimation deteriorates. This is probably due to the bias-variance tradeoff inherent in any estimation problem. For this case, the optimal values are $N = 4$ states and $M = 2$ mixtures, corresponding to $\varepsilon = 0.47$ and $\rho = 0.75$.
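The shape of these results suggests a simple model-selection loop over the explored ranges. In the hedged sketch below, train_av_hmm and invert_hmm are hypothetical stand-ins for the paper's AV-HMM training and inversion steps, and amse is the metric sketched above:

```python
# Hypothetical model-selection loop: keep the configuration with the
# lowest AMSE on the held-out audio-visual data. train_av_hmm and
# invert_hmm are placeholders, not functions defined in the paper.
best = None
for N in range(3, 21):        # number of states in [3, 20]
    for M in range(2, 20):    # number of mixtures in [2, 19]
        model = train_av_hmm(train_av, n_states=N, n_mix=M, cov="full")
        o_est = invert_hmm(model, test_audio)
        eps = amse(test_visual, o_est)
        if best is None or eps < best[0]:
            best = (eps, N, M)
# With full covariance matrices this selects N = 4, M = 2 (eps = 0.47).
```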
Fig. 3(c) and Fig. 3(d) show the AMSE and the ACC as a function of the number of states and the number of mixtures for an AV-HMM with a diagonal covariance matrix. In this case, equation (8) applies for the estimation of the visual observations $o'_{v_t}$. As can be observed, a more complex model (a larger number of states or mixtures) is required to obtain similar accuracy. For this case, the optimal values are $N = 19$ and $M = 3$, corresponding to $\varepsilon = 0.47$ and $\rho = 0.76$.
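To make the two configurations concrete, comparable joint audio-visual models could be instantiated with an off-the-shelf library as sketched below. Here hmmlearn and the concatenated observation layout are assumptions, and the library does not implement the paper's inversion algorithm:

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # assumed library, not used in the paper

# Joint observation: 11 audio coefficients + 2 visual parameters per frame.
av = np.hstack([audio_feats, visual_feats])  # assumed arrays, shape (T, 13)

# Full-covariance AV-HMM at the reported optimum: N = 4 states, M = 2 mixtures.
model_full = GMMHMM(n_components=4, n_mix=2, covariance_type="full")
model_full.fit(av)

# Diagonal-covariance counterpart: a larger model (N = 19, M = 3) is
# needed to reach similar accuracy, as reported above.
model_diag = GMMHMM(n_components=19, n_mix=3, covariance_type="diag")
model_diag.fit(av)
```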
The use of full covariance matrices increases the computational complexity of the training stage but, since training is carried out off-line, this does not represent a problem. During the synthesis stage (visual estimation through HMM inversion), and due to the low dimension of the visual feature vector ($K = 2$), the computational load is similar to that obtained with diagonal covariance matrices for the same number of states and mixtures.
The above arguments allow one to conclude that
the use of full covariance matrices is preferable from
the point of view of both computational complexity
and accuracy.
The true and estimated visual parameters for the case of full covariance matrices with $N = 4$ states and $M = 2$ mixtures (the optimal values) are shown in Fig. 4, where good agreement can be observed.
7 CONCLUSIONS
A speech-driven, MPEG-4-compliant facial animation system was introduced in this paper. A joint AV-HMM was proposed to represent the audio-visual data, and an algorithm for HMM inversion was derived for the general case of full covariance matrices.