The quality of animations generated using our synthesis technique is similar to that of animations produced from 2D video (e.g. (Ezzat et al., 2002; Brand, 1999; Bregler et al., 1997)), with the added advantage of full control of orientation.
4 CONCLUSIONS
A data-driven approach to 3D visual speech synthesis
based on captured 3D video of faces has been pre-
sented. Recent advances in 3D video capture have
achieved simultaneous video-rate acquisition of facial
shape and appearance. In this paper we have intro-
duced face synthesis based on a graph representation
of a phonetically segmented 3D video corpus. This
approach is analogous to previous work in face syn-
thesis by resampling 2D video (Bregler et al., 1997)
and 2D video textures (Schödl et al., 2000). Face
synthesis for novel speech utterances is achieved by optimisation of the path through the graph and concatenation of segments of the captured 3D video; a minimal sketch of this path search is given below.
A novel metric using a hierarchical wavelet decomposition is introduced to identify transitions between 3D
video frames with similar facial shape, appearance
and dynamics. This metric allows efficient computation of the similarity between 3D video frames for a large corpus to produce transitions without visual artifacts; a sketch of such a metric is given at the end of this section. Results are presented for facial synthesis from a corpus of 12 minutes (18000 frames) of 3D
video. Visual speech synthesis of novel sentences
achieves a visual quality comparable to the captured
3D video allowing highly realistic synthesis without
post-processing. The data-driven approach to 3D face
synthesis requires minimal manual intervention be-
tween 3D video capture and facial animation from
speech. Future extensions of the system will introduce expression and secondary facial movements towards a fully engaging synthetic character.
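To make the wavelet-based similarity metric concrete, the following Python sketch shows one possible form of the frame comparison. It assumes each frame is reduced to a registered 2D map (e.g. a texture or displacement image) whose sides are divisible by 2**levels, uses a Haar basis, and takes a Euclidean distance over coefficients; the actual basis, weighting, and handling of the shape, appearance and dynamics terms in the system may differ (dynamics could, for instance, be included by applying the same distance to temporal difference maps).

    import numpy as np

    def haar_decompose(img, levels=3):
        # Hierarchical Haar decomposition: at each level, pairwise
        # averages/differences along rows then columns yield a coarse
        # approximation band and three detail bands (LH, HL, HH).
        a = np.asarray(img, dtype=np.float64)
        bands = []
        for _ in range(levels):
            lo = (a[:, 0::2] + a[:, 1::2]) / 2.0
            hi = (a[:, 0::2] - a[:, 1::2]) / 2.0
            a = (lo[0::2] + lo[1::2]) / 2.0           # coarse approximation
            bands.append(((lo[0::2] - lo[1::2]) / 2.0,  # LH
                          (hi[0::2] + hi[1::2]) / 2.0,  # HL
                          (hi[0::2] - hi[1::2]) / 2.0)) # HH
        return a, bands

    def frame_distance(dec1, dec2, coarse_reject=None):
        # Distance between two precomputed decompositions. Comparing
        # the coarse approximations first allows clearly dissimilar
        # frames to be rejected before any detail band is examined,
        # keeping all-pairs comparison over a large corpus cheap.
        (a1, b1), (a2, b2) = dec1, dec2
        d = np.sum((a1 - a2) ** 2)
        if coarse_reject is not None and d > coarse_reject:
            return np.inf
        for (lh1, hl1, hh1), (lh2, hl2, hh2) in zip(b1, b2):
            d += (np.sum((lh1 - lh2) ** 2) + np.sum((hl1 - hl2) ** 2)
                  + np.sum((hh1 - hh2) ** 2))
        return float(np.sqrt(d))

Decompositions are computed once per frame, so identifying good transitions reduces to cheap coefficient comparisons across the corpus.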
REFERENCES
Brand, M. (1999). Voice puppetry. In Proceedings of
SIGGRAPH ’99, pages 21–28, New York, NY, USA.
ACM Press/Addison-Wesley Publishing Co.
Bregler, C., Covell, M., and Slaney, M. (1997). Video
rewrite: driving visual speech with audio. In Pro-
ceedings of SIGGRAPH ’97, pages 353–360, New
York, NY, USA. ACM Press/Addison-Wesley Pub-
lishing Co.
Cao, Y., Faloutsos, P., and Pighin, F. (2003). Unsupervised
learning for speech motion editing. In Eurograph-
ics/ACM SIGGRAPH Symposium on Computer Ani-
mation ’03, pages 225–231.
Cohen, M. and Massaro, D. (1993). Modeling coarticula-
tion in synthetic visual speech. In Computer Anima-
tion ’93, pages 139–156.
Cosatto, E. and Graf, H. (2000). Photo-realistic talking
heads from image samples. IEEE Transactions on
Multimedia, 2(3):152–163.
Ezzat, T., Geiger, G., and Poggio, T. (2002). Trainable vide-
orealistic speech animation. In Proceedings of SIG-
GRAPH ’02, pages 388–398, New York, NY, USA.
ACM Press.
Kalberer, G. and Van Gool, L. (2002). Realistic face ani-
mation for speech. Journal of Visualization and Com-
puter Animation, 13(2):97–106.
Kovar, L., Gleicher, M., and Pighin, F. (2002). Motion
graphs. In Proceedings of SIGGRAPH ’02, pages
473–482, New York, NY, USA. ACM Press.
Kshirsagar, S. and Magnenat-Thalmann, N. (2003). Visyl-
lable based speech animation. In Eurographics’03,
pages 632–640.
Löfqvist, A. (1990). Speech as audible gestures. In Hardcastle, W. and Marchal, A., editors, Speech Production and Speech Modeling, pages 289–322. Kluwer.
McGurk, H. and MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264:746–748.
Schödl, A., Szeliski, R., Salesin, D. H., and Essa, I. (2000). Video textures. In Proceedings of SIGGRAPH ’00, pages 489–498, New York, NY, USA. ACM Press/Addison-Wesley Publishing Co.
Shashua, A. and Levin, A. (2001). Linear image coding
for regression and classification using the tensor-rank
principle. In Proceedings of the 2001 IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition, pages 42–49.
Stollnitz, E., DeRose, T., and Salesin, D. (1995). Wavelets for computer graphics: A primer. IEEE Computer Graphics and Applications, 15(3):76–84.
Sumby, W. and Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26:212–215.
Taylor, P., Black, A., and Caley, R. (1998). The architecture
of the Festival speech synthesis system. In Third
International Workshop on Speech Synthesis.
Wang, Y., Huang, X., Lee, C.-S., Zhang, S., Li, Z., Samaras,
D., Metaxas, D., Elgammal, A., and Huang, P. (2004).
High resolution acquisition, learning and transfer of
dynamic 3-d facial expressions. In Eurographics’04,
pages 677–686.
Ypsilos, I., Hilton, A., and Rowe, S. (2004). Video-rate
capture of dynamic face shape and appearance. In 6th IEEE International Conference on Automatic Face and Gesture Recognition, pages 117–123.
Zhang, L., Snavely, N., Curless, B., and Seitz, S. (2004).
Spacetime faces: high resolution capture for model-
ing and animation. ACM Transactions on Graphics,
23(3):548–558.