VISUAL SPEECH SYNTHESIS FROM 3D VIDEO

J. D. Edge, A. Hilton

Abstract

Data-driven approaches to 2D facial animation from video have achieved highly realistic results. In this paper we introduce a process for visual speech synthesis from 3D video capture to reproduce the dynamics of 3D face shape and appearance. Animation from real speech is performed by path optimisation over a graph representation of phonetically segmented captured 3D video. A novel similarity metric using a hierarchical wavelet decomposition is presented to identify transitions between 3D video frames without visual artifacts in facial shape, appearance or dynamics. Face synthesis is performed by playing back segments of the captured 3D video to accurately reproduce facial dynamics. The framework allows visual speech synthesis from captured 3D video with minimal user intervention. Results are presented for synthesis from a database of 12 minutes (18,000 frames) of 3D video which demonstrate highly realistic facial animation.
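The abstract's hierarchical wavelet similarity metric can be illustrated with a minimal sketch: decompose each frame's feature vector with a Haar wavelet, then compare coefficients level by level, weighting coarse structure more heavily than fine detail. The function names, frame representation, and level weighting below are illustrative assumptions, not the paper's exact formulation.

```python
def haar_decompose(signal, levels):
    """Haar wavelet decomposition of a 1D frame descriptor.

    Returns per-level detail coefficients (finest first) and the
    final coarse approximation. Assumes len(signal) is divisible
    by 2**levels.
    """
    details = []
    approx = list(signal)
    for _ in range(levels):
        new_approx, detail = [], []
        for a, b in zip(approx[0::2], approx[1::2]):
            new_approx.append((a + b) / 2.0)  # local average (coarse)
            detail.append((a - b) / 2.0)      # local difference (detail)
        details.append(detail)
        approx = new_approx
    return details, approx


def wavelet_distance(frame_a, frame_b, levels=3):
    """Weighted L2 distance over the hierarchical decomposition.

    Coarse coefficients receive full weight; finer detail levels are
    down-weighted by powers of two (an assumed weighting scheme).
    """
    da, aa = haar_decompose(frame_a, levels)
    db, ab = haar_decompose(frame_b, levels)
    # Coarsest approximation: full weight.
    dist = sum((x - y) ** 2 for x, y in zip(aa, ab))
    # Detail levels: da[0] is finest, da[-1] is coarsest.
    for lvl, (ca, cb) in enumerate(zip(da, db)):
        weight = 0.5 ** (levels - lvl)  # finer details count less
        dist += weight * sum((x - y) ** 2 for x, y in zip(ca, cb))
    return dist ** 0.5
```

In a graph-based synthesis framework of the kind the abstract describes, a distance like this would score candidate transitions between captured frames, with low-distance pairs admitted as graph edges.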

References

  1. Brand, M. (1999). Voice puppetry. In Proceedings of SIGGRAPH '99, pages 21-28, New York, NY, USA. ACM Press/Addison-Wesley Publishing Co.
  2. Bregler, C., Covell, M., and Slaney, M. (1997). Video rewrite: driving visual speech with audio. In Proceedings of SIGGRAPH '97, pages 353-360, New York, NY, USA. ACM Press/Addison-Wesley Publishing Co.
  3. Cao, Y., Faloutsos, P., and Pighin, F. (2003). Unsupervised learning for speech motion editing. In Eurographics/ACM SIGGRAPH Symposium on Computer Animation '03, pages 225-231.
  4. Cohen, M. and Massaro, D. (1993). Modeling coarticulation in synthetic visual speech. In Computer Animation '93, pages 139-156.
  5. Cosatto, E. and Graf, H. (2000). Photo-realistic talking heads from image samples. IEEE Transactions on Multimedia, 2(3):152-163.
  6. Ezzat, T., Geiger, G., and Poggio, T. (2002). Trainable videorealistic speech animation. In Proceedings of SIGGRAPH '02, pages 388-398, New York, NY, USA. ACM Press.
  7. Kalberer, G. and Van Gool, L. (2002). Realistic face animation for speech. Journal of Visualization and Computer Animation, 13(2):97-106.
  8. Kovar, L., Gleicher, M., and Pighin, F. (2002). Motion graphs. In Proceedings of SIGGRAPH '02, pages 473-482, New York, NY, USA. ACM Press.
  9. Kshirsagar, S. and Magnenat-Thalmann, N. (2003). Visyllable based speech animation. In Eurographics'03, pages 632-640.
  10. Löfqvist, A. (1990). Speech as audible gestures. In Hardcastle, W. and Marchal, A., editors, Speech Production and Speech Modeling, pages 289-322. Kluwer.
  11. McGurk, H. and MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264:746-748.
  12. Schödl, A., Szeliski, R., Salesin, D. H., and Essa, I. (2000). Video textures. In Proceedings of SIGGRAPH '00, pages 489-498, New York, NY, USA. ACM Press/Addison-Wesley Publishing Co.
  13. Shashua, A. and Levin, A. (2001). Linear image coding for regression and classification using the tensor-rank principle. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 42-49.
  14. Stollnitz, E., DeRose, T., and Salesin, D. (1995). Wavelets for computer graphics: A primer. IEEE Computer Graphics and Applications, 15:76-84.
  15. Sumby, W. and Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26:212-215.
  16. Taylor, P., Black, A., and Caley, R. (1998). The architecture of the Festival speech synthesis system. In Third International Workshop on Speech Synthesis.
  17. Wang, Y., Huang, X., Lee, C.-S., Zhang, S., Li, Z., Samaras, D., Metaxas, D., Elgammal, A., and Huang, P. (2004). High resolution acquisition, learning and transfer of dynamic 3-d facial expressions. In Eurographics'04, pages 677-686.
  18. Ypsilos, I., Hilton, A., and Rowe, S. (2004). Video-rate capture of dynamic face shape and appearance. In 6th IEEE International Conference on Automatic Face and Gesture Recognition, pages 117-123.
  19. Zhang, L., Snavely, N., Curless, B., and Seitz, S. (2004). Spacetime faces: high resolution capture for modeling and animation. ACM Transactions on Graphics, 23(3):548-558.


Paper Citation


in Harvard Style

Edge, J. D. and Hilton, A. (2007). VISUAL SPEECH SYNTHESIS FROM 3D VIDEO. In Proceedings of the Second International Conference on Computer Graphics Theory and Applications - Volume 2: GRAPP, ISBN 978-972-8865-72-6, pages 57-62. DOI: 10.5220/0002080400570062


in Bibtex Style

@conference{grapp07,
author={J. D. Edge and A. Hilton},
title={VISUAL SPEECH SYNTHESIS FROM 3D VIDEO},
booktitle={Proceedings of the Second International Conference on Computer Graphics Theory and Applications - Volume 2: GRAPP},
year={2007},
pages={57-62},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002080400570062},
isbn={978-972-8865-72-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Second International Conference on Computer Graphics Theory and Applications - Volume 2: GRAPP
TI - VISUAL SPEECH SYNTHESIS FROM 3D VIDEO
SN - 978-972-8865-72-6
AU - Edge, J. D.
AU - Hilton, A.
PY - 2007
SP - 57
EP - 62
DO - 10.5220/0002080400570062