Luca Cappelletta, Naomi Harte


Phonemes are the standard modelling unit in HMM-based continuous speech recognition systems. Visemes are the equivalent unit in the visual domain, but there is less agreement on precisely what visemes are, or on how many should be modelled on the visual side of an audio-visual speech recognition system. This paper compares five viseme maps in a continuous speech recognition task. The study focuses on visual-only recognition in order to examine the choice of viseme map. All the maps follow the phoneme-to-viseme approach and are created using either a linguistic method or a data-driven method. The visual features are derived using DCT, PCA and optical flow. The best visual-only recognition on the VidTIMIT database is achieved with a linguistically motivated viseme set. These initial experiments demonstrate that the choice of visual unit requires more careful attention in audio-visual speech recognition system development.
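
As a rough illustration of the phoneme-to-viseme approach mentioned above, the sketch below relabels a phoneme transcription with viseme classes through a many-to-one lookup. The grouping, the class names and the `to_visemes` helper are assumptions made for this example only; they are not any of the five maps compared in the paper.

```python
# Hypothetical many-to-one phoneme-to-viseme map, for illustration only.
# It is NOT one of the five maps compared in the paper.
PHONEME_TO_VISEME = {
    # Bilabials look alike on the lips, so they share one visual class.
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    # Labiodentals share another class.
    "f": "V_labiodental", "v": "V_labiodental",
    "sil": "V_silence",
}


def to_visemes(phonemes, mapping=PHONEME_TO_VISEME, unknown="V_other"):
    """Relabel a phoneme sequence with viseme classes.

    Phonemes missing from the mapping fall back to the `unknown` class.
    """
    return [mapping.get(p, unknown) for p in phonemes]


# Three acoustically distinct words collapse to one viseme sequence.
pat = to_visemes(["p", "ae", "t"])
bat = to_visemes(["b", "ae", "t"])
mat = to_visemes(["m", "ae", "t"])
```

Under this toy map, `pat`, `bat` and `mat` are identical label sequences, which is one way to see why the number and definition of visual units affects what a visual-only recogniser can distinguish.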



Paper Citation

in Harvard Style

Cappelletta L. and Harte N. (2012). PHONEME-TO-VISEME MAPPING FOR VISUAL SPEECH RECOGNITION. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM, ISBN 978-989-8425-99-7, pages 322-329. DOI: 10.5220/0003731903220329
