AN AUDIO-VISUAL SPEECH RECOGNITION SYSTEM FOR TESTING NEW AUDIO-VISUAL DATABASES

Tsang-Long Pao, Wen-Yuan Liao

Abstract

For the past several decades, visual speech signal processing has been an attractive research topic for overcoming certain audio-only recognition problems. In recent years, many automatic speech-reading systems have been proposed that combine audio and visual speech features. The objective of all such audio-visual speech recognizers is to improve recognition accuracy, particularly under difficult conditions. In this paper, we focus on visual feature extraction for audio-visual recognition. We create a new audio-visual database recorded in two languages, English and Mandarin. Audio-visual recognition consists of two main steps: feature extraction and recognition. We extract the visual motion features of the lips in front-end processing, and a hidden Markov model (HMM) is used for audio-visual speech recognition. We describe our audio-visual database and use it in our proposed system, with some preliminary experiments.



Paper Citation


in Harvard Style

Pao T. and Liao W. (2006). AN AUDIO-VISUAL SPEECH RECOGNITION SYSTEM FOR TESTING NEW AUDIO-VISUAL DATABASES. In Proceedings of the First International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, ISBN 972-8865-40-6, pages 192-196. DOI: 10.5220/0001369101920196


in Bibtex Style

@conference{visapp06,
author={Tsang-Long Pao and Wen-Yuan Liao},
title={AN AUDIO-VISUAL SPEECH RECOGNITION SYSTEM FOR TESTING NEW AUDIO-VISUAL DATABASES},
booktitle={Proceedings of the First International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP},
year={2006},
pages={192-196},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001369101920196},
isbn={972-8865-40-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the First International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP
TI - AN AUDIO-VISUAL SPEECH RECOGNITION SYSTEM FOR TESTING NEW AUDIO-VISUAL DATABASES
SN - 972-8865-40-6
AU - Pao T.
AU - Liao W.
PY - 2006
SP - 192
EP - 196
DO - 10.5220/0001369101920196