Authors:
Luca Cappelletta and Naomi Harte
Affiliation:
Trinity College Dublin, Ireland
Keyword(s):
AVSR, Viseme, PCA, DCT, Optical flow.
Related Ontology Subjects/Areas/Topics:
Applications; Artificial Intelligence; Audio and Speech Processing; Cardiovascular Imaging and Cardiography; Cardiovascular Technologies; Digital Signal Processing; Health Engineering and Technology Applications; Knowledge Engineering and Ontology Development; Knowledge-Based Systems; Multimedia; Multimedia Signal Processing; Natural Language Processing; Pattern Recognition; Signal Processing; Software Engineering; Symbolic Systems; Telecommunications
Abstract:
Phonemes are the standard modelling unit in HMM-based continuous speech recognition systems. Visemes are the equivalent unit in the visual domain, but there is less agreement on precisely what visemes are, or on how many to model on the visual side in audio-visual speech recognition systems. This paper compares the use of five viseme maps in a continuous speech recognition task. The study focuses on visual-only recognition in order to isolate the effect of the choice of viseme map. All the maps are based on the phoneme-to-viseme approach and are created using either a linguistic method or a data-driven method. DCT, PCA and optical flow are used to derive the visual features. The best visual-only recognition on the VidTIMIT database is achieved using a linguistically motivated viseme set. These initial experiments demonstrate that the choice of visual unit requires more careful attention in audio-visual speech recognition system development.
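The phoneme-to-viseme approach mentioned in the abstract can be illustrated with a minimal sketch. The mapping below is hypothetical and chosen only for illustration: the class names, the phonemes listed, and the helper `phonemes_to_visemes` are assumptions, not one of the five maps compared in the paper. It shows the many-to-one nature of such a map, where phonemes that look alike on the lips share a single visual class.

```python
# Hypothetical many-to-one phoneme-to-viseme map (illustrative only;
# NOT one of the five maps evaluated in the paper).
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "sil": "V_silence",
}


def phonemes_to_visemes(phonemes, mapping=PHONEME_TO_VISEME, unknown="V_other"):
    """Relabel a phoneme sequence with viseme classes.

    Phonemes missing from the map fall back to the `unknown` class.
    """
    return [mapping.get(p, unknown) for p in phonemes]


if __name__ == "__main__":
    # "bat" and "mat" differ acoustically but map to the same viseme sequence.
    print(phonemes_to_visemes(["b", "ae", "t"]))
    print(phonemes_to_visemes(["m", "ae", "t"]))
```

Under a map like this, phoneme-level transcriptions can be relabelled with viseme classes before training visual-only models, which is why the number and composition of classes in the map can affect recognition results.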