Authors:
Shinnosuke Isobe 1; Satoshi Tamura 2; Yuuto Gotoh 3 and Masaki Nose 3
Affiliations:
1 Graduate School of Natural Science and Technology, Gifu University, Gifu, Japan
2 Faculty of Engineering, Gifu University, Gifu, Japan
3 Ricoh Company, Ltd., Kanagawa, Japan
Keyword(s):
Scene Classification, Audio-visual Speech Recognition, Multi-angle Lipreading, Anomaly Detection, Neural Vocoder.
Abstract:
Recently, Audio-Visual Speech Recognition (AVSR), an Automatic Speech Recognition (ASR) approach that is robust against acoustic noise, has been widely researched. AVSR combines ASR and Visual Speech Recognition (VSR). For real applications, VSR must accept both frontal and non-frontal face images, and the computational time for image processing must be reduced. In this paper, we propose an efficient multi-angle AVSR method using a Parallel-WaveGAN-based scene classifier. The classifier estimates whether given speech data were recorded in a clean or a noisy environment. If the classifier detects a noisy environment, multi-angle AVSR is conducted to enhance recognition accuracy; if it predicts clean speech data, only ASR is performed to avoid increasing the processing time. We evaluated our framework using two multi-angle audio-visual databases: OuluVS2, an English corpus with 5 views, and GAMVA, a Japanese phrase corpus with 12 views. Experimental results show that the scene classifier worked well and that multi-angle AVSR achieved higher recognition accuracy than ASR. In addition, our approach could save processing time by switching recognizers according to the noise condition.
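The switching rule described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `recognize`, `is_noisy`, `run_asr`, `run_avsr`, and `load_video` are hypothetical names standing in for the scene classifier, the two recognizers, and the image-processing front end.

```python
from typing import Any, Callable


def recognize(
    audio: Any,
    load_video: Callable[[], Any],
    is_noisy: Callable[[Any], bool],
    run_asr: Callable[[Any], str],
    run_avsr: Callable[[Any, Any], str],
) -> str:
    """Pick a recognizer based on the scene classifier's decision.

    All callables are hypothetical placeholders:
      is_noisy   - scene classifier; True if the audio is judged noisy
      run_asr    - audio-only recognizer
      run_avsr   - multi-angle audio-visual recognizer
      load_video - deferred image processing, invoked only when needed
    """
    if is_noisy(audio):
        # Noisy scene: pay the image-processing cost and use AVSR.
        return run_avsr(audio, load_video())
    # Clean scene: skip image processing entirely and use ASR alone.
    return run_asr(audio)
```

Passing the video as a deferred loader is one possible design choice: it keeps the image-processing cost out of the clean-speech path, which is where the abstract reports the time saving.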