Efficient Multi-angle Audio-visual Speech Recognition using Parallel WaveGAN based Scene Classifier

Shinnosuke Isobe, Satoshi Tamura, Yuuto Gotoh, Masaki Nose

2022

Abstract

Audio-Visual Speech Recognition (AVSR), an Automatic Speech Recognition (ASR) approach that is robust against acoustic noise, has recently been widely researched. AVSR combines ASR with Visual Speech Recognition (VSR). For real applications, VSR must accept both frontal and non-frontal face images, and the computational time for image processing must be reduced. In this paper, we propose an efficient multi-angle AVSR method using a Parallel-WaveGAN-based scene classifier. The classifier estimates whether given speech data were recorded in a clean or a noisy environment. If the classifier detects a noisy environment, multi-angle AVSR is conducted to enhance recognition accuracy; if it predicts clean speech, only ASR is performed to avoid increasing processing time. We evaluated our framework using two multi-angle audio-visual databases: OuluVS2, an English corpus with 5 views, and GAMVA, a Japanese phrase corpus with 12 views. Experimental results show that the scene classifier worked well and that multi-angle AVSR achieved higher recognition accuracy than ASR. In addition, our approach saved processing time by switching recognizers according to the noise condition.
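
The switching behaviour described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Utterance`, `recognize`, `classify_scene`, `run_asr`, and `run_avsr` are hypothetical, and the scene classifier and both recognizers are passed in as opaque callables, because the abstract does not specify their internals. The sketch only shows the control flow: run the costlier multi-angle AVSR path when the scene is judged noisy, and audio-only ASR otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Utterance:
    """Hypothetical container for one recording (not from the paper)."""
    audio: Sequence[float]          # raw audio samples
    video_frames: Sequence[object]  # face images, possibly from several angles


def recognize(
    utt: Utterance,
    classify_scene: Callable[[Sequence[float]], str],
    run_asr: Callable[[Sequence[float]], str],
    run_avsr: Callable[[Sequence[float], Sequence[object]], str],
) -> Tuple[str, str]:
    """Return (transcript, recognizer_used).

    classify_scene is assumed to return "clean" or "noisy" from audio alone;
    how it does so (Parallel WaveGAN in the paper) is outside this sketch.
    """
    scene = classify_scene(utt.audio)
    if scene == "noisy":
        # Noisy scene: pay for image processing to gain robustness.
        return run_avsr(utt.audio, utt.video_frames), "avsr"
    # Clean scene: skip the visual stream to save processing time.
    return run_asr(utt.audio), "asr"
```

With stub callables, for example `recognize(utt, lambda a: "clean", my_asr, my_avsr)`, the visual recognizer is never invoked for clean input, which is where the processing-time saving reported in the abstract would come from.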

Paper Citation


in Harvard Style

Isobe S., Tamura S., Gotoh Y. and Nose M. (2022). Efficient Multi-angle Audio-visual Speech Recognition using Parallel WaveGAN based Scene Classifier. In Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-549-4, pages 449-460. DOI: 10.5220/0010846000003122


in Bibtex Style

@conference{icpram22,
author={Shinnosuke Isobe and Satoshi Tamura and Yuuto Gotoh and Masaki Nose},
title={Efficient Multi-angle Audio-visual Speech Recognition using Parallel WaveGAN based Scene Classifier},
booktitle={Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2022},
pages={449-460},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010846000003122},
isbn={978-989-758-549-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Efficient Multi-angle Audio-visual Speech Recognition using Parallel WaveGAN based Scene Classifier
SN - 978-989-758-549-4
AU - Isobe S.
AU - Tamura S.
AU - Gotoh Y.
AU - Nose M.
PY - 2022
SP - 449
EP - 460
DO - 10.5220/0010846000003122