speech signals are not polluted by acoustic noise. In
such cases, applying ASR alone is sufficient to obtain
acceptable speech recognition performance.
Regarding the first problem, we have already developed a multi-angle VSR system, which accepts not only frontal but also diagonal and profile face images (S.Isobe et al., 2021c; S.Isobe et al., 2020; S.Isobe et al., 2021d). Therefore, in this article we focus on the second problem, processing time.
A straightforward strategy to handle this issue is to introduce a noise estimator or scene classifier: if the estimator judges the given audio data to be clean speech, we perform only ASR to obtain recognition results; otherwise, we carry out image processing followed by multi-angle AVSR.
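This routing strategy can be expressed as a minimal sketch. The function names and the threshold value below are hypothetical placeholders, not the paper's actual models; the stand-in `anomaly_score` simply uses mean absolute amplitude so the example runs end to end.

```python
# Hypothetical sketch of the clean/noisy routing strategy.
# anomaly_score is a trivial stand-in for the real scene classifier.

def anomaly_score(audio):
    """Stand-in scene classifier: higher score for noisier audio.
    Mean absolute amplitude is used purely as a runnable placeholder."""
    return sum(abs(x) for x in audio) / len(audio)

def recognize(audio, threshold=0.5):
    """Route clean audio to ASR only; noisy audio to multi-angle AVSR."""
    if anomaly_score(audio) <= threshold:
        return "ASR"             # fast path: audio-only recognition
    return "multi-angle AVSR"    # slow path: image processing + fusion

# Toy low-amplitude (clean) vs. high-amplitude (noisy) signals:
clean = [0.01, -0.02, 0.015, -0.01]
noisy = [0.9, -0.8, 0.95, -0.85]
print(recognize(clean))  # -> ASR
print(recognize(noisy))  # -> multi-angle AVSR
```

The point of the design is that the expensive visual pipeline is entered only when the classifier flags the audio as noisy.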
In this work we choose an anomaly-detection-based scene classifier for this purpose, adopting an anomalous sound detection approach in which the classifier is trained using only acoustically clean utterance data. A conventional scheme for anomaly detection is to employ a reconstruction model such as an Autoencoder (AE) or a Variational Autoencoder (VAE). In our case, however, these models are not appropriate: it is hard for an AE or VAE to reconstruct data with structures as complicated as speech signals, and in fact our preliminary experiments showed that the reconstruction rarely succeeded. Therefore, in this work we build an anomaly-detection-based scene classifier on a Parallel WaveGAN architecture (R.Yamamoto et al., 2020), one of the neural vocoders. Once the Parallel WaveGAN model has been trained successfully on clean speech, it can generate clean speech signals well, but it cannot correctly generate utterance data contaminated by acoustic noise. Consequently, noisy speech signals receive higher anomaly scores and can be easily discriminated from clean audio data.
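The scoring idea can be illustrated with a small sketch. The real system regenerates the waveform with a Parallel WaveGAN vocoder trained on clean speech; here `regenerate` is a hypothetical stand-in that mimics a model which reproduces clean speech well but fails on noisy input, and the threshold is an assumed value.

```python
# Sketch of an anomaly-detection-based scene classifier.
# regenerate() is a placeholder for a vocoder trained on clean speech:
# the clipping stands in for its failure on noisy content.

def regenerate(waveform, capacity=0.1):
    # Hypothetical "clean-trained" model: faithfully reproduces
    # low-amplitude (clean-like) content, fails beyond its capacity.
    return [max(-capacity, min(capacity, x)) for x in waveform]

def anomaly_score(waveform):
    """Mean squared error between the input and its regeneration."""
    rec = regenerate(waveform)
    return sum((x - y) ** 2 for x, y in zip(waveform, rec)) / len(waveform)

def is_noisy(waveform, threshold=0.01):
    return anomaly_score(waveform) > threshold

clean = [0.05, -0.08, 0.06, -0.04]   # within model capacity -> low score
noisy = [0.6, -0.7, 0.65, -0.55]     # poorly regenerated -> high score
print(is_noisy(clean), is_noisy(noisy))  # -> False True
```

Because the regeneration error is near zero only for the kind of data the model was trained on, thresholding that error separates clean from noisy utterances.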
We conducted evaluation experiments using two databases for multi-angle AVSR: the open corpus OuluVS2 (I.Anina et al., 2015) and the GAMVA database (S.Isobe et al., 2021a), which was proposed in our previous research. We employed our multi-angle VSR method, in which several angle-specific VSR models are applied simultaneously and integrated based on angle estimation results (S.Isobe et al., 2021c). As the ASR model, a 2D Convolutional Neural Network (2DCNN) was chosen, to which mel-frequency spectrograms were given. Each angle-specific VSR model consisted of a 3DCNN, and a Bidirectional Long Short-Term Memory (Bi-LSTM) network using facial feature points was adopted as the angle estimation model. Experimental results showed that our proposed multi-angle AVSR method with the scene classifier achieved higher recognition accuracy than ASR alone and ran faster than our previous multi-angle AVSR method.
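The integration step above can be sketched as a weighted sum of the angle-specific VSR posteriors, with the angle estimator's probabilities as weights. The numbers and the two-model setup below are toy assumptions for illustration; the paper's actual models are 3DCNN recognizers and a Bi-LSTM angle estimator.

```python
# Sketch of integrating angle-specific VSR outputs using angle-estimation
# probabilities (toy values; not the paper's trained models).

def integrate(vsr_posteriors, angle_probs):
    """Weight each angle-specific posterior by its estimated angle
    probability and sum them into one fused class posterior."""
    n_classes = len(vsr_posteriors[0])
    fused = [0.0] * n_classes
    for posterior, weight in zip(vsr_posteriors, angle_probs):
        for i, p in enumerate(posterior):
            fused[i] += weight * p
    return fused

# Two angle-specific models (e.g. frontal and profile), three word classes:
posteriors = [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2]]
angle_probs = [0.8, 0.2]   # estimator believes the face is mostly frontal
fused = integrate(posteriors, angle_probs)
print(fused)  # close to [0.62, 0.26, 0.12]
```

Since the weights and each posterior sum to one, the fused vector remains a valid class distribution dominated by the most probable viewing angle.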
The rest of this paper is organized as follows. In Section 2, we briefly review related work. Section 3 summarizes our proposed system. Section 4 introduces the proposed scene classification method, followed by ASR, multi-angle AVSR, and the angle estimator in Section 5. The two multi-angle audio-visual corpora, the experimental setup, results, and discussion are described in Section 6. Finally, Section 7 concludes this paper.
2 RELATED WORK
In this section, we briefly introduce AVSR, multi-angle VSR/AVSR, neural-vocoder-based Text-To-Speech (TTS), and anomaly detection.
2.1 AVSR
Many research works have been conducted on AVSR; here we introduce a few state-of-the-art examples. An AVSR sys-
tem based on a recurrent-neural-network transducer
architecture was built in (T.Makino et al., 2019).
They evaluated the system using the LRS3-TED data
set, achieving high performance. In (P.Zhou et al.,
2019), the authors proposed a multimodal attention-
based method for AVSR, which could automati-
cally learn fused representations from both modal-
ities based on their importance. They employed
sequence-to-sequence architectures, and confirmed
high recognition performance under both acoustically
clean and noisy conditions. Another AVSR system
using a transformer-based architecture was proposed
in (G.Paraskevopoulos et al., 2020). Experimental results on the How2 data set show that the system achieved a relative word-error-rate improvement over sub-word prediction models. In (S.Isobe et al., 2021b), we proposed
an AVSR method based on Deep Canonical Correlation Analysis (DCCA). DCCA generates projections from the two modalities into one common space such that the correlation of the projected vectors is maximized. We employed DCCA techniques with the audio and visual modalities to enhance the robustness of ASR. As a result, we confirmed that the DCCA features of each modality improved recognition compared with the original features, yielding better ASR results in various noisy environments.
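The DCCA objective can be made concrete with a toy illustration: project each modality's features into a shared space and measure the correlation that DCCA maximizes. Real DCCA trains deep networks; here the projections are fixed linear maps with assumed values, and we simply compute the Pearson correlation of the projected sequences.

```python
# Toy illustration of the DCCA objective (hypothetical feature values and
# projection weights; real DCCA learns deep nonlinear projections).

def project(features, weights):
    """Linear projection of each feature row onto one dimension."""
    return [sum(f * w for f, w in zip(row, weights)) for row in features]

def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

audio_feats = [[1.0, 0.0], [2.0, 0.1], [3.0, 0.2]]   # toy audio features
visual_feats = [[0.9, 0.5], [2.1, 0.4], [2.9, 0.6]]  # toy visual features
u = project(audio_feats, [1.0, 0.5])    # audio-side projection
v = project(visual_feats, [1.0, 0.0])   # visual-side projection
corr = pearson(u, v)   # DCCA adjusts the projections to maximize this
print(round(corr, 3))
```

In actual DCCA training, the projection parameters are network weights optimized by gradient ascent on this correlation.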
ICPRAM 2022 - 11th International Conference on Pattern Recognition Applications and Methods