vanced deep-learning speech recognition model built
for a particular major language, by adapting the model
to a minority language that has few resources. As described above, we now have high-performance models such as BERT (Devlin et al., 2018) and its successor XLNet (Yang et al., 2019). For major languages, it is relatively easy to obtain end-to-end deep learning models whose acoustic front-end can also serve other languages. Therefore, if we adapt, i.e., fine-tune, the language-model part of such a model to a minority language using its small amount of data, we may obtain a speech recognizer for the minority language that employs a state-of-the-art deep learning architecture. We compare the proposed approach with a conventional scheme; a recurrent neural network model is chosen as the competitive baseline, since such a model can be built from a small number of training utterances. We evaluate both methods by recognition accuracy. Note that in this paper, for experimental reasons, we regard English as the major language and Japanese as the minority language; that is, we use only a few Japanese utterances in our experiments.
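The adaptation idea above, keeping the pre-trained acoustic part fixed while updating only the language-model part on a small data set, can be sketched with a toy linear model. This is our illustration, not the paper's implementation; all names, shapes, and the synthetic data are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pre-trained" acoustic front-end: a fixed linear projection (frozen).
W_frontend = rng.normal(size=(8, 4))
W_frontend_before = W_frontend.copy()

# Trainable "language-model" head to be adapted to the minority language.
W_head = rng.normal(size=(4, 2))

# Small adaptation set; targets are realizable through the frozen front-end.
X = rng.normal(size=(32, 8))
Y = X @ W_frontend @ rng.normal(size=(4, 2))

def mse(W):
    return float(np.mean((X @ W_frontend @ W - Y) ** 2))

loss_before = mse(W_head)
lr = 0.02
for _ in range(500):
    H = X @ W_frontend                      # features from the frozen front-end
    grad = H.T @ (H @ W_head - Y) / len(X)  # gradient w.r.t. the head only
    W_head -= lr * grad                     # only the head is updated
loss_after = mse(W_head)
```

After adaptation the front-end weights are bit-identical to their pre-trained values while the head has fit the new targets, which mirrors the intended division of labor between the acoustic and language-model parts.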
Our contribution is as follows: we show that a deep-learning-based speech recognition model for a minority language can be obtained by adapting a pre-trained HuBERT model that was trained on a large-scale corpus in a major language. This ultimately improves recognition accuracy compared with any model trained from scratch on a small dataset in the minority language.
2 RELATED WORK
In recent years, speech recognition using deep learning has made remarkable progress. A decade ago, researchers began employing deep learning mainly to extract acoustic features, combining it with hidden Markov models, in which deep networks computed feature observation probabilities in place of Gaussian mixtures. Later, recurrent neural networks such as Long Short-Term Memory (LSTM) replaced these conventional models. As new architectures such as the attention mechanism and the Transformer appeared, speech recognizers exploited them as well. For instance, QuartzNet (Kriman et al., 2020), Conformer (Gulati et al., 2020), and ContextNet (Han et al., 2020) have improved recognition accuracy by training models with a large number of parameters.
To build such models, it is essential to prepare labeled data for training. However, only a small portion of existing speech data is well labeled, while the rest remains unlabeled. Methods that exploit unlabeled speech in addition to labeled speech for training large-scale models have therefore been studied. One example of self-supervised learning that uses unlabeled speech as training data is wav2vec (Schneider et al., 2019), which performs representation learning through contrastive learning: the model is pre-trained on unlabeled speech and then fine-tuned on a small amount of labeled speech. Another approach, HuBERT (Hsu et al., 2021), is a representation learning model that follows wav2vec and performs pre-training using pseudo-labels obtained by clustering speech features.
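As a rough sketch of this pseudo-labeling idea (our toy illustration, not the original HuBERT recipe), frame-level features can be clustered with k-means, and the resulting cluster indices then serve as discrete targets for masked prediction:

```python
import numpy as np

def kmeans_labels(frames, k, iters=20):
    """Cluster feature frames with Lloyd's algorithm; return per-frame labels."""
    # Farthest-point initialization: deterministic and well spread out.
    centers = [frames[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(frames - c, axis=1) for c in centers], axis=0)
        centers.append(frames[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        # Distance of every frame to every center, then nearest assignment.
        dist = np.linalg.norm(frames[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = frames[labels == j].mean(axis=0)
    return labels

# Toy frame-level features drawn from two well-separated acoustic "units".
rng = np.random.default_rng(1)
frames = np.concatenate([
    rng.normal(loc=0.0, scale=0.1, size=(50, 13)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 13)),
])
labels = kmeans_labels(frames, k=2)  # pseudo-labels for masked prediction
```

Frames belonging to the same acoustic unit receive the same pseudo-label, giving the model a discrete target vocabulary without any manual transcription.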
For minority languages, low-resource speech recognition has been studied before. One scheme (Bansal et al., 2018) used an LSTM-based model for speech-to-text translation and showed that fine-tuning after pre-training on large data improved accuracy. Multilingual speech recognition schemes (Dalmia et al., 2018; Fathima et al., ) and a meta-learning method (Hsu et al., 2019) have also been proposed. Furthermore, some attempts improve accuracy through data augmentation that compensates for the lack of training data. For instance, MixSpeech (Hsu et al., 2019) trains a recognition model on a weighted combination of two different speech features as input and improves accuracy.
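The core operation, mixing two feature sequences (and, in the same spirit, their losses) with a weight, can be sketched as follows. This is our minimal illustration; the function and variable names are ours, and in the actual method the weight is typically drawn from a distribution rather than fixed:

```python
import numpy as np

def mix_features(feat_a, feat_b, lam):
    """Weighted combination of two same-shape feature sequences (mixup-style)."""
    return lam * feat_a + (1.0 - lam) * feat_b

def mix_losses(loss_a, loss_b, lam):
    """The training loss is mixed with the same weight as the inputs."""
    return lam * loss_a + (1.0 - lam) * loss_b

rng = np.random.default_rng(0)
spec_a = rng.normal(size=(100, 80))  # e.g. 100 frames x 80 mel bins
spec_b = rng.normal(size=(100, 80))
lam = 0.3                            # fixed here only for illustration
mixed = mix_features(spec_a, spec_b, lam)
```

The mixed input is paired with both utterances' transcripts during training, so a single synthetic example regularizes the model toward both targets at once.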
3 METHODOLOGY
This section describes the details of our proposed scheme.
We employ HuBERT as the recognition model. As described above, this work aims to build a speech recognition model for a minority language from a model for a major language. Therefore, the HuBERT model is first pre-trained on large-scale English speech data. HuBERT is a representation learning model and is considered to have high generalization performance. Hence, HuBERT is expected to learn the general acoustic feature extraction part well, in addition to the language modeling part specific to English. After pre-training, we fine-tune the model on a Japanese speech data set consisting of a smaller amount of data. Through fine-tuning, the English language modeling part is expected to be replaced with a Japanese one, while the acoustic processing part is kept. It is known that Japanese has fewer phonemes than English; that is, some phonemes used in English, such as /th/ and /ae/, are missing. The Japanese phoneme set is thus a sub-