Table 2: Examples of English ASR results in Figure 1 (a). Underlined words indicate recognition errors.
Reference:  THE WHOLE CAMP WAS COLLECTED BEFORE A RUDE CABIN ON THE OUTER EDGE OF THE CLEARING
Recognized: THE WHOLE CAMP WAS COLLECTE D BEFORE A ROT CABIN ON THE OUTER EDGE OF THE CLEARING

Reference:  INTO THE LAND BEYOND THE SYRIAN DESERT BUT NEITHER OF THEM DREAMED THAT THE SCATTERED AND DISUNITED TRIBES OF ARABIA WOULD EVER COMBINE OR BECOME A SERIOUS DANGER
Recognized: INTO THE LAND BEYOND THE SYRIAN DESERT BUT EITHER OF THEM DREAMED THAT THE SCATTERED AND DISUNITED TRIBES OF ARABIA WOULD E V ER COMBINE OR BECOME A SERIOUS DANGER
(c) We further introduce a new text decoder for the indigenous language. Next, we build a text autoencoder for the minor language, consisting of the above encoder and this decoder. Using text data written in the indigenous language, we apply model training only to the decoder (a code sketch of steps (c) and (d) follows the list).
(d) Finally, we use the HuBERT encoder as a feature extractor and the text decoder for the indigenous language to build an ASR system for the minor language. It is said that English phonemes fully cover Japanese vowels and consonants; therefore, the English feature extractor is expected to work for Japanese speech data as well. Note that in this paper, to improve the performance, we apply fine-tuning not only to the decoder but also to a part of the encoder.
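The following is a minimal PyTorch sketch of steps (c) and (d). It assumes the HuggingFace `facebook/hubert-base-ls960` checkpoint, a hypothetical character-vocabulary size, and, since the exact layer split is an implementation detail, that the lower eight of the twelve encoder transformer layers stay frozen during fine-tuning; it illustrates the scheme rather than reproducing the exact training code.

import torch
import torch.nn as nn
from transformers import HubertModel

D_MODEL, N_HEADS, VOCAB = 768, 12, 64   # VOCAB is a hypothetical character-set size

def causal_mask(size):
    # Upper-triangular mask so the decoder cannot attend to future tokens.
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

# Step (c): a text autoencoder -- a small text encoder plus the new text decoder.
class TextAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, N_HEADS, batch_first=True), num_layers=2)
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src_ids, tgt_ids):
        memory = self.encoder(self.embed(src_ids))
        h = self.decoder(self.embed(tgt_ids), memory,
                         tgt_mask=causal_mask(tgt_ids.size(1)))
        return self.out(h)

ae = TextAutoencoder()
for p in ae.encoder.parameters():   # text-only training updates only the decoder side
    p.requires_grad = False

# Step (d): swap the text encoder for HuBERT speech features, keep the decoder.
class HubertAsr(nn.Module):
    def __init__(self, ae):
        super().__init__()
        self.hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.embed, self.decoder, self.out = ae.embed, ae.decoder, ae.out

    def forward(self, waveform, tgt_ids):
        memory = self.hubert(waveform).last_hidden_state   # (batch, frames, 768)
        h = self.decoder(self.embed(tgt_ids), memory,
                         tgt_mask=causal_mask(tgt_ids.size(1)))
        return self.out(h)

asr = HubertAsr(ae)
for p in asr.hubert.feature_extractor.parameters():   # freeze the CNN front end
    p.requires_grad = False
for layer in asr.hubert.encoder.layers[:8]:           # assumption: lower 8 of 12
    for p in layer.parameters():                      # transformer layers stay fixed
        p.requires_grad = False

Freezing the CNN front end and the lower encoder layers keeps the language-independent acoustic features intact, while the upper layers and the decoder adapt to the target language.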
The advantage of this scheme is that, ideally, we do not need any speech data for the indigenous language, or only a small amount of data for fine-tuning to finalize the ASR model. As mentioned above, it is hard to collect speech data for such a minor language with a small population. In our scheme we utilize a pre-trained SSL-based feature extractor that was originally built for a different language, because the human speech production system is language independent. Furthermore, for indigenous languages it is relatively easier to collect text data than speech data; we can obtain text data from official government documents, textbooks, news sites, and internet articles such as Wikipedia.
3 EXPERIMENT
We conducted experiments to evaluate the effectiveness of our proposed approach. First, we report preliminary experimental results on training data size for an indigenous language. Second, we examine the performance of our NMT model and autoencoder to check the Japanese encoder and decoder. Finally, we evaluate our Japanese ASR. Table 1 shows the model training and fine-tuning settings used in the following experiments.
3.1 Preliminary Experiments
3.1.1 Machine Translation
In our previous work, we investigated the influence of training data size and model complexity in NMT. We used the MultiUN and Wikipedia data sets provided in OPUS (Tiedemann, 2012) to obtain parallel sentences. We then chose German-English sentence pairs as a training data set. Although German has a large speaker population, in this experiment German was treated as an indigenous language, while English was treated as a major language. We employed a pre-trained NMT model provided by OpenNMT (Klein et al., 2017), which was based on a tiny transformer; the encoder and decoder each had six layers. A transformer model was then explored with different settings, such as the number of layers in the encoder and decoder parts and the number of training sentences.
It turns out that, with the small data set, we can build an NMT model that achieves roughly the same performance as the pre-trained model by adjusting the hyperparameters; we should make the encoder for the indigenous language smaller to maintain the translation performance, while the decoder should remain large because it directly affects the performance. It is well known that the larger the training data set becomes, the better the NMT performance gets. On the other hand, it is sometimes hard to obtain larger data sets. According to our preliminary results, in this work we decided to use 10,000 sentences in the following experiments, which is quite small compared to the data sets used in existing works.
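As an illustration of this asymmetric sizing, the short PyTorch sketch below builds a transformer with a reduced encoder and a full-depth decoder. The layer counts and other hyperparameters are illustrative assumptions, not the exact settings explored in these experiments.

import torch.nn as nn

# Smaller encoder for the low-resource source language, full-depth decoder:
# the decoder directly affects translation quality, so it keeps the
# pre-trained model's six layers while the encoder is reduced.
# All values below are illustrative, not the settings used in this work.
nmt = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=3,   # reduced encoder (assumption)
    num_decoder_layers=6,   # kept at the pre-trained depth
    dim_feedforward=2048,
    batch_first=True,
)
print(sum(p.numel() for p in nmt.parameters()) / 1e6, "M parameters")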
3.1.2 English Speech Recognition
Next, we tested an English ASR system, shown in Figure 1 (a). We adopted an English HuBERT model provided by Facebook, which was trained on 960 hours of spoken data from LibriSpeech (Panayotov et al., 2015). The model consisted of a CNN encoder and a 12-layer transformer. As an English text decoder, we employed a two-layer transformer, each layer having 12 attention heads. When building the ASR system, in the encoder transformer we fixed the eight layers on the