Language-Aware and Language-Agnostic Multilingual Speech
Recognition with a Single Model
Karol Nowakowski 1 (https://orcid.org/0000-0001-7435-4061) and Michal Ptaszynski 2 (https://orcid.org/0000-0002-1910-9183)
1 Tohoku University of Community Service and Science, Sakata, Yamagata, Japan
2 Kitami Institute of Technology, Kitami, Hokkaido, Japan
Keywords:
Speech Recognition, Multilingual Learning, Adapters, Language Identifiers, Slavic Languages.
Abstract:
In recent years, there has been increasing interest in multilingual speech recognition systems, where a single
model can transcribe speech in multiple languages. An additional benefit of multilingual learning is that it allows for cross-lingual transfer, often leading to better performance, especially in low-resource languages. On the
other hand, multilingual models suffer from errors caused by confusion between languages. This problem
can be mitigated by providing the information about language identity as an additional input to the model.
In this research, we carry out experiments using a modern state-of-the-art ASR system architecture based
on a pretrained multilingual wav2vec 2.0 model and adapter modules trained for the downstream task, and
confirm that multilingual supervised learning with language identifiers is a viable method for improving the
system’s overall performance. Furthermore, we find that training with language identifiers still yields a model
with better average performance than the model trained without such information, even if language identity is
unknown at inference time.
1 INTRODUCTION
Previous works on automatic speech recognition
(Toshniwal et al., 2018; Conneau et al., 2021; Babu
et al., 2021), as well as other language processing
tasks (Johnson et al., 2017; Conneau et al., 2020),
have shown that combining data in multiple languages
to train a single model that can support all of them,
instead of developing separate monolingual models,
not only saves resources, but can also result in better
performance, especially for low-resource languages.
This is particularly true if the languages used in
training are closely related or share common linguis-
tic traits (Conneau et al., 2021; Nowakowski et al.,
2023).
In the most basic approach where data in multi-
ple languages is simply pooled together and no ad-
ditional information is supplied, the model needs to
learn to infer the input language identity. This re-
quirement can be removed by incorporating language
identifiers (LID) into the system, in the form of a
language vector (Li et al., 2018; Toshniwal et al.,
2018) or LID prefixes attached to each utterance in
the data (Nowakowski and Ptaszynski, 2023), result-
ing in improved performance. With the exception of
(Nowakowski and Ptaszynski, 2023), previous studies
seem to make an assumption – either implicitly or ex-
plicitly, as in (Zhou et al., 2022; Houston et al., 2024)
– that language identifiers are only useful if used both
in training and inference.
In recent years, state-of-the-art multilingual ASR
systems have often been built by combining an initial
speech recognition model or speech representation
model such as wav2vec 2.0 (Baevski et al., 2020)
pretrained on multilingual data, with language-
specific fine-tuning (Kannan et al., 2019; Conneau
et al., 2021). Recently, (Pratap et al., 2023) de-
veloped a speech recognition model with separate
adapter modules (Rebuffi et al., 2017; Houlsby et al.,
2019) for more than 1,000 languages. On the other
hand, (Nowakowski et al., 2023) and (Nowakowski
and Ptaszynski, 2023) demonstrated that multilingual
supervised fine-tuning can yield improved results in
a low-resource setting.
In this research, we build a speech recognition
model supporting seven Slavic languages by training
a single adapter module and using language identi-
fiers. We show that on average the proposed approach
achieves lower error rates than monolingual adapters.
Furthermore, we investigate the performance of a
model trained with LID information on test data with-
out such information, and find that it yields similar
or lower error rates than a multilingual model trained
in a language-agnostic manner. This means that the
same model can be utilized both in scenarios where
the input language is explicitly specified and when it
is unknown, without loss in quality of the system’s
output compared to a setup using two separate mod-
els.
The remainder of this paper is organized as fol-
lows. Section 2 provides an overview of related re-
search. In Section 3, we explain our research method-
ology. Section 4 describes the experimental setup.
In Section 5, we present and analyze our experimen-
tal results. Section 6 discusses the limitations of our
study. Finally, Section 7 offers concluding remarks
and ideas for future improvements.
2 RELATED WORK
Multilingual neural speech recognition with language
identifiers injected into the model's input has been stud-
ied, among others, by (Li et al., 2018; Toshniwal
et al., 2018; Zhu et al., 2020; Nowakowski and
Ptaszynski, 2023).
Alternative methods for utilizing language iden-
tity information include using a multi-task learning
architecture for jointly recognizing speech and pre-
dicting language identity (Toshniwal et al., 2018;
Zhang et al., 2022), implementing a language-specific
gating mechanism (Kim and Seltzer, 2018), and
language-specific attention heads (Zhu et al., 2020).
Recently, and in particular after the paradigm shift
in language processing that took place with the advent
of pretrained language representation models such as
BERT (Devlin et al., 2019) and wav2vec 2.0 (Baevski
et al., 2020), the development of multilingual ASR
systems has often been performed as a two-stage pro-
cess, where multilingual pretraining is followed by
language-specific fine-tuning (Kannan et al., 2019;
Conneau et al., 2021). On the other hand, studies by
(Nowakowski et al., 2023; Nowakowski and Ptaszyn-
ski, 2023) demonstrated that fine-tuning jointly on
multiple languages can lead to improved results in a
low-resource setting. In order to avoid catastrophic
forgetting and reduce computational cost, fine-tuning
is in many cases only applied to a subset of the model
layers (Liu et al., 2024) or utilizes adapter modules
inserted into the pretrained network (Kannan et al.,
2019; Pratap et al., 2023).
Using a combination of multilingual adapter fine-
tuning and LID in multilingual ASR was previously
investigated by (Shen et al., 2023). Unlike in our
study, they did not observe improvements in overall
performance in comparison to monolingual baselines
when using a large-sized (>300M parameters) pre-
trained model.
(Nowakowski and Ptaszynski, 2023) found the
positive effects of using language identifiers in model
training to persist, even if they are unavailable at in-
ference time. Based on these results, they proposed
a hypothesis that “the additional knowledge about the
relationships and differences between the languages
used in fine-tuning, learned by the agency of the lan-
guage identifiers, can be to a large extent reused in in-
ference regardless of their presence in the new data”.
However, they only experimented with bilingual and
trilingual models and performed all the evaluations on
a single language.
3 PROPOSED METHOD
Our system is based on the MMS architecture (Pratap
et al., 2023), which uses a pretrained wav2vec 2.0
model and language-specific adapter modules trained
to transcribe speech. Compared to the original frame-
work proposed in (Pratap et al., 2023), we intro-
duce the following modifications: (i) training a single
adapter module jointly on data in multiple languages,
and (ii) adding language identity information to the
model’s input.
3.1 Multilingual Adapter Training
We investigate the possibility of improving overall
ASR performance in a multilingual setting by train-
ing a single adapter module for transcribing speech in
multiple languages, rather than separate adapters for
each language. To that end, we simply pool the la-
beled data of all languages together to form a single
multilingual training set.
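For illustration, a minimal sketch of this pooling step using the Hugging Face Datasets library is shown below; the Common Voice configuration codes and the added language bookkeeping column are illustrative assumptions rather than a description of our exact implementation.

```python
# Sketch of the data pooling step: the labeled training data of all seven
# languages is simply concatenated into a single multilingual training set.
# The Common Voice configuration codes and the added `language` bookkeeping
# column are our own illustrative choices.
from datasets import concatenate_datasets, load_dataset

CV_CODES = ["cs", "bg", "sk", "sr", "mk", "sl", "hsb"]  # assumed Common Voice codes

def load_train_split(code):
    ds = load_dataset("mozilla-foundation/common_voice_17_0", code, split="train")
    # Keep track of the language of each utterance for later analysis.
    return ds.add_column("language", [code] * len(ds))

# A single multilingual training set: data of all languages pooled together.
multilingual_train = concatenate_datasets([load_train_split(c) for c in CV_CODES])
multilingual_train = multilingual_train.shuffle(seed=42)
```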
3.2 Language-Aware Training
With the aim of reducing the number of errors caused
by confusion between languages, we provide the
model with the information about language iden-
tity (LID). Specifically, we follow (Nowakowski and
Ptaszynski, 2023) in including this information di-
rectly in the data, in the form of short, artificially gen-
erated audio clips (different for each language) pre-
fixed to every audio file in the corpus. The LID pre-
fixes are 25 ms long, which corresponds to the recep-
tive field of the wav2vec 2.0 encoder (Baevski et al.,
2020).
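To illustrate this data modification, the sketch below prepends a 25 ms language-specific clip to a waveform; representing each language by a sine tone of a distinct frequency is our own assumption for illustration, since the method only requires the clips to be short, artificially generated, and different for each language.

```python
# Sketch of the language-aware data modification: prepend a 25 ms,
# language-specific synthetic audio clip (LID prefix) to every utterance.
# Using sine tones with one frequency per language is our own illustrative
# choice; the method only requires short, distinct, artificially generated clips.
import numpy as np

SAMPLE_RATE = 16_000                        # audio is resampled to 16 kHz
PREFIX_SAMPLES = int(0.025 * SAMPLE_RATE)   # 25 ms = 400 samples at 16 kHz

# Hypothetical tone frequencies (Hz), one per language.
LID_FREQS = {"ces": 400, "bul": 600, "slk": 800, "srp": 1000,
             "mkd": 1200, "slv": 1400, "hsb": 1600}

def lid_prefix(lang: str) -> np.ndarray:
    """Return a 25 ms sine tone identifying `lang`."""
    t = np.arange(PREFIX_SAMPLES) / SAMPLE_RATE
    return (0.1 * np.sin(2 * np.pi * LID_FREQS[lang] * t)).astype(np.float32)

def add_lid_prefix(waveform: np.ndarray, lang: str) -> np.ndarray:
    """Prepend the language-specific LID prefix to a mono 16 kHz waveform."""
    return np.concatenate([lid_prefix(lang), waveform.astype(np.float32)])
```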
Table 1: Statistics of the Common Voice data used in our experiments.
Data split   Metric        ces     bul    slk    srp    mkd    slv   hsb   total
Train        Utterances   20,144  4,849  3,258  1,879  1,686  1,388   808  34,012
Train        Hours          26.6    7.0    3.5    1.5    2.0    1.3   1.5    43.4
Validation   Utterances    9,009  2,766  2,588  1,583  1,289  1,232   172  18,639
Validation   Hours          11.5    4.3    3.1    1.1    1.5    1.3   0.3    23.1
Test         Utterances    9,067  3,201  2,647  1,539  1,097  1,242   444  19,237
Test         Hours          11.6    4.9    3.2    1.4    1.5    1.4   0.8    24.8
In addition to performing experiments with LID
information available both in training and evaluation,
we test the hypothesis suggested by (Nowakowski
and Ptaszynski, 2023) – that language-aware training
can be helpful in multilingual ASR even in a scenario
where the input language identity is unknown at in-
ference time.
4 EXPERIMENT SETUP
4.1 Data
We use a subset of the Common Voice Corpus 17.0 (Ardila et al., 2020), obtained through the Hugging Face Datasets library (Lhoest et al., 2021) (https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0). Specifically, we fine-tune and test our models on the data in seven Slavic languages with varying resource levels: Czech (ces), Bulgarian (bul), Slovak (slk), Serbian (srp), Macedonian (mkd), Slovene (slv), and Upper Sorbian (hsb). Table 1 shows the data statistics per language. We use the official train, validation and test splits. We resample the audio data to 16 kHz. Transcriptions are preprocessed by removing punctuation and lowercasing.
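A sketch of this preparation for a single language is shown below; the exact set of punctuation characters removed is our own assumption.

```python
# Sketch of the data preparation for a single language: load through the
# Hugging Face Datasets library, resample the audio to 16 kHz, and normalize
# the transcriptions (remove punctuation, lowercase). The exact set of
# punctuation characters removed here is our own assumption.
import re
from datasets import Audio, load_dataset

cv = load_dataset("mozilla-foundation/common_voice_17_0", "hsb", split="train")

# Decode audio at 16 kHz instead of the original sampling rate.
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))

PUNCTUATION = re.compile(r"[.,?!;:\"'()\[\]„“”‘’«»]")

def normalize_text(batch):
    batch["sentence"] = PUNCTUATION.sub("", batch["sentence"]).lower()
    return batch

cv = cv.map(normalize_text)
```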
4.2 Training and Inference
We use a pretrained MMS model checkpoint (https://huggingface.co/facebook/mms-1b-l1107) as the base for our ASR models and fine-tune it by re-initializing and training only the adapter layers and the output layer, while freezing the rest of the model parameters. We train all the models for 10 epochs, using a batch size of 32. The learning rate is warmed up for the first 5% of updates to a peak of 1e-3, and linearly decayed after that. We evaluate on the validation data every 100 steps and at the end of training, and select the best checkpoint based on Character Error Rate. For the remaining hyperparameters, we follow (von Platen, 2023). Each training experiment is run three times, with a different random seed in each execution. Inference is performed without a language model. We carry out all experiments using the Hugging Face Transformers library (ver. 4.45.1) (Wolf et al., 2020).
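A rough sketch of this fine-tuning setup, following the adapter training recipe in (von Platen, 2023), is given below; the processor for the joint character vocabulary, the pooled datasets, and the CTC padding data collator are assumed to have been prepared beforehand and are not shown, and any hyperparameters not mentioned in the text are illustrative rather than a record of our exact configuration.

```python
# Rough sketch of the adapter fine-tuning setup, following the recipe in
# (von Platen, 2023): the adapter layers are re-initialized and trained together
# with the CTC output layer, while the rest of the pretrained MMS model stays
# frozen. `processor`, the pooled datasets and `data_collator` are assumed to
# have been prepared beforehand (not shown).
import numpy as np
import evaluate
from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-l1107",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),  # output layer sized to our vocabulary
    ignore_mismatched_sizes=True,
)
model.init_adapter_layers()   # re-initialize the adapter modules
model.freeze_base_model()     # freeze everything except the output layer ...
for param in model._get_adapters().values():
    param.requires_grad = True  # ... then unfreeze the adapter weights again

cer_metric = evaluate.load("cer")

def compute_metrics(pred):
    # Greedy CTC decoding of the logits, then Character Error Rate.
    pred_ids = np.argmax(pred.predictions, axis=-1)
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    return {"cer": cer_metric.compute(predictions=pred_str, references=label_str)}

training_args = TrainingArguments(
    output_dir="mms-1b-l1107-slavic-adapter",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=1e-3,
    warmup_ratio=0.05,            # warm-up over the first 5% of updates
    lr_scheduler_type="linear",   # linear decay afterwards
    eval_strategy="steps",
    eval_steps=100,               # evaluate on the validation data every 100 steps
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="cer",  # best checkpoint selected by CER
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=multilingual_train,  # pooled training set (Sections 3.1 and 4.1)
    eval_dataset=multilingual_valid,   # pooled validation set, prepared analogously
    data_collator=data_collator,       # CTC padding collator (not shown)
    compute_metrics=compute_metrics,
)
trainer.train()
```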
5 RESULTS AND DISCUSSION
Table 2 compares the results obtained by monolin-
gual baseline models and multilingual models trained
and tested with LID prefixes in the data. Language-
aware multilingual fine-tuning results in lower error
rates on 4 out of 7 languages and on average. Further-
more, it yields more stable performance: while the error rates for models fine-tuned on monolingual Serbian (srp) and Slovene (slv) data exhibit very high variance, the standard deviation for the multilingual adapters never exceeds ±1.0. On the other hand, the results on the two languages with the largest amounts of training data, namely Czech (ces) and Bulgarian (bul), are worse, which might suggest that this approach is not beneficial for high-resource languages.
Next, we examine the performance of multilin-
gual models fine-tuned with LID prefixes when ap-
plied to data without them, and compare it to base-
line multilingual models trained without LID infor-
mation. The results are shown in the upper two rows
of Table 3 (# audio prefixes = 0 and # audio
prefixes = 7, respectively). Although removing
LID information from the test data leads to a relative increase of 44% in the average Character Error Rate, the
models trained with language identifiers still perform
better on four languages and on average.
We are interested in finding out whether the strong performance of the models trained in a language-aware manner on test data without LID prefixes is due to the positive impact of providing language identity information.
Table 2: Comparison of monolingual baseline models and multilingual models trained and tested on data with language
identifiers. We report the mean and standard deviation of the Character Error Rates and Word Error Rates obtained in three
executions with different random seeds. Bold font indicates the best results for each language.
Metric   # training languages       ces        bul        slk          srp        mkd          slv        hsb    avg
CER      1                       2.2 ± .0   3.5 ± .0   5.0 ± .1  12.9 ± 15.9   3.3 ± .0  31.0 ± 48.2   7.1 ± .0   9.3
CER      7                       2.3 ± .0   3.6 ± .1   4.7 ± .1    3.0 ± .5    3.3 ± .0    3.2 ± .1    6.9 ± .0   3.9
WER      1                      11.0 ± .1  17.3 ± .1  23.2 ± .5  27.2 ± 24.5  17.1 ± .3  42.7 ± 49.0  33.8 ± .3  24.6
WER      7                      11.3 ± .0  18.2 ± .2  21.5 ± .4    9.9 ± .9   17.5 ± .3   14.5 ± .2   33.5 ± .3  18.1
Table 3: Comparison of the results on test data without LID prefixes, obtained by using multilingual models trained with (i)
the original data without audio prefixes, (ii) the data modified by prepending a language-specific (LID) audio prefix to each
training sample, and (iii) the data modified by adding the same audio prefix to all training samples. We report the mean and
standard deviation of the Character Error Rates and Word Error Rates obtained in three executions with different random
seeds. Bold font indicates the best results for each language.
Metric   # audio prefixes       ces        bul        slk         srp        mkd        slv        hsb    avg
CER      0                   2.4 ± .0   3.7 ± .1   6.1 ± .2  12.9 ± 1.9   3.8 ± .0   6.0 ± .5   7.0 ± .2   6.0
CER      7                   2.5 ± .1   3.6 ± .0   6.0 ± .4  10.3 ± 4.1   4.0 ± .3   5.9 ± .8   7.1 ± .1   5.6
CER      1                   2.4 ± .0   3.7 ± .1   6.0 ± .2  14.2 ± 2.6   3.9 ± .0   6.3 ± .3   7.7 ± .8   6.3
WER      0                  11.7 ± .1  18.4 ± .3  24.5 ± .2  22.2 ± 2.4  18.9 ± .4  17.9 ± .8  34.1 ± .6  21.1
WER      7                  12.0 ± .2  18.1 ± .1  24.1 ± .4  19.8 ± 4.9  19.3 ± .3  17.9 ± .8  34.8 ± .7  20.9
WER      1                  11.9 ± .1  18.4 ± .4  24.6 ± .4  24.3 ± 3.3  19.0 ± .2  18.8 ± .2  34.9 ± .7  21.7
Table 4: The ratios of out-of-vocabulary character counts in test predictions to the number of test samples for each language.
We report the mean and standard deviation obtained by three models trained with different random seeds. Bold font indicates
the best results for each language.
# audio prefixes        ces        bul        slk         srp        mkd        slv        hsb    avg
0                   .02 ± .00  .03 ± .00  .33 ± .07  1.39 ± .25  .15 ± .02  .79 ± .14  .01 ± .02  .39
7                   .02 ± .02  .01 ± .01  .32 ± .13  0.99 ± .53  .24 ± .17  .80 ± .25  .00 ± .00  .34
1                   .02 ± .00  .02 ± .00  .30 ± .04  1.58 ± .37  .21 ± .04  .88 ± .11  .32 ± .49  .47
An alternative possible explanation
is that the audio prefixes might be implicitly regular-
izing the model and preventing or reducing overfit-
ting to low-data languages. In order to verify this,
we carry out an additional experiment where a sin-
gle audio prefix is used for all training samples, re-
gardless of the language. The results are presented
in the bottom row of Table 3 (# audio prefixes
= 1). On all of the languages where the models trained with language-specific audio prefixes outperform the baseline, they also outperform this approach, which indicates that it is indeed the language identity information conveyed by the prefixes that contributes to their solid performance.
Additionally, we verify whether incorporating LID information during training helps mitigate the problem of language confusion in language-agnostic inference. To this end, for each language, we count the number of out-of-vocabulary characters in the test predictions, where an out-of-vocabulary character is defined as a character that does not occur in the training data for the target language, and normalize this value by dividing it by the size of the evaluation set for that language.
The results, shown in Table 4, indicate that the model
trained with language-specific audio prefixes is less
prone to language confusion than the other systems.
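For reference, this out-of-vocabulary character ratio can be computed as in the following minimal sketch; function and variable names are our own.

```python
# Minimal sketch of the out-of-vocabulary (OOV) character metric: the number of
# characters in the test predictions that never occur in the target language's
# training transcriptions, divided by the number of test samples.
def oov_character_ratio(predictions, train_transcriptions):
    """predictions: decoded test hypotheses for one language;
    train_transcriptions: reference texts of that language's training set."""
    vocab = set("".join(train_transcriptions))
    oov = sum(1 for hyp in predictions for ch in hyp if ch not in vocab)
    return oov / len(predictions)

# Example: one of two predictions contains a character ("ü") unseen in training,
# so the ratio is 1 / 2 = 0.5.
print(oov_character_ratio(["dobrý den", "süt"], ["dobrý den", "ahoj světe"]))
```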
The above results seem to corroborate the hypoth-
esis that using explicit language identity information
during multilingual training can contribute to learn-
ing a better model of speech features with robust de-
cision boundaries between languages, facilitating not
only language-aware, but also language-agnostic in-
ference. While, with the exception of Serbian, the dif-
ferences in error rates in favor of the models trained
with LID are not large, and on three out of seven lan-
guages they are slightly outperformed by the baseline
method, the observation that both approaches offer
comparable performance is already important, as it
means that a single model trained in a language-
acter that does not occur in the training data for the target
language.
Language-Aware and Language-Agnostic Multilingual Speech Recognition with a Single Model
811
aware manner – can be used both when the input lan-
guage is explicitly specified, in which case it performs
substantially better than the fully language-agnostic
counterpart, as well as in a setting where language
identity is unknown at inference time and needs to be
inferred by the model which it can do as well as
or better than a model trained without LID prefixes.
This approach can potentially allow for reductions in
the cost of developing and maintaining multilingual
ASR models.
6 LIMITATIONS
Since our system architecture is based on a pre-
trained wav2vec 2.0 model, its performance and ca-
pacity for cross-lingual transfer may be influenced by
the amount and characteristics of the pretraining data
(particularly the pretraining data in languages being
considered in our experiments). This aspect was not
analyzed in the present study.
Although our results demonstrate strong per-
formance for multilingual models, the experiments
were conducted using very small training datasets.
Whether our approach would match the performance
of dedicated monolingual models in a high-resource
scenario remains an open question.
7 CONCLUSIONS AND FUTURE
WORK
We have demonstrated the effectiveness of fine-tuning
a multilingual pretrained speech representation model
for speech recognition by training a single adapter
module jointly on labeled speech data in multiple lan-
guages (instead of adding separate adapters for each
language) and providing the information about lan-
guage identity, in the form of language-specific audio
prefixes attached to the data. Furthermore, in exper-
iments using evaluation data without language iden-
tifiers, the proposed approach yielded better overall
performance than models trained without LID pre-
fixes, suggesting that the benefits of language-aware
multilingual training can persist even when language
identity information is absent during inference.
Due to limited computational resources, in this re-
search we only focused on languages with very small
training datasets (namely, less than 50 hours of la-
beled data). In the future we will investigate whether
the observations made in our experiments also hold in
a setting with relatively large amounts of data avail-
able. Apart from that, we are planning to perform
experiments on a group of unrelated languages. We
will also investigate fine-tuning the base model (or a
subset of its layers) as an alternative to re-training the
adapter layers only.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant
Number JP22K17952.
REFERENCES
Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler,
M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M.,
and Weber, G. (2020). Common Voice: A Massively-
Multilingual Speech Corpus. In Proceedings of the
12th Conference on Language Resources and Evalua-
tion (LREC 2020), pages 4211–4215.
Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q.,
Goyal, N., Singh, K., von Platen, P., Saraf, Y., Pino,
J., Baevski, A., Conneau, A., and Auli, M. (2021).
XLS-R: Self-supervised Cross-lingual Speech Repre-
sentation Learning at Scale. arXiv, abs/2111.09296.
Baevski, A., Zhou, H., Mohamed, A., and Auli, M.
(2020). wav2vec 2.0: A Framework for Self-
Supervised Learning of Speech Representations.
ArXiv, abs/2006.11477.
Conneau, A., Baevski, A., Collobert, R., Mohamed, A., and
Auli, M. (2021). Unsupervised Cross-lingual Repre-
sentation Learning for Speech Recognition. In Inter-
speech.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V.,
Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettle-
moyer, L., and Stoyanov, V. (2020). Unsupervised
Cross-lingual Representation Learning at Scale. In
Proceedings of the 58th Annual Meeting of the As-
sociation for Computational Linguistics, pages 8440–
8451, Online. Association for Computational Linguis-
tics.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Pro-
ceedings of the 2019 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), pages 4171–4186, Min-
neapolis, Minnesota. Association for Computational
Linguistics.
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B.,
De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and
Gelly, S. (2019). Parameter-efficient transfer learning
for NLP. In Chaudhuri, K. and Salakhutdinov, R.,
editors, Proceedings of the 36th International Con-
ference on Machine Learning, volume 97 of Pro-
ceedings of Machine Learning Research, pages 2790–
2799. PMLR.
Houston, B., Sadjadi, O., Hou, Z., Vishnubhotla, S., and
Han, K. (2024). Improving multilingual asr robustness
to errors in language input. In Interspeech 2024.
Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y.,
Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Cor-
rado, G., Hughes, M., and Dean, J. (2017). Google’s
multilingual neural machine translation system: En-
abling zero-shot translation. Transactions of the Asso-
ciation for Computational Linguistics, 5:339–351.
Kannan, A., Datta, A., Sainath, T. N., Weinstein, E., Ram-
abhadran, B., Wu, Y., Bapna, A., Chen, Z., and Lee, S.
(2019). Large-scale multilingual speech recognition
with a streaming end-to-end model. In Interspeech
2019, pages 2130–2134.
Kim, S. and Seltzer, M. L. (2018). Towards language-
universal end-to-end speech recognition. In 2018
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), page 4914–4918.
IEEE Press.
Lhoest, Q., Villanova del Moral, A., Jernite, Y., Thakur,
A., von Platen, P., Patil, S., Chaumond, J., Drame,
M., Plu, J., Tunstall, L., Davison, J., Šaško, M., Chh-
ablani, G., Malik, B., Brandeis, S., Le Scao, T., Sanh,
V., Xu, C., Patry, N., McMillan-Major, A., Schmid,
P., Gugger, S., Delangue, C., Matussière, T., Debut,
L., Bekman, S., Cistac, P., Goehringer, T., Mustar, V.,
Lagunas, F., Rush, A., and Wolf, T. (2021). Datasets:
A community library for natural language processing.
In Proceedings of the 2021 Conference on Empiri-
cal Methods in Natural Language Processing: System
Demonstrations, pages 175–184, Online and Punta
Cana, Dominican Republic. Association for Compu-
tational Linguistics.
Li, B., Sainath, T. N., Sim, K. C., Bacchiani, M., Wein-
stein, E., Nguyen, P., Chen, Z., Wu, Y., and Rao, K.
(2018). Multi-dialect speech recognition with a sin-
gle sequence-to-sequence model. In 2018 IEEE Inter-
national Conference on Acoustics, Speech and Signal
Processing (ICASSP), page 4749–4753. IEEE Press.
Liu, Y., Yang, X., and Qu, D. (2024). Exploration of
whisper fine-tuning strategies for low-resource asr.
EURASIP J. Audio Speech Music Process., 2024(1).
Nowakowski, K. and Ptaszynski, M. (2023). Improving
low-resource speech recognition through multilingual
fine-tuning with language identifiers and self-training.
In Wu, J.-L. and Su, M.-H., editors, Proceedings of the
35th Conference on Computational Linguistics and
Speech Processing (ROCLING 2023), pages 63–70,
Taipei City, Taiwan. The Association for Computa-
tional Linguistics and Chinese Language Processing
(ACLCLP).
Nowakowski, K., Ptaszynski, M., Murasaki, K., and
Nieuważny, J. (2023). Adapting multilingual speech
representation model for a new, underresourced lan-
guage through multilingual fine-tuning and continued
pretraining. Information Processing & Management,
60(2):103148.
Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A.,
Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-
Zarandi, M., Baevski, A., Adi, Y., Zhang, X., Hsu,
W.-N., Conneau, A., and Auli, M. (2023). Scaling
speech technology to 1,000+ languages.
Rebuffi, S.-A., Bilen, H., and Vedaldi, A. (2017). Learn-
ing multiple visual domains with residual adapters. In
Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H.,
Fergus, R., Vishwanathan, S., and Garnett, R., editors,
Advances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc.
Shen, Z., Guo, W., and Gu, B. (2023). Language-universal
adapter learning with knowledge distillation for end-
to-end multilingual speech recognition.
Toshniwal, S., Sainath, T. N., Weiss, R. J., Li, B., Moreno,
P., Weinstein, E., and Rao, K. (2018). Multilingual
speech recognition with a single end-to-end model.
In 2018 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 4904–
4908.
von Platen, P. (2023). Fine-tuning MMS adapter models for multi-lingual ASR. Online: https://huggingface.co/blog/mms_adapters.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue,
C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtow-
icz, M., Davison, J., Shleifer, S., von Platen, P., Ma,
C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger,
S., Drame, M., Lhoest, Q., and Rush, A. M. (2020).
Transformers: State-of-the-art natural language pro-
cessing. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing:
System Demonstrations, pages 38–45, Online. Asso-
ciation for Computational Linguistics.
Zhang, C., Li, B., Sainath, T., Strohman, T., Mavandadi,
S., Chang, S.-Y., and Haghani, P. (2022). Streaming
end-to-end multilingual speech recognition with joint
language identification. In Interspeech 2022, pages
3223–3227.
Zhou, L., Li, J., Sun, E., and Liu, S. (2022). A config-
urable multilingual model is all you need to recognize
all languages. In ICASSP 2022 - 2022 IEEE Inter-
national Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 6422–6426.
Zhu, Y., Haghani, P., Tripathi, A., Ramabhadran, B., Far-
ris, B., Xu, H., Lu, H., Sak, H., Leal, I., Gaur, N.,
Moreno, P. J., and Zhang, Q. (2020). Multilingual
speech recognition with self-attention structured pa-
rameterization. In Interspeech.