
Houston, B., Sadjadi, O., Hou, Z., Vishnubhotla, S., and Han, K. (2024). Improving multilingual ASR robustness to errors in language input. In Interspeech 2024.
Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. (2017). Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
Kannan, A., Datta, A., Sainath, T. N., Weinstein, E., Ramabhadran, B., Wu, Y., Bapna, A., Chen, Z., and Lee, S. (2019). Large-scale multilingual speech recognition with a streaming end-to-end model. In Interspeech 2019, pages 2130–2134.
Kim, S. and Seltzer, M. L. (2018). Towards language-universal end-to-end speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4914–4918. IEEE Press.
Lhoest, Q., Villanova del Moral, A., Jernite, Y., Thakur, A., von Platen, P., Patil, S., Chaumond, J., Drame, M., Plu, J., Tunstall, L., Davison, J., Šaško, M., Chhablani, G., Malik, B., Brandeis, S., Le Scao, T., Sanh, V., Xu, C., Patry, N., McMillan-Major, A., Schmid, P., Gugger, S., Delangue, C., Matussière, T., Debut, L., Bekman, S., Cistac, P., Goehringer, T., Mustar, V., Lagunas, F., Rush, A., and Wolf, T. (2021). Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Li, B., Sainath, T. N., Sim, K. C., Bacchiani, M., Weinstein, E., Nguyen, P., Chen, Z., Wu, Y., and Rao, K. (2018). Multi-dialect speech recognition with a single sequence-to-sequence model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4749–4753. IEEE Press.
Liu, Y., Yang, X., and Qu, D. (2024). Exploration of Whisper fine-tuning strategies for low-resource ASR. EURASIP Journal on Audio, Speech, and Music Processing, 2024(1).
Nowakowski, K. and Ptaszynski, M. (2023). Improving low-resource speech recognition through multilingual fine-tuning with language identifiers and self-training. In Wu, J.-L. and Su, M.-H., editors, Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023), pages 63–70, Taipei City, Taiwan. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP).
Nowakowski, K., Ptaszynski, M., Murasaki, K., and Nieuważny, J. (2023). Adapting multilingual speech representation model for a new, underresourced language through multilingual fine-tuning and continued pretraining. Information Processing & Management, 60(2):103148.
Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., Baevski, A., Adi, Y., Zhang, X., Hsu, W.-N., Conneau, A., and Auli, M. (2023). Scaling speech technology to 1,000+ languages.
Rebuffi, S.-A., Bilen, H., and Vedaldi, A. (2017). Learning multiple visual domains with residual adapters. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Shen, Z., Guo, W., and Gu, B. (2023). Language-universal adapter learning with knowledge distillation for end-to-end multilingual speech recognition.
Toshniwal, S., Sainath, T. N., Weiss, R. J., Li, B., Moreno, P., Weinstein, E., and Rao, K. (2018). Multilingual speech recognition with a single end-to-end model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4904–4908.
von Platen, P. (2023). Fine-tuning MMS adapter models for multi-lingual ASR. Online: https://huggingface.co/blog/mms_adapters.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Zhang, C., Li, B., Sainath, T., Strohman, T., Mavandadi, S., Chang, S.-Y., and Haghani, P. (2022). Streaming end-to-end multilingual speech recognition with joint language identification. In Interspeech 2022, pages 3223–3227.
Zhou, L., Li, J., Sun, E., and Liu, S. (2022). A configurable multilingual model is all you need to recognize all languages. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6422–6426.
Zhu, Y., Haghani, P., Tripathi, A., Ramabhadran, B., Farris, B., Xu, H., Lu, H., Sak, H., Leal, I., Gaur, N., Moreno, P. J., and Zhang, Q. (2020). Multilingual speech recognition with self-attention structured parameterization. In Interspeech 2020.
Language-Aware and Language-Agnostic Multilingual Speech Recognition with a Single Model