Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod,
V., Dalmia, S., Riesa, J., Rivera, C., and Bapna, A.
(2023). Fleurs: Few-shot learning evaluation of uni-
versal representations of speech. In 2022 IEEE Spoken
Language Technology Workshop (SLT), pages 798–
805. IEEE.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Pro-
ceedings of the 2019 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), pages 4171–4186, Min-
neapolis, Minnesota. Association for Computational
Linguistics.
Dündar, E. B., Kiliç, O. F., Cekiç, T., Manav, Y., and Deniz,
O. (2020). Large scale intent detection in Turkish short
sentences with contextual word embeddings. In KDIR,
pages 187–192.
Heafield, K. (2011). KenLM: Faster and smaller language
model queries. In Callison-Burch, C., Koehn, P.,
Monz, C., and Zaidan, O. F., editors, Proceedings of
the Sixth Workshop on Statistical Machine Transla-
tion, pages 187–197, Edinburgh, Scotland. Associa-
tion for Computational Linguistics.
Kim, J., Kong, J., and Son, J. (2021). Conditional varia-
tional autoencoder with adversarial learning for end-
to-end text-to-speech. In International Conference on
Machine Learning, pages 5530–5540. PMLR.
Kong, J., Kim, J., and Bae, J. (2020). Hifi-gan: Genera-
tive adversarial networks for efficient and high fidelity
speech synthesis. Advances in neural information pro-
cessing systems, 33:17022–17033.
Luo, R., Tan, X., Wang, R., Qin, T., Li, J., Zhao, S., Chen,
E., and Liu, T.-Y. (2021). Lightspeech: Lightweight
and fast text to speech with neural architecture search.
In ICASSP 2021-2021 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), pages 5699–5703. IEEE.
Ma, R., Wu, X., Qiu, J., Qin, Y., Xu, H., Wu, P., and Ma,
Z. (2023). Internal language model estimation based
adaptive language model fusion for domain adapta-
tion.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., and
Sonderegger, M. (2017). Montreal forced aligner:
Trainable text-speech alignment using kaldi. In In-
terspeech, volume 2017, pages 498–502.
Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A.,
Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-
Zarandi, M., et al. (2024). Scaling speech technology
to 1,000+ languages. Journal of Machine Learning
Research, 25(97):1–52.
Qin, Z., Zhao, W., Yu, X., and Sun, X. (2023). Open-
voice: Versatile instant voice cloning. arXiv preprint
arXiv:2312.01479.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey,
C., and Sutskever, I. (2022). Robust speech recogni-
tion via large-scale weak supervision.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey,
C., and Sutskever, I. (2023). Robust speech recogni-
tion via large-scale weak supervision. In International
conference on machine learning, pages 28492–28518.
PMLR.
Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z.,
and Liu, T.-Y. (2020). Fastspeech 2: Fast and high-
quality end-to-end text to speech. arXiv preprint
arXiv:2006.04558.
Schweter, S. (2020). BERTurk - BERT models for Turkish.
Stepanov, I. and Shtopko, M. (2024). Gliner multi-task:
Generalist lightweight model for various information
extraction tasks. arXiv preprint arXiv:2406.12925.
Yamamoto, R., Song, E., and Kim, J.-M. (2020). Parallel
wavegan: A fast waveform generation model based on
generative adversarial networks with multi-resolution
spectrogram. In ICASSP 2020-2020 IEEE Interna-
tional Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 6199–6203. IEEE.
Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia,
Y., Chen, Z., and Wu, Y. (2019). Libritts: A cor-
pus derived from librispeech for text-to-speech. arXiv
preprint arXiv:1904.02882.
An End-to-End Generative System for Smart Travel Assistant