
system in diverse scenarios of varying language complexity.
In addition, a combination of ASR algorithms and generative AI
will be trained and evaluated on unstructured speech data from
doctor-patient interactions. Moreover, data collection will be
conducted to meet the need for linguistic resources, covering
medical terminologies and their Bangla synonyms.
HEALTHINF 2024 - 17th International Conference on Health Informatics