HuggingFace (2023). Hugging Face Asteroid. https://huggingface.co/models?library=asteroid.
Jain, R., Barcovschi, A., Yiwere, M., Corcoran, P., and Cucu, H. (2023). Adaptation of Whisper models to child speech recognition. arXiv preprint arXiv:2307.13008.
Luo, Y., Chen, Z., and Yoshioka, T. (2020). Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 46–50. IEEE.
Luo, Y. and Mesgarani, N. (2019). Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266.
Macháček, D., Dabre, R., and Bojar, O. (2023). Turning Whisper into real-time transcription system. arXiv preprint arXiv:2307.14743.
Mao, R., Chen, G., Zhang, X., Guerin, F., and Cambria, E. (2023). GPTEval: A survey on assessments of ChatGPT and GPT-4. arXiv preprint arXiv:2308.12488.
Mozilla (2023). RNNoise: Learning Noise Suppression. https://jmvalin.ca/demo/rnnoise/.
Mul, A. (2023). Enhancing Dutch audio transcription through integration of speaker diarization into the automatic speech recognition model Whisper. Master's thesis.
Nacimiento-García, E., González-González, C. S., and Gutiérrez-Vela, F. L. (2021). Automatic captions on video calls, a must for the elderly: Using Mozilla DeepSpeech for the STT. In Proceedings of the XXI International Conference on Human Computer Interaction, pages 1–7.
Pariente, M., Cornell, S., Cosentino, J., Sivasankaran, S., Tzinis, E., Heitkaemper, J., Olvera, M., Stöter, F.-R., Hu, M., Martín-Doñas, J. M., et al. (2020). Asteroid: The PyTorch-based audio source separation toolkit for researchers. arXiv preprint arXiv:2005.04132.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR.
Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., et al. (2021). SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624.
Rebai, I., Benhamiche, S., Thompson, K., Sellami, Z., Laine, D., and Lorré, J.-P. (2020). LinTO platform: A smart open voice assistant for business environments. In Proceedings of the 1st International Workshop on Language Technology Platforms, pages 89–95.
RNNoise (2023). GitHub RNNoise. https://github.com/xiph/rnnoise.
Spiller, T. R., Ben-Zion, Z., Korem, N., Harpaz-Rotem, I., and Duek, O. (2023). Efficient and accurate transcription in mental health research - a tutorial on using Whisper AI for sound file transcription.
Suznjevic, M. and Saldana, J. (2016). Delay limits for real-time services. IETF draft.
Trabelsi, A., Warichet, S., Aajaoun, Y., and Soussilane, S. (2022). Evaluation of the efficiency of state-of-the-art speech recognition engines. Procedia Computer Science, 207:2242–2252.
Union, I. T. (2016). Mean opinion score interpretation and reporting. Standard, International Telecommunication Union, Geneva, CH.
Valin, J.-M. (2018). A hybrid DSP/deep learning approach to real-time full-band speech enhancement. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), pages 1–5. IEEE.
Vaseghi, S. V. (2008). Advanced Digital Signal Processing and Noise Reduction. John Wiley & Sons.
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence