
tures and spectro-temporal features for speaker recognition. Phonetics and Speech Sciences, 7(1):3–10.
Corretge, R. (2022). Praat Vocal Toolkit. Available at: http://www.praatvocaltoolkit.com.
Davis, S. and Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process., 28(4):357–366.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and Ouellet, P. (2010). Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process., 19(4):788–798.
Gay, T., Ushijima, T., Hirose, H., and Cooper, F. S. (1974). Effect of speaking rate on labial consonant-vowel articulation. Journal of Phonetics, 2(1):47–63.
Gold, E., French, P., and Harrison, P. (2013). Examining long-term formant distributions as a discriminant in forensic speaker comparisons under a likelihood ratio framework. In Proc. Meet. Acoust., volume 19. AIP Publishing.
Hansen, J. H. and Hasan, T. (2015). Speaker recognition by machines and humans: A tutorial review. IEEE Signal Process. Mag., 32(6):74–99.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82–97.
Imaizumi, S. and Kiritani, S. (1989). Effect of speaking rate on formant trajectories and inter-speaker variations. Ann. Bull. RILP, 23:27–37.
Jahangir, R., Teh, Y. W., Memon, N. A., Mujtaba, G., Zareei, M., Ishtiaq, U., Akhtar, M. Z., and Ali, I. (2020). Text-independent speaker identification through feature fusion and deep neural network. IEEE Access, 8:32187–32202.
Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2015). Audio augmentation for speech recognition. In Interspeech 2015, page 3586.
Koo, H., Jeong, S., Yoon, S., and Kim, W. (2020). Development of speech emotion recognition algorithm using MFCC and prosody. In ICEIC, pages 1–4.
Liu, X., Sahidullah, M., and Kinnunen, T. (2021). Learnable MFCCs for speaker verification. In ISCAS, pages 1–5. IEEE.
Lounnas, K., Lichouri, M., and Abbas, M. (2022). Analysis of the effect of audio data augmentation techniques on phone digit recognition for Algerian Arabic dialect. In ICAASE, pages 1–5.
McDougall, K. (2006). Dynamic features of speech and the characterization of speakers: Toward a new approach using formant frequencies. Int. J. Speech Lang. Law, 13(1):89–126.
McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., and Nieto, O. (2015). librosa: Audio and music signal analysis in Python. In SciPy, pages 18–24.
Mefferd, A. S. and Green, J. R. (2010). Articulatory-to-acoustic relations in response to speaking rate and loudness manipulations. J. Speech Lang. Hear. Res., 53:1206–1219.
Messaoud, Z. B. and Hamida, A. (2011). Combining formant frequency based on variable order LPC coding with acoustic features for TIMIT phone recognition. Int. J. Speech Technol., 14:393.
Nath, D. and Kalita, S. (2015). Composite feature selection method based on spoken word and speaker recognition. Int. J. Comput. Appl., 121:18–23.
Nolan, F. (1983). The Phonetic Bases of Speaker Recognition. Cambridge Studies in Speech Science and Communication. Cambridge University Press, Cambridge. 221 pp. ISBN 0-521-24486-2.
Nugroho, K., Noersasongko, E., Purwanto, Muljono, and Setiadi, D. (2021). Enhanced Indonesian ethnic speaker recognition using data augmentation deep neural network. J. King Saud Univ. Comput. Inf. Sci., 34:4375–4384.
Reynolds, D. A., Quatieri, T. F., and Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3):19–41.
Rose, P. (2002). Forensic Speaker Identification. International Forensic Science and Investigation. Taylor & Francis.
Shahrebabaki, A. S., Imran, A. S., Olfati, N., and Svendsen, T. (2018). Acoustic feature comparison for different speaking rates. In Proc. Human-Computer Interaction (HCI), pages 176–189. Springer.
Shaiman, S., Adams, S. G., and Kimelman, M. D. (1997). Velocity profiles of lip protrusion across changes in speaking rate. J. Speech Lang. Hear. Res., 40(1):144–158.
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In ICASSP 2018, pages 5329–5333. IEEE.
Tilsen, S. (2014). Selection and coordination of articulatory gestures in temporally constrained production. Journal of Phonetics, 44:26–46.
Trottier, L., Chaib-draa, B., and Giguère, P. (2015). Temporal feature selection for noisy speech recognition. In Proc. Can. Conf. Artif. Intell., pages 155–166.
Tuller, B., Harris, K. S., and Kelso, J. S. (1982). Stress and rate: Differential transformations of articulation. J. Acoust. Soc. Am., 71(6):1534–1543.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.
Weismer, G. and Berry, J. (2003). Effects of speaking rate on second formant trajectories of selected vocalic nuclei. J. Acoust. Soc. Am., 113(6):3362–3378.
Xie, W., Nagrani, A., Chung, J. S., and Zisserman, A. (2019). Utterance-level aggregation for speaker recognition in the wild. In ICASSP 2019, pages 5791–5795. IEEE.
Zeng, X., Yin, S., and Wang, D. (2015). Learning speech rate in speech recognition. arXiv preprint arXiv:1506.00799.
Zhou, Y., Xiong, C., and Socher, R. (2017). Improved regularization techniques for end-to-end speech recognition. arXiv preprint arXiv:1712.07108.