
significantly enhances the robustness of the system,
which is particularly important in low-resource lan-
guages.
Overall, our findings highlight the potential of a
data-centric approach to overcome the challenges in-
herent in low-resource speaker recognition, paving
the way for more efficient and effective systems in
this domain. Future work may explore further optimizations
of the enrollment process and investigate whether additional
techniques, such as data augmentation, can improve performance
in low-resource settings.
REFERENCES
Aronowitz, H. (2014). Inter dataset variability compen-
sation for speaker recognition. In 2014 IEEE Inter-
national Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 4002–4006.
Chung, J. S., Huh, J., and Mun, S. (2019). Delving into
voxceleb: environment invariant speaker recognition.
arXiv preprint arXiv:1910.11238.
Chung, J. S., Huh, J., Mun, S., Lee, M., Heo, H. S., Choe,
S., Ham, C., Jung, S., Lee, B.-J., and Han, I. (2020).
In defence of metric learning for speaker recognition.
arXiv preprint arXiv:2003.11982.
Dawalatabad, N., Ravanelli, M., Grondin, F., Thienpondt,
J., Desplanques, B., and Na, H. (2021). Ecapa-tdnn
embeddings for speaker diarization. arXiv preprint
arXiv:2104.01466.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and
Ouellet, P. (2010). Front-end factor analysis for
speaker verification. IEEE Transactions on Audio,
Speech, and Language Processing, 19(4):788–798.
Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019). Ar-
cface: Additive angular margin loss for deep face
recognition. In Proceedings of the IEEE/CVF con-
ference on computer vision and pattern recognition,
pages 4690–4699.
Glembek, O., Ma, J., Matějka, P., Zhang, B., Plchot, O.,
Bürget, L., and Matsoukas, S. (2014). Domain adaptation
via within-class covariance correction in i-vector
based speaker recognition systems. In 2014 IEEE In-
ternational Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP), pages 4032–4036.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Jung, J.-w., Kim, Y. J., Heo, H.-S., Lee, B.-J., Kwon,
Y., and Chung, J. S. (2022). Pushing the limits of
raw waveform speaker recognition. arXiv preprint
arXiv:2203.08488.
Kimball, O., Schmidt, M., Gish, H., and Waterman, J.
(1997). Speaker verification with limited enrollment
data. In Eurospeech, pages 967–970.
Kurita, T. (2019). Principal component analysis (pca).
Computer vision: a reference guide, pages 1–4.
Li, J., Zhang, K., Wang, S., Li, H., Mak, M.-W., and Lee,
K. A. (2024). On the effectiveness of enrollment
speech augmentation for target speaker extraction.
Li, L., Wang, D., Kang, J., Wang, R., Wu, J., Gao, Z., and
Chen, X. (2022). A principle solution for enroll-test
mismatch in speaker recognition. IEEE/ACM Trans-
actions on Audio, Speech, and Language Processing,
30:443–455.
Mak, M.-W., Hsiao, R., and Mak, B. (2006). A comparison
of various adaptation methods for speaker verification
with limited enrollment data. In 2006 IEEE Inter-
national Conference on Acoustics Speech and Signal
Processing Proceedings, volume 1, pages I–I.
Mingote, V., Miguel, A., Giménez, A. O., and Lleida, E.
(2020). Training speaker enrollment models by net-
work optimization. In INTERSPEECH, pages 3810–
3814.
Mohd Hanifa, R., Isa, K., and Mohamad, S. (2021).
A review on speaker recognition: Technology and
challenges. Computers & Electrical Engineering,
90:107005.
Ngo, M.-N. and Le, L.-Q. (2024). Evaluation of command
and speaker recognition on vietnamese voice dataset
to enhance security. In 2024 International Confer-
ence on Multimedia Analysis and Pattern Recognition
(MAPR), pages 1–6.
Peddinti, V., Povey, D., and Khudanpur, S. (2015). A time
delay neural network architecture for efficient model-
ing of long temporal contexts. In Interspeech, pages
3214–3218.
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and
Khudanpur, S. (2018). X-vectors: Robust dnn embed-
dings for speaker recognition. In 2018 IEEE inter-
national conference on acoustics, speech and signal
processing (ICASSP), pages 5329–5333. IEEE.
Thanh, P. V., Hoa, N. X. T., Vu, H. L., and Trang, N. T. T.
(2023). Vietnam-celeb: a large-scale dataset for viet-
namese speaker recognition.
Wang, Q., Rao, W., Sun, S., Xie, L., Chng, E. S., and Li, H.
(2018). Unsupervised domain adaptation via domain
adversarial training for speaker recognition. In 2018
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 4889–4893.
Zeinali, H., Wang, S., Silnova, A., Matějka, P., and Plchot,
O. (2019). But system description to voxceleb
speaker recognition challenge 2019. arXiv preprint
arXiv:1910.12592.
Data-Centric Optimization of Enrollment Selection in Speaker Identification