Unsupervised Data-driven Hidden Markov Modeling for Text-dependent Speaker Verification

Dijana Petrovska-Delacrétaz, Houssemeddine Khemiri

Abstract

We present a text-dependent speaker verification system based on unsupervised, data-driven Hidden Markov Models (HMMs) that takes the temporal information of speech into account. The originality of our proposal is that the HMMs are trained on raw speech only, without transcriptions, and provide a pseudo-phonetic segmentation of the speech data. The proposed system consists of the following steps. First, generic unsupervised HMMs are trained. Then the enrollment speech of each target speaker is segmented with the generic models and further processed to obtain speaker- and text-adapted HMMs that represent that speaker. During the test phase, to verify the claimed identity of the speaker, the test speech is segmented with both the generic and the speaker-dependent HMMs. Finally, two approaches, based on log-likelihood ratio and concurrent scoring, are proposed to compute the score between the test utterance and the speaker's model. The system is evaluated on Part 1 of the RSR2015 database, using the Equal Error Rate (EER) on the development set and the Half Total Error Rate (HTER) on the evaluation set. It achieves an average EER of 1.29% on the development set and an average HTER of 1.32% on the evaluation set.
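The scoring and evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the frame-normalised form of the log-likelihood ratio, the function names, and the toy scores are all assumptions made for illustration. EER and HTER follow their usual definitions: the EER is the error rate at the threshold where false acceptance and false rejection rates meet, and the HTER is the mean of the two rates at a threshold fixed beforehand (here, the one found on the development scores).

```python
from typing import Sequence, Tuple


def llr_score(loglik_speaker: float, loglik_generic: float, n_frames: int) -> float:
    """Frame-normalised log-likelihood ratio between the speaker-dependent
    and the generic models (assumed form, not taken from the paper)."""
    return (loglik_speaker - loglik_generic) / n_frames


def error_rates(genuine: Sequence[float], impostor: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) when a trial is accepted if score >= threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def eer_threshold(genuine: Sequence[float],
                  impostor: Sequence[float]) -> Tuple[float, float]:
    """Scan the observed scores as candidate thresholds and return
    (threshold, EER) at the point where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, (far + frr) / 2)
    return best[1], best[2]


def hter(genuine: Sequence[float], impostor: Sequence[float],
         threshold: float) -> float:
    """Half Total Error Rate: mean of FAR and FRR at a fixed threshold."""
    far, frr = error_rates(genuine, impostor, threshold)
    return (far + frr) / 2


# Toy scores, invented for illustration only.
dev_genuine = [2.0, 1.5, 0.5, 1.0]
dev_impostor = [-1.0, 0.0, 0.7, -0.5]
eval_genuine = [1.2, 0.6, 0.9, 1.8]
eval_impostor = [-0.2, 0.8, 0.1, -0.7]

# The threshold is tuned on the development set, then reused on the evaluation set.
threshold, dev_eer = eer_threshold(dev_genuine, dev_impostor)
eval_hter = hter(eval_genuine, eval_impostor, threshold)
```

Fixing the threshold on the development set before scoring the evaluation set is what separates the HTER from the EER: the evaluation figure then reflects a decision point chosen without access to the evaluation trials.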



Paper Citation


in Harvard Style

Petrovska-Delacrétaz D. and Khemiri H. (2017). Unsupervised Data-driven Hidden Markov Modeling for Text-dependent Speaker Verification. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-222-6, pages 199-207. DOI: 10.5220/0006202001990207


in Bibtex Style

@conference{icpram17,
author={Dijana Petrovska-Delacrétaz and Houssemeddine Khemiri},
title={Unsupervised Data-driven Hidden Markov Modeling for Text-dependent Speaker Verification},
booktitle={Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2017},
pages={199-207},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006202001990207},
isbn={978-989-758-222-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Unsupervised Data-driven Hidden Markov Modeling for Text-dependent Speaker Verification
SN - 978-989-758-222-6
AU - Petrovska-Delacrétaz D.
AU - Khemiri H.
PY - 2017
SP - 199
EP - 207
DO - 10.5220/0006202001990207