Speaker Identification with Short Sequences of Speech Frames

Giorgio Biagetti, Paolo Crippa, Alessandro Curzi, Simone Orcioni, Claudio Turchetti

Abstract

In biometric person identification systems, speaker identification plays a crucial role, as the voice is the most natural signal to produce and the simplest to acquire. Mel-frequency cepstral coefficients (MFCCs) have been widely adopted for decades in speech processing to capture speech-specific characteristics with reduced dimensionality. However, although their ability to de-correlate the vocal source and the vocal tract filter makes them suitable for speech recognition, they show some drawbacks in speaker recognition. This paper presents an experimental evaluation showing that reducing the dimension of the features with the discrete Karhunen-Loève transform (DKLT) gives better performance than conventional MFCC features. In particular, with short sequences of speech frames, that is, with utterance durations of less than 1 s, the performance of the truncated DKLT representation is always better than that of MFCC.
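The truncated DKLT mentioned in the abstract can be illustrated with a minimal sketch. It assumes each speech frame is already a fixed-length feature vector stored as a row of a matrix; the function names (dklt_basis, dklt_project), the synthetic data, and the choice of 5 retained components are illustrative assumptions, not the authors' implementation or experimental setup.

```python
import numpy as np

def dklt_basis(frames, n_components):
    """Estimate a truncated DKLT basis from feature frames (one frame per row)."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    # Sample covariance matrix of the features (dim x dim).
    cov = centered.T @ centered / (len(frames) - 1)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order], eigvals[order]

def dklt_project(frames, mean, basis):
    """Project frames onto the truncated basis, giving decorrelated coefficients."""
    return (frames - mean) @ basis

# Synthetic stand-in for per-frame features (assumption: 200 frames, 40 dims).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 40))
frames = latent @ mixing + 0.01 * rng.normal(size=(200, 40))

mean, basis, eigvals = dklt_basis(frames, n_components=5)
reduced = dklt_project(frames, mean, basis)
```

The rows of reduced are the lower-dimensional features; in a speaker identification pipeline they would take the place of the MFCC vectors passed to the classifier.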

Paper Citation


in Harvard Style

Biagetti G., Crippa P., Curzi A., Orcioni S. and Turchetti C. (2015). Speaker Identification with Short Sequences of Speech Frames. In Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM, ISBN 978-989-758-077-2, pages 178-185. DOI: 10.5220/0005191701780185


in Bibtex Style

@conference{icpram15,
author={Giorgio Biagetti and Paolo Crippa and Alessandro Curzi and Simone Orcioni and Claudio Turchetti},
title={Speaker Identification with Short Sequences of Speech Frames},
booktitle={Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM},
year={2015},
pages={178-185},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005191701780185},
isbn={978-989-758-077-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM
TI - Speaker Identification with Short Sequences of Speech Frames
SN - 978-989-758-077-2
AU - Biagetti G.
AU - Crippa P.
AU - Curzi A.
AU - Orcioni S.
AU - Turchetti C.
PY - 2015
SP - 178
EP - 185
DO - 10.5220/0005191701780185