Experiments on Adaptation Methods to Improve Acoustic Modeling for French Speech Recognition

Saeideh Mirzaei, Pierrick Milhorat, Jérôme Boudy, Gérard Chollet, Mikko Kurimo

2016

Abstract

To improve the performance of Automatic Speech Recognition (ASR) systems, the models must be retrained to better match the speaker’s voice characteristics, the environmental and channel conditions, or the context of the task. In this project we focus on the mismatch between the acoustic features used to train the model and the vocal characteristics of the end user of the system. To overcome this mismatch, we use speaker adaptation techniques. Constrained Maximum Likelihood Linear Regression (cMLLR) model adaptation yields a significant performance improvement, while linear Vocal Tract Length Normalization (lVTLN) provides fast adaptation. We achieve a relative gain of approximately 9.44% in word error rate with unsupervised cMLLR adaptation. We also compare our ASR system with the Google ASR and show that, with adaptation, our system outperforms it.
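
The abstract reports its result as a relative gain in word error rate (WER). As a minimal sketch of how such a figure is typically computed, the Python below derives WER from a word-level edit distance and then the relative gain between a baseline and an adapted system. The function names and the example sentences and rates are hypothetical illustrations, not the authors' code or data, and the paper's exact scoring setup is not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_gain(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction by which the adapted system lowers the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution in a three-word reference.
    print(word_error_rate("le chat dort", "le chien dort"))
    # Hypothetical rates, chosen only to illustrate the arithmetic.
    print(relative_wer_gain(0.20, 0.18))
```

A relative gain is measured against the baseline error rate, so it differs from the absolute difference between the two rates.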

Paper Citation


in Harvard Style

Mirzaei S., Milhorat P., Boudy J., Chollet G. and Kurimo M. (2016). Experiments on Adaptation Methods to Improve Acoustic Modeling for French Speech Recognition. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-173-1, pages 278-282. DOI: 10.5220/0005703702780282


in Bibtex Style

@conference{icpram16,
author={Saeideh Mirzaei and Pierrick Milhorat and Jérôme Boudy and Gérard Chollet and Mikko Kurimo},
title={Experiments on Adaptation Methods to Improve Acoustic Modeling for French Speech Recognition},
booktitle={Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2016},
pages={278-282},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005703702780282},
isbn={978-989-758-173-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Experiments on Adaptation Methods to Improve Acoustic Modeling for French Speech Recognition
SN - 978-989-758-173-1
AU - Mirzaei S.
AU - Milhorat P.
AU - Boudy J.
AU - Chollet G.
AU - Kurimo M.
PY - 2016
SP - 278
EP - 282
DO - 10.5220/0005703702780282