ACCURACY OF MP3 SPEECH RECOGNITION UNDER REAL-WORD CONDITIONS - Experimental Study

Petr Pollak, Martin Behunek

2011

Abstract

This paper presents a study of speech recognition accuracy at different levels of MP3 compression. Special attention is paid to speech signals of varying quality, i.e. with different levels of background noise and channel distortion. The work was motivated by the possible use of ASR for off-line automatic transcription of audio recordings collected by standard, widespread MP3 devices. The experiments showed that, although the MP3 format is not optimal for speech compression, it does not distort speech significantly, especially at high or moderate bit rates and with high-quality source data. Consequently, the accuracy of connected-digit ASR decreased only slowly as the bit rate was lowered to 24 kbps. In the best case, PLP parameterization in the close-talk channel, recognition accuracy decreased by just 3% while the compressed file was approximately 10% of the original size. All results were slightly worse in the presence of additive background noise and channel distortion, but the achieved accuracy remained acceptable in this case as well, especially for PLP features.
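
The "approximately 10% of the original size" figure can be checked with simple bit-rate arithmetic. The sketch below is only illustrative: it assumes the uncompressed source is 16 kHz, 16-bit, mono PCM, which the abstract does not state, and it treats the MP3 stream as constant bit rate while ignoring container and header overhead. The function name and its parameters are hypothetical.

```python
def compression_ratio(mp3_kbps: float,
                      sample_rate_hz: int = 16000,  # assumed source sampling rate
                      bits_per_sample: int = 16,    # assumed source sample width
                      channels: int = 1) -> float:  # assumed mono recording
    """Return compressed size as a fraction of the uncompressed PCM size."""
    # Bit rate of the uncompressed PCM signal in kbps.
    pcm_kbps = sample_rate_hz * bits_per_sample * channels / 1000.0
    # For constant-bit-rate audio, the size ratio equals the bit-rate ratio.
    return mp3_kbps / pcm_kbps


# Ratio at the 24 kbps bit rate mentioned in the abstract.
ratio_24 = compression_ratio(24)
print(f"24 kbps MP3 is about {ratio_24:.1%} of the assumed PCM size")
```

Under these assumptions the PCM bit rate is 256 kbps, so 24 kbps corresponds to a little under one tenth of the original size, in line with the figure quoted above. A different source format would change the ratio accordingly.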

Paper Citation


in Harvard Style

Pollak P. and Behunek M. (2011). ACCURACY OF MP3 SPEECH RECOGNITION UNDER REAL-WORD CONDITIONS - Experimental Study. In Proceedings of the International Conference on Signal Processing and Multimedia Applications - Volume 1: SIGMAP, (ICETE 2011) ISBN 978-989-8425-72-0, pages 5-10. DOI: 10.5220/0003512600050010


in Bibtex Style

@conference{sigmap11,
author={Petr Pollak and Martin Behunek},
title={ACCURACY OF MP3 SPEECH RECOGNITION UNDER REAL-WORD CONDITIONS - Experimental Study},
booktitle={Proceedings of the International Conference on Signal Processing and Multimedia Applications - Volume 1: SIGMAP, (ICETE 2011)},
year={2011},
pages={5-10},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003512600050010},
isbn={978-989-8425-72-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Signal Processing and Multimedia Applications - Volume 1: SIGMAP, (ICETE 2011)
TI - ACCURACY OF MP3 SPEECH RECOGNITION UNDER REAL-WORD CONDITIONS - Experimental Study
SN - 978-989-8425-72-0
AU - Pollak P.
AU - Behunek M.
PY - 2011
SP - 5
EP - 10
DO - 10.5220/0003512600050010

