ognized. It is affected by perceptual masking
in the MP3 compression scheme and by the reduced
frequency resolution of the short-time Fourier analysis
used in the computation of MFCC and PLP features.
• Generally, the loss of accuracy is very small for
bit rates down to 24 kbps. At this bit rate the compressed
data is just 10% of the size of full-precision linear
PCM, and ACC decreased by 6% for MFCC and
by only 3% for PLP features.
• The results are worse for noisy channels: for speech
from a desktop microphone, a 50% decrease in ACC can be
observed for MFCC features when comparing standard PCM
with 24 kbps MP3. This decrease is only about 8% for
PLP features.
• The experiments showed that MP3-compressed
speech files from commonly available
consumer devices, such as MP3 players, recorders,
or mobile phones, can be used for off-line automatic
conversion of speech into text without a critical
loss of accuracy. PLP features seem to be
preferable for speech recognition in this case.
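The 10% size figure quoted above is consistent with simple bit-rate arithmetic. The sketch below assumes 16 kHz, 16-bit, mono linear PCM as the uncompressed reference. This is an assumption, since the sampling parameters are not stated in this section, and the function names are illustrative only.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bit rate of uncompressed linear PCM in kbps."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0


def relative_size(compressed_kbps: float, sample_rate_hz: int,
                  bits_per_sample: int, channels: int = 1) -> float:
    """Compressed size as a fraction of the PCM size for the same duration."""
    return compressed_kbps / pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels)


if __name__ == "__main__":
    # Assumed reference format: 16 kHz, 16-bit, mono (not confirmed by this section).
    ratio = relative_size(24, sample_rate_hz=16000, bits_per_sample=16)
    print(f"24 kbps MP3 relative to PCM: {ratio:.1%}")
```

Under these assumptions the PCM stream runs at 256 kbps, so a 24 kbps MP3 stream is a little under one tenth of its size.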
ACKNOWLEDGEMENTS
This research was supported by grant GA ČR
102/08/0707 “Speech Recognition under Real-World
Conditions” and by research activity MSM
6840770014 “Perspective Informative and Communications
Technicalities Research”.
REFERENCES
Barras, C., Lamel, L., and Gauvain, J.-L. (2001). Automatic
transcription of compressed broadcast audio. In Proc.
of the IEEE International Conference on Acoustics,
Speech, and Signal Processing, pages 265–268, Salt
Lake City, USA.
Bouvigne, G. (2007). MP3 standard. Homepage.
http://www.mp3-tech.org.
Bořil, H., Fousek, P., and Pollák, P. (2006). Data-driven
design of front-end filter bank for Lombard speech
recognition. In Proc. of ICSLP 2006, Pittsburgh, USA.
Brandenburg, K. and Popp, H. (2000). An introduction to
MPEG layer 3. EBU Technical Review.
Byrne, W., Doermann, D., Franz, M., Gustman, S., Hajič,
J., Oard, D., Picheny, M., Psutka, J., Ramabhadran,
B., Soergel, D., Ward, T., and Zhu, W.-J. (2004). Au-
tomatic recognition of spontaneous speech for access
to multilingual oral history archives. IEEE Trans. on
Speech and Audio Processing, Vol.12(No.4):420–435.
Chen, S. S., Eide, E., Gales, M. J. F., Gopinath, R. A.,
Kanevsky, D., and Olsen, P. (2002). Automatic
transcription of broadcast news. Speech Communication,
Vol.37(No.1-2):69–87.
Cheng, M. et al. (2008). LAME MP3 encoder 3.99
alpha 10. http://www.free-codecs.com.
ELRA (2009). Czech SPEECON database. Catalog No.
S0298. http://www.elra.info.
ETSI (2007). Digital cellular telecommunications system
(Phase 2+) (GSM). Test sequences for the Adaptive
Multi-Rate (AMR) speech codec. http://www.etsi.org.
Fousek, P. (2006). CtuCopy-Universal feature extractor and
speech enhancer. http://noel.feld.cvut.cz/speechlab.
Fousek, P. and Pollák, P. (2003). Additive noise and channel
distortion-robust parameterization tool. Performance
evaluation on Aurora 2 & 3. In Proc. of Eurospeech
2003, Geneva, Switzerland.
Gauvain, J.-L., Lamel, L., and Adda, G. (2002). The LIMSI
broadcast news transcription system. Speech Commu-
nication, Vol.37(No.1-2):89–108.
Hermansky, H. (1990). Perceptual linear predictive (PLP)
analysis of speech. Journal of the Acoustical Society
of America, Vol.87(No.4):1738–1752.
Huang, X., Acero, A., and Hon, H.-W. (2001). Spoken Lan-
guage Processing. Prentice Hall.
ITU-T (2007). International Telecommunication Union
Recommendation G.729, coding of speech at 8 kbit/s
using conjugate-structure algebraic-code-excited linear
prediction (CS-ACELP). http://www.itu.int/ITU-T.
Makhoul, J., Kubala, F., Leek, T., Liu, D., Nguyen, L.,
Schwartz, R., and Srivastava, A. (2000). Speech
and language technologies for audio indexing and re-
trieval. Proc. of the IEEE, Vol.88(No.8):1338–1353.
Nouza, J., Červa, P., and Ždánský, J. (2009). Very large
vocabulary voice dictation for mobile devices. In Proc.
of Interspeech 2009, pages 995–998, Brighton, UK.
Pollák, P. and Černocký, J. (2004). Czech SPEECON adult
database. Technical report.
Psutka, J., Müller, L., and Psutka, J. V. (2001). Comparison
of MFCC and PLP parameterization in the speaker
independent continuous speech recognition task. In
Proc. of Eurospeech 2001, Aalborg, Denmark.
Psutka, J., Psutka, J., Ircing, P., and Hoidekr, J. (2003).
Recognition of spontaneously pronounced TV ice-
hockey commentary. In Proc. of ISCA & IEEE Work-
shop on Spontaneous Speech Processing and Recog-
nition, pages 83–86, Tokyo.
Rajnoha, J. and Pollák, P. (2011). ASR systems in noisy
environment: Analysis and solutions for increasing noise
robustness. Radioengineering, Vol.20(No.1):74–84.
Valin, J.-M. (2007). The speex codec manual. version 1.2
beta 3. http://www.speex.org.
Vaněk, J. and Psutka, J. (2010). Gender-dependent acoustic
models fusion developed for automatic subtitling
of parliament meetings broadcasted by the Czech TV.
In Proc. of Text, Speech and Dialog, pages 431–438,
Brno, Czech Republic.
Young, S. et al. (2009). The HTK Book, Version 3.4.1.
Cambridge.
SIGMAP 2011 - International Conference on Signal Processing and Multimedia Applications