BROADCAST NEWS PHONEME RECOGNITION BY SPARSE CODING
Joseph Razik, Sébastien Paris, Hervé Glotin
2012
Abstract
This paper presents a novel approach to phoneme recognition, which we intend to extend to a full automatic speech recognition (ASR) system. Conventional ASR systems rely on a GMM-HMM combination, which is a fully generative approach. Current discriminative methods are not tractable on large-scale data sets, especially with non-linear kernels. Our system introduces a new scheme that jointly uses sparse coding and an additive kernel approximation for fast SVM training for phoneme recognition. On a broadcast news corpus, our system outperforms GMMs by around 2.5%, and its computational cost is linear in the number of samples.
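The pipeline named in the abstract (sparse coding, an additive kernel, a linear-time SVM) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The dictionary is random rather than learned, ISTA stands in for whichever sparse solver the paper uses, the Hellinger kernel (an additive kernel whose feature map is exact) stands in for the paper's additive kernel approximation, and a Pegasos-style stochastic sub-gradient solver stands in for the SVM trainer. All function names, data sizes, and parameter values are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the L1 norm (shrinks values toward zero by t)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_code(X, D, lam=0.1, n_iter=100):
    """ISTA: minimise 0.5*||x - D a||^2 + lam*||a||_1 for each row x of X.

    X has shape (n, d); D has shape (d, k) with one atom per column.
    """
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    A = np.zeros((X.shape[0], D.shape[1]))
    for _ in range(n_iter):
        grad = (A @ D.T - X) @ D            # gradient of the quadratic term
        A = soft_threshold(A - step * grad, step * lam)
    return A


def hellinger_map(A):
    """Signed square root: explicit feature map of the Hellinger kernel.

    Assumption: this exact map replaces the paper's approximated kernel map.
    """
    return np.sign(A) * np.sqrt(np.abs(A))


def pegasos(F, y, lam=1e-2, n_epochs=5, seed=0):
    """Pegasos-style linear SVM; each epoch is a single pass over the samples."""
    rng = np.random.default_rng(seed)
    w = np.zeros(F.shape[1])
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(F.shape[0]):
            t += 1
            eta = 1.0 / (lam * t)           # decreasing step size
            margin = y[i] * (F[i] @ w)
            w *= 1.0 - eta * lam            # shrink from the L2 regulariser
            if margin < 1.0:                # hinge loss is active for this sample
                w += eta * y[i] * F[i]
    return w


# Synthetic two-class data standing in for acoustic frames (hypothetical sizes).
rng = np.random.default_rng(0)
n, d, k = 200, 13, 32
y = np.where(rng.random(n) < 0.5, -1.0, 1.0)
X = rng.normal(size=(n, d)) + 1.5 * y[:, None]

D = rng.normal(size=(d, k))
D /= np.linalg.norm(D, axis=0)              # unit-norm dictionary atoms

A = sparse_code(X, D)                       # sparse codes, shape (n, k)
F = hellinger_map(A)                        # explicit kernel features
w = pegasos(F, y)                           # linear SVM weights
pred = np.sign(F @ w)                       # predicted class per frame
```

Because the kernel is applied through an explicit feature map, the classifier stays linear, and each training epoch visits every sample once. That is the sense in which the cost grows linearly with the number of samples.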
Paper Citation
in Harvard Style
Razik J., Paris S. and Glotin H. (2012). BROADCAST NEWS PHONEME RECOGNITION BY SPARSE CODING. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM, ISBN 978-989-8425-99-7, pages 191-197. DOI: 10.5220/0003778201910197
in Bibtex Style
@conference{icpram12,
author={Joseph Razik and Sébastien Paris and Hervé Glotin},
title={BROADCAST NEWS PHONEME RECOGNITION BY SPARSE CODING},
booktitle={Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM},
year={2012},
pages={191-197},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003778201910197},
isbn={978-989-8425-99-7},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM
TI - BROADCAST NEWS PHONEME RECOGNITION BY SPARSE CODING
SN - 978-989-8425-99-7
AU - Razik J.
AU - Paris S.
AU - Glotin H.
PY - 2012
SP - 191
EP - 197
DO - 10.5220/0003778201910197