Effect of Feature Smoothing Methods in Text Classification Tasks

David Vilar, Hermann Ney, Alfons Juan, Enrique Vidal

Abstract

The number of features considered in a text classification system is given by the size of the vocabulary, which is normally in the tens or hundreds of thousands even for small tasks. This leads to parameter estimation problems for statistical methods, and countermeasures have to be found. One of the most widely used is to reduce the size of the vocabulary according to a well-defined criterion so that the set of parameters can be estimated reliably. The same problem is encountered in the field of language modeling, where several smoothing techniques have been developed. In this paper we show that using the full vocabulary together with a suitable choice of smoothing technique obtains better results on text classification tasks than the standard feature selection techniques.
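The idea the abstract describes can be illustrated with a minimal sketch: a multinomial naive Bayes classifier that keeps the full vocabulary and relies on smoothing instead of feature selection. The toy corpus, class labels, and the choice of Laplace (add-alpha) smoothing here are illustrative assumptions, not the paper's exact method (the paper compares several language-model smoothing techniques).

```python
from collections import Counter
import math

# Hypothetical toy training set: (label, document) pairs.
train = [
    ("sport", "ball goal team match"),
    ("sport", "team win goal"),
    ("tech", "cpu code compiler"),
    ("tech", "code bug compiler test"),
]

# Full vocabulary -- no feature selection step.
vocab = set()
for _, doc in train:
    vocab.update(doc.split())

# Per-class word counts and class priors.
counts = {}
priors = Counter()
for label, doc in train:
    priors[label] += 1
    counts.setdefault(label, Counter()).update(doc.split())

def log_posterior(doc, label, alpha=1.0):
    """Multinomial naive Bayes log-score with Laplace (add-alpha) smoothing,
    so that words unseen in a class still receive nonzero probability."""
    c = counts[label]
    total = sum(c.values())
    score = math.log(priors[label] / sum(priors.values()))
    for w in doc.split():
        if w not in vocab:
            continue  # out-of-vocabulary words carry no evidence here
        p = (c[w] + alpha) / (total + alpha * len(vocab))
        score += math.log(p)
    return score

def classify(doc):
    return max(counts, key=lambda lbl: log_posterior(doc, lbl))

print(classify("goal match team"))  # -> sport
print(classify("compiler bug"))     # -> tech
```

Without the smoothing term, any class-unseen word would drive the class probability to zero, which is precisely the estimation problem that motivates either pruning the vocabulary or, as the paper argues, smoothing over all of it.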



Paper Citation


in Harvard Style

Vilar D., Ney H., Juan A. and Vidal E. (2004). Effect of Feature Smoothing Methods in Text Classification Tasks. In Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2004), ISBN 972-8865-01-5, pages 108-117. DOI: 10.5220/0002682001080117


in Bibtex Style

@conference{pris04,
author={David Vilar and Hermann Ney and Alfons Juan and Enrique Vidal},
title={Effect of Feature Smoothing Methods in Text Classification Tasks},
booktitle={Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2004)},
year={2004},
pages={108-117},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002682001080117},
isbn={972-8865-01-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2004)
TI - Effect of Feature Smoothing Methods in Text Classification Tasks
SN - 972-8865-01-5
AU - Vilar D.
AU - Ney H.
AU - Juan A.
AU - Vidal E.
PY - 2004
SP - 108
EP - 117
DO - 10.5220/0002682001080117