Feature Transformation and Reduction for Text Classification

Artur J. Ferreira, Mario Figueiredo

Abstract

Text classification is an important tool for many applications, in supervised, semi-supervised, and unsupervised scenarios. To be processed by machine learning methods, a text (document) is usually represented as a bag-of-words (BoW). A BoW is a high-dimensional feature vector (usually stored as floating-point values) in which each entry represents the relative frequency of occurrence of a given word or term in the document. Typically, the number of features is large and many of them may be non-informative for classification, hence the need for feature transformation, reduction, and selection. In this paper, we propose two efficient algorithms for feature transformation and reduction for BoW-like representations. The proposed algorithms rely on simple statistical analysis of the input pattern, exploiting the BoW and its binary version. The algorithms are evaluated with support vector machine (SVM) and AdaBoost classifiers on standard benchmark datasets. The experimental results show that the reduced/transformed binary features are adequate for text classification problems and that the proposed methods improve the test-set error rate.
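As an illustration of the representation the abstract describes, the following minimal sketch builds a relative-frequency BoW vector for each document and derives its binary version. It is not one of the two proposed algorithms; the function names (`build_vocabulary`, `bow_vector`, `binarize`), the whitespace tokenization, and the toy documents are assumptions made only for this example.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return a sorted list of the distinct terms across all documents.

    Assumption: terms are lower-cased, whitespace-separated tokens.
    """
    return sorted({term for doc in documents for term in doc.lower().split()})


def bow_vector(document, vocabulary):
    """Relative frequency of each vocabulary term in a single document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # Counter returns 0 for terms that do not occur in the document.
    return [counts[term] / total if total else 0.0 for term in vocabulary]


def binarize(vector):
    """Binary BoW: 1 if the term occurs in the document, 0 otherwise."""
    return [1 if value > 0 else 0 for value in vector]


# Hypothetical toy corpus, used only to exercise the functions.
docs = ["the cat sat on the mat", "the dog barked"]
vocab = build_vocabulary(docs)
bow = [bow_vector(d, vocab) for d in docs]
binary = [binarize(v) for v in bow]
```

Each row of `bow` sums to 1 because entries are relative frequencies, while each row of `binary` only records which terms are present. Real corpora would normally call for a proper tokenizer and a sparse matrix rather than dense Python lists.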


Paper Citation


in Harvard Style

Ferreira, A. J. and Figueiredo, M. (2010). Feature Transformation and Reduction for Text Classification. In Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2010), ISBN 978-989-8425-14-0, pages 72-81. DOI: 10.5220/0003028100720081


in Bibtex Style

@conference{pris10,
author={Artur J. Ferreira and Mario Figueiredo},
title={Feature Transformation and Reduction for Text Classification},
booktitle={Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2010)},
year={2010},
pages={72-81},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003028100720081},
isbn={978-989-8425-14-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2010)
TI - Feature Transformation and Reduction for Text Classification
SN - 978-989-8425-14-0
AU - Ferreira, A. J.
AU - Figueiredo, M.
PY - 2010
SP - 72
EP - 81
DO - 10.5220/0003028100720081