GU METRIC - A New Feature Selection Algorithm for Text Categorization

Gulden Uchyigit, Keith Clark

Abstract

To improve the scalability of text categorization and reduce over-fitting, it is desirable to reduce the number of words used for categorization. Further, it is desirable to achieve this automatically, without sacrificing categorization accuracy. Techniques that do so are known as automatic feature selection methods. Typically, each word is assigned a weight (using a word scoring metric) and the top-scoring words are then used to describe a document collection. Several word scoring metrics have been employed in the literature. In this paper we present a novel feature selection method called the GU metric. Details of a comparative evaluation against the other methods are given. The results show that the GU metric outperforms some of the other well-known feature selection methods.
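The general score-and-rank procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not define the GU metric, so the sketch below substitutes the chi-square statistic, a common word scoring metric in the feature selection literature, purely as a stand-in. The function names `chi_square_scores` and `select_top_k`, the toy documents, and the alphabetical tie-breaking rule are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def chi_square_scores(docs, labels, positive):
    """Score each word by the chi-square statistic of its association
    with the `positive` category (a stand-in metric, NOT the GU metric).

    docs   -- list of token lists, one per document
    labels -- list of category labels, aligned with docs
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = n - n_pos

    # Document frequency of each word inside and outside the category.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (df_pos if y == positive else df_neg).update(set(tokens))

    scores = {}
    for word in set(df_pos) | set(df_neg):
        a = df_pos[word]   # positive documents containing the word
        b = df_neg[word]   # negative documents containing the word
        c = n_pos - a      # positive documents without the word
        d = n_neg - b      # negative documents without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[word] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring words; ties are broken alphabetically
    (an arbitrary choice made only so the output is deterministic)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Toy collection (hypothetical data, for illustration only).
docs = [
    ["cheap", "pills", "offer"],
    ["cheap", "offer", "now"],
    ["meeting", "agenda", "now"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi_square_scores(docs, labels, positive="spam")
selected = select_top_k(scores, k=3)
```

In this toy collection the word "now" appears equally often in both categories and therefore scores zero, while words confined to one category rank highest. Any other word scoring metric could be substituted for `chi_square_scores` without changing the selection step.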



Paper Citation


in Harvard Style

Uchyigit G. and Clark K. (2007). GU METRIC - A New Feature Selection Algorithm for Text Categorization. In Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 2: ICEIS, ISBN 978-972-8865-89-4, pages 399-402. DOI: 10.5220/0002365503990402


in Bibtex Style

@conference{iceis07,
author={Gulden Uchyigit and Keith Clark},
title={GU METRIC - A New Feature Selection Algorithm for Text Categorization},
booktitle={Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2007},
pages={399-402},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002365503990402},
isbn={978-972-8865-89-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - GU METRIC - A New Feature Selection Algorithm for Text Categorization
SN - 978-972-8865-89-4
AU - Uchyigit G.
AU - Clark K.
PY - 2007
SP - 399
EP - 402
DO - 10.5220/0002365503990402