A Study on Term Weighting for Text Categorization: A Novel Supervised Variant of tf.idf

Giacomo Domeniconi, Gianluca Moro, Roberto Pasolini, Claudio Sartori

Abstract

In text categorization and other data mining tasks, suitable term weighting methods can substantially boost effectiveness. Several term weighting methods have been presented in the literature, based on assumptions commonly derived from observing the distribution of words in documents. For example, the idf assumption states that words appearing in many documents are usually less important than rarer ones. In contrast to tf.idf and other weighting methods derived from information retrieval, more recently proposed schemes are supervised, i.e. based on knowledge of which categories the training documents belong to. We propose here a supervised variant of the tf.idf scheme, which computes the usual idf factor without considering documents of the category to be recognized, so that the importance of terms appearing frequently only within that category is not underestimated. A further proposed variant is additionally based on relevance frequency, which considers occurrences of words within the category itself. In extensive experiments on two commonly used text collections, comparing several unsupervised and supervised weighting schemes with two different learning methods, we show that the proposed schemes generally achieve accuracy better than or comparable to that of the others.
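The idea of computing idf while leaving out the documents of the target category can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `plain_idf` and `idf_excluding_category`, the toy corpus, and the smoothing (the `+ 1` in the numerator and the `max(1, ...)` in the denominator) are not taken from the abstract, which does not state an exact formula.

```python
import math
from collections import Counter

def plain_idf(docs):
    """Usual idf: log(N / df), computed over all documents."""
    doc_sets = [set(d) for d in docs]
    df = Counter(t for d in doc_sets for t in d)
    return {t: math.log(len(docs) / df[t]) for t in df}

def idf_excluding_category(docs, labels, category):
    """idf computed only on documents that do NOT belong to `category`.

    Smoothing is an assumption of this sketch: +1 in the numerator and
    max(1, df) in the denominator avoid log(0) and division by zero for
    terms that never occur outside the category.
    """
    outside = [set(d) for d, lab in zip(docs, labels) if lab != category]
    df = Counter(t for d in outside for t in d)
    vocab = {t for d in docs for t in d}
    return {t: math.log((len(outside) + 1) / max(1, df[t])) for t in vocab}

# Toy corpus, invented for illustration only.
docs = [
    ["goal", "match", "team"],
    ["goal", "team", "coach"],
    ["market", "stock", "team"],
    ["stock", "price", "market"],
]
labels = ["sport", "sport", "finance", "finance"]

plain = plain_idf(docs)
ec = idf_excluding_category(docs, labels, "sport")

# "goal" and "market" each occur in 2 of 4 documents, so plain idf cannot
# tell them apart; excluding "sport" documents raises the weight of "goal".
print(plain["goal"], plain["market"])
print(ec["goal"], ec["market"])
```

In this toy setting, plain idf assigns "goal" and "market" the same weight, while the variant that ignores "sport" documents gives "goal" a higher weight than "market" when "sport" is the category to be recognized. Multiplying these values by a term frequency would yield a per-document weight in the tf.idf style.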



Paper Citation


in Harvard Style

Domeniconi G., Moro G., Pasolini R. and Sartori C. (2015). A Study on Term Weighting for Text Categorization: A Novel Supervised Variant of tf.idf. In Proceedings of 4th International Conference on Data Management Technologies and Applications - Volume 1: DATA, ISBN 978-989-758-103-8, pages 26-37. DOI: 10.5220/0005511900260037


in Bibtex Style

@conference{data15,
author={Giacomo Domeniconi and Gianluca Moro and Roberto Pasolini and Claudio Sartori},
title={A Study on Term Weighting for Text Categorization: A Novel Supervised Variant of tf.idf},
booktitle={Proceedings of 4th International Conference on Data Management Technologies and Applications - Volume 1: DATA},
year={2015},
pages={26-37},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005511900260037},
isbn={978-989-758-103-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of 4th International Conference on Data Management Technologies and Applications - Volume 1: DATA
TI - A Study on Term Weighting for Text Categorization: A Novel Supervised Variant of tf.idf
SN - 978-989-758-103-8
AU - Domeniconi G.
AU - Moro G.
AU - Pasolini R.
AU - Sartori C.
PY - 2015
SP - 26
EP - 37
DO - 10.5220/0005511900260037