A NOVEL SUPERVISED TEXT CLASSIFIER FROM A SMALL TRAINING SET

Fabio Clarizia, Francesco Colace, Massimo De Santo, Luca Greco, Paolo Napoletano

Abstract

Text classification methods have been evaluated on supervised classification tasks over large datasets, where they show high accuracy. However, because these classifiers need to learn from many examples to perform well on a test set, they can be difficult to employ in real contexts. Most users of a practical system do not want to spend a long time labeling documents only to obtain a better level of accuracy; they prefer algorithms that achieve high accuracy without requiring a large amount of manual labeling. In this paper we propose a new supervised method for single-label text classification, based on a mixed Graph of Terms, that achieves good accuracy when the training set is only 1% of the original. The mixed Graph of Terms can be automatically extracted from a set of documents by a form of term clustering weighted by a probabilistic topic model. The method was tested on the top 10 classes of the ModApte split of the Reuters-21578 dataset after being trained on 1% of the original training set. The results confirm the discriminative property of the graph and show that the proposed method is comparable with existing methods trained on the whole training set.
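
As an illustration of the reduced training setting, the following minimal sketch draws roughly 1% of a labeled training set. The abstract does not say how the 1% subset was selected, so the per-class (stratified) sampling, the one-document-per-class floor, the function name `sample_fraction_per_class`, and the toy data are all assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict


def sample_fraction_per_class(labeled_docs, fraction=0.01, seed=0):
    """Return a per-class subsample of (doc_id, label) pairs.

    Assumption: sampling is stratified by class and keeps at least one
    document per class, so that small classes do not disappear.
    """
    rng = random.Random(seed)

    # Group document ids by their single label.
    by_label = defaultdict(list)
    for doc_id, label in labeled_docs:
        by_label[label].append(doc_id)

    subset = []
    for label in sorted(by_label):
        docs = by_label[label]
        k = max(1, round(len(docs) * fraction))
        subset.extend((doc_id, label) for doc_id in rng.sample(docs, k))
    return subset


# Toy stand-in for a labeled training set (hypothetical ids and sizes).
toy = [(f"earn-{i}", "earn") for i in range(300)]
toy += [(f"corn-{i}", "corn") for i in range(40)]

subset = sample_fraction_per_class(toy)
```

With these toy sizes the subset holds three "earn" documents and one "corn" document; the floor of one document per class is what keeps the smaller class represented at such a low sampling rate.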



Paper Citation


in Harvard Style

Clarizia F., Colace F., De Santo M., Greco L. and Napoletano P. (2011). A NOVEL SUPERVISED TEXT CLASSIFIER FROM A SMALL TRAINING SET. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011) ISBN 978-989-8425-79-9, pages 537-545. DOI: 10.5220/0003661105450553


in Bibtex Style

@conference{sstm11,
author={Fabio Clarizia and Francesco Colace and Massimo De Santo and Luca Greco and Paolo Napoletano},
title={A NOVEL SUPERVISED TEXT CLASSIFIER FROM A SMALL TRAINING SET},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011)},
year={2011},
pages={537-545},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003661105450553},
isbn={978-989-8425-79-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011)
TI - A NOVEL SUPERVISED TEXT CLASSIFIER FROM A SMALL TRAINING SET
SN - 978-989-8425-79-9
AU - Clarizia F.
AU - Colace F.
AU - De Santo M.
AU - Greco L.
AU - Napoletano P.
PY - 2011
SP - 537
EP - 545
DO - 10.5220/0003661105450553