ASSESSING PROGRESSIVE FILTERING TO PERFORM HIERARCHICAL TEXT CATEGORIZATION IN PRESENCE OF INPUT IMBALANCE

Andrea Addis, Giuliano Armano, Eloisa Vargiu

Abstract

As the amount of available data grows (e.g., in digital libraries), so does the need for high-performance text categorization algorithms. So far, work on text categorization has mostly focused on “flat” approaches, i.e., algorithms that operate on non-hierarchical classification schemes. Hierarchical approaches are expected to perform better in the presence of a subsumption ordering among categories. In fact, following the “divide et impera” strategy, they partition the problem into smaller subproblems, each expected to be simpler to solve. In this paper, we illustrate and discuss the results obtained by assessing the “Progressive Filtering” (PF) technique, used to perform text categorization. Experiments on the Reuters Corpus (RCV1-v2) and DMOZ datasets focus on the ability of PF to deal with input imbalance. In particular, the experiments consist of: (i) comparing the results with those obtained using the corresponding flat approach; (ii) calculating the improvement in performance as the pipeline depth increases; and (iii) measuring performance in terms of generalization, specialization, and misclassification error, as well as unknown ratio. Experimental results show that, for the adopted datasets, PF is able to counteract great imbalances between negative and positive examples.
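The pipeline idea behind the abstract can be illustrated with a minimal sketch. It assumes that each category on a root-to-leaf path of the taxonomy has its own binary classifier acting as a filter, and that a document moves one level deeper only when the current filter accepts it. The names `progressive_filter` and `keyword_filter`, the toy taxonomy, and the keyword-based filters are hypothetical stand-ins for illustration; they are not the classifiers or settings used in the paper.

```python
from typing import Callable, Dict, List, Optional

# A filter is any binary classifier: it returns True to accept a document.
Filter = Callable[[str], bool]


def progressive_filter(
    doc: str, path: List[str], filters: Dict[str, Filter]
) -> Optional[str]:
    """Pass `doc` down a root-to-leaf path of categories.

    Returns the deepest category reached while every filter so far
    accepted the document, or None if the first filter rejects it.
    (Returning the deepest accepted category is an assumption of this sketch.)
    """
    accepted = None
    for category in path:
        if not filters[category](doc):
            break  # a rejection stops the pipeline at this depth
        accepted = category
    return accepted


def keyword_filter(*keywords: str) -> Filter:
    """Hypothetical stand-in for a trained classifier: accept on any keyword."""
    def accept(doc: str) -> bool:
        text = doc.lower()
        return any(k in text for k in keywords)
    return accept


# Toy two-level pipeline (illustrative categories, not taken from RCV1-v2 or DMOZ).
filters = {
    "Sports": keyword_filter("match", "team", "league"),
    "Sports/Soccer": keyword_filter("goal", "striker"),
}
path = ["Sports", "Sports/Soccer"]

print(progressive_filter("The striker scored a late goal for the team", path, filters))
print(progressive_filter("The team signed a new coach", path, filters))
print(progressive_filter("Parliament passed the budget", path, filters))
```

In this sketch a deeper filter only sees documents that its ancestors have already accepted, so each stage works on a smaller, pre-filtered stream than a flat classifier would.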



Paper Citation


in Harvard Style

Addis A., Armano G. and Vargiu E. (2010). ASSESSING PROGRESSIVE FILTERING TO PERFORM HIERARCHICAL TEXT CATEGORIZATION IN PRESENCE OF INPUT IMBALANCE. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 14-23. DOI: 10.5220/0003066300140023


in Bibtex Style

@conference{kdir10,
author={Andrea Addis and Giuliano Armano and Eloisa Vargiu},
title={ASSESSING PROGRESSIVE FILTERING TO PERFORM HIERARCHICAL TEXT CATEGORIZATION IN PRESENCE OF INPUT IMBALANCE},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={14-23},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003066300140023},
isbn={978-989-8425-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - ASSESSING PROGRESSIVE FILTERING TO PERFORM HIERARCHICAL TEXT CATEGORIZATION IN PRESENCE OF INPUT IMBALANCE
SN - 978-989-8425-28-7
AU - Addis A.
AU - Armano G.
AU - Vargiu E.
PY - 2010
SP - 14
EP - 23
DO - 10.5220/0003066300140023