SCALABILITY OF TEXT CLASSIFICATION

Ioannis Antonellis, Christos Bouras, Vassilis Poulopoulos, Anastasios Zouzias

Abstract

We explore scalability issues of the text classification problem, in which (multi-)labeled training documents are used to build classifiers that assign documents to classes and permit assignment to multiple classes. We introduce a new class of classification problems, called ‘scalable’, that models many problems from the area of Web mining. Scalability is defined as the ability of a classifier to adjust classification results on a ‘per-user’ basis. Furthermore, we investigate different ways to interpret personalization of classification results by analyzing well-known text datasets and exploring existing classifiers. We present solutions for the scalable classification problem based on standard classification techniques, as well as an algorithm that relies on semantic analysis using the decomposition of a document into its sentences. Experimental results concerning the scalability property and the performance of these algorithms are provided using the 20 Newsgroups dataset and a dataset consisting of web news.
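
To make the ‘per-user’ scalability property concrete, the following is a minimal illustrative sketch, not the algorithm of the paper. It assumes bag-of-words term-frequency centroids per class, cosine similarity, and a naive punctuation-based sentence splitter; the function names and the `user_threshold` parameter are hypothetical and introduced only for illustration. Each class is scored by its best-matching sentence, and a lower threshold lets the same document fall into more classes for that user.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(document):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_centroids(labeled_docs):
    """Build one term-frequency centroid per class from (text, labels) pairs.

    A multi-labeled document contributes to every class it is labeled with.
    """
    centroids = {}
    for text, labels in labeled_docs:
        counts = Counter(tokenize(text))
        for label in labels:
            centroids.setdefault(label, Counter()).update(counts)
    return centroids


def classify(document, centroids, user_threshold):
    """Return every class whose best sentence-level score reaches the threshold.

    `user_threshold` is a hypothetical per-user setting: a lower value
    assigns the document to more classes, a higher value to fewer.
    """
    sentences = [Counter(tokenize(s)) for s in split_sentences(document)]
    selected = []
    for label, centroid in centroids.items():
        score = max((cosine(s, centroid) for s in sentences), default=0.0)
        if score >= user_threshold:
            selected.append(label)
    return sorted(selected)


centroids = train_centroids([
    ("football match goal team", ["sports"]),
    ("election government vote results", ["politics"]),
])
doc = "Football match tonight. Election soon."
strict_user = classify(doc, centroids, 0.5)
lenient_user = classify(doc, centroids, 0.3)
```

In this toy setting the two users receive different class sets for the same document, which is the behavior the scalability definition above asks for; a realistic system would replace the raw term counts with a proper weighting scheme and a trained classifier.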



Paper Citation


in Harvard Style

Antonellis I., Bouras C., Poulopoulos V. and Zouzias A. (2006). SCALABILITY OF TEXT CLASSIFICATION. In Proceedings of WEBIST 2006 - Second International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-972-8865-46-7, pages 408-413. DOI: 10.5220/0001240904080413


in Bibtex Style

@conference{webist06,
author={Ioannis Antonellis and Christos Bouras and Vassilis Poulopoulos and Anastasios Zouzias},
title={SCALABILITY OF TEXT CLASSIFICATION},
booktitle={Proceedings of WEBIST 2006 - Second International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2006},
pages={408-413},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001240904080413},
isbn={978-972-8865-46-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of WEBIST 2006 - Second International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - SCALABILITY OF TEXT CLASSIFICATION
SN - 978-972-8865-46-7
AU - Antonellis I.
AU - Bouras C.
AU - Poulopoulos V.
AU - Zouzias A.
PY - 2006
SP - 408
EP - 413
DO - 10.5220/0001240904080413