Categorization of Very Short Documents

Mika Timonen

doi:10.5220/0004108300050016

2012

Abstract

Categorization of very short documents has become an important research topic in the field of text mining. Twitter status updates and market research data form an interesting corpus of documents that are, in most cases, fewer than 20 words long. Short documents have one major characteristic that differentiates them from traditional longer documents: each word usually occurs only once per document. This is called the TF=1 challenge. In this paper we conduct a comprehensive performance comparison of current feature weighting and categorization approaches using corpora of very short documents. In addition, we propose a novel feature weighting approach called Fragment Length Weighted Category Distribution that takes the challenges of short documents into consideration. The proposed approach is based on previous work on Bi-Normal Separation and on short document categorization using a Naive Bayes classifier. We compare the performance of the proposed approach against several traditional approaches, including Chi-Squared, Mutual Information, Term Frequency-Inverse Document Frequency and Residual Inverse Document Frequency. We also compare the performance of a Support Vector Machine classifier against other classification approaches such as k-Nearest Neighbors and Naive Bayes classifiers.
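
The TF=1 challenge can be illustrated with a minimal sketch. The toy corpus and the helper names (`idf`, `tf_idf`, `tf1_share`) are assumptions made for illustration and are not taken from the paper. When every term occurs once per document, a plain TF-IDF weight carries no more information than the IDF factor alone, which is one way to see why term-frequency-based weighting loses its discriminative power on short documents.

```python
import math
from collections import Counter

# Hypothetical toy corpus of very short documents (illustrative, not from the paper).
docs = [
    "great phone battery lasts long",
    "battery died after one day",
    "love the camera on this phone",
    "camera is blurry and slow",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Plain inverse document frequency, log(N / df)."""
    return math.log(n_docs / df[term])


def tf_idf(tokens):
    """Raw term count multiplied by IDF for each term in one document."""
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}


# Share of (document, term) pairs whose term frequency is exactly 1.
counts = [c for tokens in tokenized for c in Counter(tokens).values()]
tf1_share = sum(1 for c in counts if c == 1) / len(counts)

weights = [tf_idf(tokens) for tokens in tokenized]

print(f"share of terms with TF=1: {tf1_share:.2f}")
for term, weight in weights[0].items():
    # With TF=1 the TF-IDF weight is identical to the IDF value.
    print(f"{term:10s} tf-idf={weight:.3f} idf={idf(term):.3f}")
```

In this toy corpus no word repeats within a document, so the TF factor is constant and the weighting reduces to IDF. Real short-text corpora will contain some repeated words, but the same effect dominates.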

Paper Citation

in Harvard Style

Timonen M. (2012). Categorization of Very Short Documents. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 5-16. DOI: 10.5220/0004108300050016

in Bibtex Style

@conference{kdir12,
author={Mika Timonen},
title={Categorization of Very Short Documents},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={5-16},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004108300050016},
isbn={978-989-8565-29-7},
}

in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Categorization of Very Short Documents
SN - 978-989-8565-29-7
AU - Timonen M.
PY - 2012
SP - 5
EP - 16
DO - 10.5220/0004108300050016