EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING

Niraj Kumar; Venkata Vinay Babu Vemula; Kannan Srinathan; Vasudeva Varma

doi:10.5220/0003081201820187

2010

Abstract

This paper addresses the question: “How can we use Wikipedia-based concepts in document clustering with less human involvement while still improving the results?” We propose a method that exploits the importance of N-grams in a document and uses Wikipedia-based additional knowledge for GAAC (group-average agglomerative clustering) based document clustering. The importance of an N-gram in a document depends on many features, including, but not limited to, its frequency, the position of its occurrence in a sentence, and the position within the document of the sentence in which it occurs. First, we introduce a new similarity measure that takes weighted N-gram importance into account when computing the similarity between documents during clustering. As a result, the chances of grouping topically similar documents together are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that fail to match owing to differences in writing scheme or strategy. Our experimental results on a publicly available text dataset show that the devised system achieves a significant improvement in performance over bag-of-words based state-of-the-art systems in this area.
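The two clustering ingredients named in the abstract, a similarity measure over weighted N-grams and group-average agglomerative merging, can be illustrated with a minimal sketch. It is not the system described in the paper: the weighting formula in `ngram_weights` (each occurrence counts for more when it appears early in its sentence and in an early sentence) is an assumption made for illustration, the function names are hypothetical, cosine similarity stands in for the paper's similarity measure, and the Wikipedia-based filtering and concept-matching step is not modeled at all.

```python
from collections import Counter
from itertools import combinations
import math


def ngram_weights(sentences, n_max=2):
    """Weight N-grams by frequency and position (illustrative formula only).

    Every occurrence adds 1 / ((1 + sentence_index) * (1 + token_index)),
    so frequent, early N-grams get larger weights. This is an assumed
    stand-in, not the weighting used in the paper.
    """
    weights = Counter()
    for s_idx, sentence in enumerate(sentences):
        tokens = sentence.lower().split()
        for n in range(1, n_max + 1):
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                weights[gram] += 1.0 / ((1 + s_idx) * (1 + i))
    return weights


def weighted_cosine(a, b):
    """Cosine similarity between two N-gram weight vectors."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_average(cluster, sim):
    """Mean pairwise similarity over all distinct document pairs in a cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


def gaac(docs, k):
    """Naive group-average agglomerative clustering down to k clusters.

    docs is a list of documents, each a list of sentence strings.
    Returns a list of clusters, each a list of document indices.
    """
    vectors = [ngram_weights(d) for d in docs]
    n = len(vectors)
    sim = [[weighted_cosine(vectors[i], vectors[j]) for j in range(n)]
           for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        # Merge the pair whose union has the highest group-average similarity.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]] + clusters[p[1]], sim),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters
```

Calling `gaac(docs, k)` on a list of documents, each given as a list of sentence strings, returns `k` clusters of document indices. The merge loop re-evaluates every candidate pair on each pass, which keeps the sketch short but is only practical for small collections.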

References

Banerjee, S., Ramanathan, K., Gupta, A., 2007. Clustering Short Texts using Wikipedia; SIGIR'07, July 23-27, Amsterdam, The Netherlands.
Clauset, A., Newman, M., Moore, C., 2004. Finding community structure in very large networks. Physical Review E, 70:066111.
Hammouda, K., Matute, D., Kamel, M., 2005. CorePhrase: Keyphrase Extraction for Document Clustering; In IAPR: 4th International Conference on Machine Learning and Data Mining.
Han, J., Kim, T., Choi, J., 2007. Web Document Clustering by Using Automatic Keyphrase Extraction; Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops.
Hu, X., Zhang, X., Lu, C., Park, E., Zhou, X., 2009. Exploiting Wikipedia as External Knowledge for Document Clustering; KDD'09.
Huang, A., Milne, D., Frank, E., Witten, I. 2008. Clustering Documents with Active Learning Using Wikipedia. ICDM 2008.
Huang, A., Milne, D., Frank, E., Witten, I., 2009. Clustering documents using a wikipedia-based concept representation. In Proc 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining.
Kaufman, L., and Rousseeuw, P., 1999. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.
Kumar, N., Srinathan, K., 2008. Automatic Keyphrase Extraction from Scientific Documents Using N-gram Filtration Technique. In the Proceedings of ACM DocEng.
Newman, M., Girvan, M., 2004. Finding and evaluating community structure in networks. Physical Review E, 69:026113.
Steinbach, M., Karypis, G., and Kumar, V., 2000. A Comparison of document clustering techniques. Technical Report. Department of Computer Science and Engineering, University of Minnesota.
Tan, P., Steinbach, M., Kumar, V., 2006. Introduction to Data Mining; Addison-Wesley; ISBN-10: 0321321367.
Zhao, Y., Karypis, G., 2001. Criterion functions for document clustering: experiments and analysis, Technical Report. Department of Computer Science, University of Minnesota.

Paper Citation

in Harvard Style

Kumar N., Vinay Babu Vemula V., Srinathan K. and Varma V. (2010). EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 182-187. DOI: 10.5220/0003081201820187

in Bibtex Style

@conference{kdir10,
author={Niraj Kumar and Venkata Vinay Babu Vemula and Kannan Srinathan and Vasudeva Varma},
title={EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={182-187},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003081201820187},
isbn={978-989-8425-28-7},
}

in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING
SN - 978-989-8425-28-7
AU - Kumar N.
AU - Vinay Babu Vemula V.
AU - Srinathan K.
AU - Varma V.
PY - 2010
SP - 182
EP - 187
DO - 10.5220/0003081201820187