EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING
Niraj Kumar, Venkata Vinay Babu Vemula, Kannan Srinathan, Vasudeva Varma
2010
Abstract
This paper addresses the question: “How can we use Wikipedia-based concepts in document clustering with less human involvement while still achieving effective improvements in results?” We propose a method that exploits the importance of N-grams in a document and uses Wikipedia-based additional knowledge for GAAC (group-average agglomerative clustering) based document clustering. The importance of an N-gram in a document depends on many features, including, but not limited to, its frequency, the position of its occurrence in a sentence, and the position of that sentence in the document. First, we introduce a new similarity measure that takes weighted N-gram importance into account when computing similarity during document clustering. As a result, the chances of topical similarity within clusters are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that do not match owing to differences in writing scheme or strategy. Our experimental results on a publicly available text dataset show that the devised system significantly improves performance over state-of-the-art bag-of-words based systems in this area.
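To make the pipeline described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy importance score in which each N-gram occurrence counts for more when it appears early in its sentence and when that sentence appears early in the document. It also assumes cosine similarity over those weights and the usual group-average criterion (the mean similarity over all distinct document pairs in a merged cluster). The function names `ngram_weights`, `cosine`, and `gaac` and the sample documents are hypothetical, and the Wikipedia-based filtering and concept-matching step is omitted.

```python
import math
from collections import defaultdict


def ngram_weights(sentences, n_max=2):
    """Toy N-gram importance (an assumed scheme, not the paper's formula).

    Each occurrence adds 1 / ((1 + sentence index) * (1 + token index)),
    so frequent N-grams and early positions both raise the weight.
    """
    weights = defaultdict(float)
    for s_idx, sentence in enumerate(sentences):
        tokens = sentence.lower().split()
        for n in range(1, n_max + 1):
            for t_idx in range(len(tokens) - n + 1):
                gram = " ".join(tokens[t_idx:t_idx + n])
                weights[gram] += 1.0 / ((1 + s_idx) * (1 + t_idx))
    return dict(weights)


def cosine(a, b):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * b[g] for g, w in a.items() if g in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gaac(docs, k):
    """Group-average agglomerative clustering down to k clusters.

    Each document is a list of sentences. At every step the pair of
    clusters whose union has the highest mean pairwise similarity
    (self-pairs excluded) is merged.
    """
    vectors = [ngram_weights(d) for d in docs]
    sim = [[cosine(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(docs))]

    def group_average(c1, c2):
        merged = c1 + c2
        total = sum(sim[i][j] for i in merged for j in merged if i != j)
        return total / (len(merged) * (len(merged) - 1))

    while len(clusters) > k:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        a, b = max(pairs,
                   key=lambda p: group_average(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical sample input: two documents per topic.
example_docs = [
    ["wikipedia concepts improve document clustering",
     "clustering uses wikipedia concepts"],
    ["document clustering via wikipedia concepts"],
    ["football teams win league matches",
     "matches need football teams"],
    ["league matches from football teams"],
]
```

Calling `gaac(example_docs, 2)` groups the documents by their shared N-grams. The sketch recomputes the group average from scratch at every step, which keeps the code short at the cost of efficiency; a practical system would cache cluster-level sums.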
Paper Citation
in Harvard Style
Kumar N., Vinay Babu Vemula V., Srinathan K. and Varma V. (2010). EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 182-187. DOI: 10.5220/0003081201820187
in Bibtex Style
@conference{kdir10,
author={Niraj Kumar and Venkata Vinay Babu Vemula and Kannan Srinathan and Vasudeva Varma},
title={EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={182-187},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003081201820187},
isbn={978-989-8425-28-7},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING
SN - 978-989-8425-28-7
AU - Kumar N.
AU - Vinay Babu Vemula V.
AU - Srinathan K.
AU - Varma V.
PY - 2010
SP - 182
EP - 187
DO - 10.5220/0003081201820187