UTILIZING TERM PROXIMITY BASED FEATURES TO IMPROVE TEXT DOCUMENT CLUSTERING
Shashank Paliwal, Vikram Pudi
2011
Abstract
Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods represent text documents using the simple Bag-of-Words (BOW) model, which assumes that the terms of a text document are independent of each other. Such single-term analysis of the text completely ignores the underlying (semantic) structure of a document. In the literature, considerable effort has gone into enriching the BOW representation with phrases and n-grams such as bi-grams and tri-grams. These approaches take into account dependency only between adjacent terms or a continuous sequence of terms. However, while some dependencies exist between adjacent words, others hold between more distant words. In this paper, we make an effort to enrich the traditional document vector by adding the notion of term-pair features. A term-pair feature is a pair of terms from the same document, which may be adjacent to each other or distant. We investigate the process of term-pair selection and propose a methodology to select potential term-pairs from a given document. Utilizing term proximity between distant terms also allows some flexibility for two documents to be judged similar if they are about similar topics but differ in writing style. Experimental results on a standard web document data set show that clustering performance is substantially improved by adding term-pair features.
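The idea of enriching a BOW vector with term-pair features can be illustrated with a minimal Python sketch. It is not the selection methodology proposed in the paper, which the abstract does not spell out. It assumes, purely for illustration, that a term-pair is any two distinct terms occurring within a fixed positional gap (the hypothetical parameter max_gap), that pairs are unordered, and that documents are compared with cosine similarity. The function names are likewise hypothetical.

```python
from collections import Counter
from math import sqrt


def term_pair_features(tokens, max_gap=3):
    """Count unordered pairs of distinct terms whose positions differ by at most max_gap.

    Assumption: a fixed positional window stands in for the paper's
    term-pair selection step, which is not described in the abstract.
    """
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at most max_gap positions from the current term.
        for j in range(i + 1, min(i + 1 + max_gap, len(tokens))):
            right = tokens[j]
            if left != right:
                # Sorting makes the pair order-independent, so "a ... b"
                # and "b ... a" map to the same feature.
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


def enriched_vector(text, max_gap=3):
    """Build a BOW term-count vector extended with term-pair counts."""
    tokens = text.lower().split()
    vector = Counter(tokens)  # single-term (BOW) features
    vector.update(term_pair_features(tokens, max_gap))  # term-pair features
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    doc_a = enriched_vector("document clustering with term proximity")
    doc_b = enriched_vector("term proximity improves clustering of each document")
    print(cosine(doc_a, doc_b))
```

Because the pairs are unordered and may span several positions, two documents that mention the same terms near each other but in a different order still share pair features, which is the flexibility with respect to writing style that the abstract refers to.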
References
- Ahlgren, P. and Colliander, C. (2009). Document-document similarity approaches and science mapping: Experimental comparison of five approaches. Journal of Informetrics, 3(1):49-63.
- Andrews, N. O. and Fox, E. A. (2007). Recent Developments in Document Clustering. Technical report, Computer Science, Virginia Tech.
- Beeferman, D., Berger, A., and Lafferty, J. (1997). A model of lexical attraction and repulsion. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, EACL '97, pages 373-380, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Bekkerman, R. and Allan, J. (2003). Using bigrams in text categorization.
- Chim, H. and Deng, X. (2007). A new suffix tree similarity measure for document clustering. In Proceedings of the 16th international conference on World Wide Web, WWW '07, pages 121-130, New York, NY, USA. ACM.
- Croft, W. B. and Harper, D. J. (1997). Using probabilistic models of document retrieval without relevance information, pages 339-344. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
- Cummins, R. and O'Riordan, C. (2009). Learning in a pairwise term-term proximity framework for information retrieval. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pages 251-258, New York, NY, USA. ACM.
- Fagan, J. (1987). Automatic phrase indexing for document retrieval. In Proceedings of the 10th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '87, pages 91-101, New York, NY, USA. ACM.
- Fuhr, N. (1992). Probabilistic models in information retrieval. The Computer Journal, 35:243-255.
- Hammouda, K. M. and Kamel, M. S. (2004). Efficient phrase-based document indexing for web document clustering. IEEE Trans. on Knowl. and Data Eng., 16:1279-1296.
- Hawking, D. and Thistlewaite, P. (1996). Relevance weighting using distance between term occurrences. Technical report, The Australian National University.
- Lafferty, J. and Zhai, C. (2001). Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '01, pages 111-119, New York, NY, USA. ACM.
- Ponte, J. M. and Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '98, pages 275-281, New York, NY, USA. ACM.
- Rasolofo, Y. and Savoy, J. (2003). Term proximity scoring for keyword-based retrieval systems. In Proceedings of the 25th European conference on IR research, ECIR'03, pages 207-218, Berlin, Heidelberg. Springer-Verlag.
- Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Commun. ACM, 18:613-620.
- Song, R., Taylor, M. J., Wen, J.-R., Hon, H.-W., and Yu, Y. (2008). Viewing term proximity from a different perspective. In Proceedings of the IR research, 30th European conference on Advances in information retrieval, ECIR'08, pages 346-357, Berlin, Heidelberg. Springer-Verlag.
- Tao, T. and Zhai, C. (2007). An exploration of proximity measures in information retrieval. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, pages 295-302, New York, NY, USA. ACM.
- Zamir, O. and Etzioni, O. (1999). Grouper: A dynamic clustering interface to web search results. In Proceedings of the eighth international conference on World Wide Web, pages 1361-1374.
- Zhao, J. and Yun, Y. (2009). A proximity language model for information retrieval. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pages 291-298, New York, NY, USA. ACM.
Paper Citation
in Harvard Style
Paliwal S. and Pudi V. (2011). UTILIZING TERM PROXIMITY BASED FEATURES TO IMPROVE TEXT DOCUMENT CLUSTERING. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011) ISBN 978-989-8425-79-9, pages 529-536. DOI: 10.5220/0003645805370544
in Bibtex Style
@conference{sstm11,
author={Shashank Paliwal and Vikram Pudi},
title={UTILIZING TERM PROXIMITY BASED FEATURES TO IMPROVE TEXT DOCUMENT CLUSTERING},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011)},
year={2011},
pages={529-536},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003645805370544},
isbn={978-989-8425-79-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011)
TI - UTILIZING TERM PROXIMITY BASED FEATURES TO IMPROVE TEXT DOCUMENT CLUSTERING
SN - 978-989-8425-79-9
AU - Paliwal S.
AU - Pudi V.
PY - 2011
SP - 529
EP - 536
DO - 10.5220/0003645805370544