A Comparison of Document Clustering Algorithms
Yong Wang, Julia Hodges
2005
Abstract
Document clustering is a widely used strategy for information retrieval and text data mining. This paper describes preliminary work in ongoing research on document clustering. We have implemented a prototype document clustering system and studied some basic aspects of the document clustering problem. Our experimental results indicate that the average-link inter-cluster distance measure and the TFIDF weighting function work well for document clustering. Other investigators have indicated that the bisecting K-means method is the preferred method for document clustering. In our experiments, however, although the bisecting K-means method has advantages on large datasets, a traditional hierarchical clustering algorithm still achieves the best performance on small datasets.
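As a rough illustration of the two techniques named in the abstract, the following minimal sketch combines TFIDF weighting with average-link hierarchical (agglomerative) clustering. It is not the authors' prototype: the whitespace tokenizer, the tf * log(N / df) weighting variant, the cosine distance, and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df).

    Assumption: plain lowercase whitespace tokenization; the paper may use
    a different tokenizer and a different TFIDF variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors (assumed distance measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def average_link(a, b, dist):
    """Average-link distance: mean pairwise distance between two clusters."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


def agglomerative(docs, k):
    """Merge the closest pair of clusters (average link) until k remain."""
    vectors = tfidf_vectors(docs)
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest average-link distance.
        _, p, q = min(
            (average_link(clusters[p], clusters[q], dist), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


docs = [
    "apple banana apple",
    "banana apple fruit",
    "car engine wheel",
    "engine car road",
]
print(agglomerative(docs, 2))
```

On this toy collection the two fruit documents and the two car documents share no terms across groups, so they end up in separate clusters. The naive pair search recomputes every inter-cluster distance at each merge, which is only workable for small collections.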
References
- F. Beil, M. Ester, and X. Xu, “Frequent Term-Based Text Clustering,” Proc. of the 8th International Conference on Knowledge Discovery and Data Mining, 2002.
- D. Cutting, D. Karger, J. Pedersen, and J. Tukey, “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections,” Proc. of the 15th ACM SIGIR Conference, Copenhagen, Denmark, 1992, pp. 318-329.
- B. C. M. Fung, Hierarchical Document Clustering Using Frequent Itemsets, Master Thesis, Dept. Computer Science, Simon Fraser University, Canada, 2002.
- G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, “Introduction to WordNet: An On-Line Lexical Database,” International Journal of Lexicography, vol. 3, no. 4, 1990, pp. 235-312.
- A. Ratnaparkhi, “A Maximum Entropy Part-Of-Speech Tagger,” Proc. of the Empirical Methods in Natural Language Processing Conference, University of Pennsylvania, May 17-18, 1996.
- J. C. Reynar and A. Ratnaparkhi, “A Maximum Entropy Approach to Identifying Sentence Boundaries,” Proc. of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31-April 3, 1997.
- M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clustering Techniques,” KDD Workshop on Text Mining, 2000.
- O. Zamir, Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results, Ph.D. dissertation, Dept. Computer Science & Engineering, Univ. of Washington, 1999.
Paper Citation
in Harvard Style
Wang Y. and Hodges J. (2005). A Comparison of Document Clustering Algorithms. In Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005) ISBN 972-8865-28-7, pages 186-191. DOI: 10.5220/0002557501860191
in Bibtex Style
@conference{pris05,
author={Yong Wang and Julia Hodges},
title={A Comparison of Document Clustering Algorithms},
booktitle={Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005)},
year={2005},
pages={186-191},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002557501860191},
isbn={972-8865-28-7},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005)
TI - A Comparison of Document Clustering Algorithms
SN - 972-8865-28-7
AU - Wang Y.
AU - Hodges J.
PY - 2005
SP - 186
EP - 191
DO - 10.5220/0002557501860191