A Comparison of Document Clustering Algorithms

Yong Wang, Julia Hodges

Abstract

Document clustering is a widely used strategy in information retrieval and text data mining. This paper describes preliminary work in ongoing research on document clustering. We have implemented a prototype document clustering system and studied some basic aspects of the problem. Our experimental results indicate that the average-link inter-cluster distance measure and the TFIDF weighting function perform well for document clustering. Other investigators have reported that bisecting K-means is the preferred method for document clustering. In our experiments, however, we found that although bisecting K-means has advantages on large datasets, a traditional hierarchical clustering algorithm still achieves the best performance on small datasets.
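To make the two techniques named in the abstract concrete, the following is a minimal sketch of TFIDF weighting followed by hierarchical (agglomerative) clustering with the average-link inter-cluster distance. It is not the authors' prototype. The abstract does not state which TFIDF variant or document distance was used, so the weighting tf * log(N / df), the cosine distance, and the function names `tfidf_vectors`, `cosine_distance`, `average_link`, and `agglomerative` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TFIDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors (assumed document distance)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def average_link(vectors, a, b):
    """Average-link distance: mean pairwise distance between members of a and b."""
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerative(vectors, k):
    """Merge the closest pair of clusters repeatedly until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest average-link distance.
        _, x, y = min(
            (average_link(vectors, clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy corpus (invented for this example): two topics, two documents each.
docs = [
    "cluster document text".split(),
    "document cluster retrieval".split(),
    "football match goal".split(),
    "goal match score".split(),
]
clusters = agglomerative(tfidf_vectors(docs), k=2)
print(clusters)
```

This naive version recomputes every pairwise cluster distance at each merge step, so its cost grows quickly with the number of documents. That is consistent with the abstract's observation that hierarchical clustering suits small datasets, while a partitional method such as bisecting K-means has advantages on large ones.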

Paper Citation


in Harvard Style

Wang Y. and Hodges J. (2005). A Comparison of Document Clustering Algorithms. In Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005) ISBN 972-8865-28-7, pages 186-191. DOI: 10.5220/0002557501860191


in Bibtex Style

@conference{pris05,
author={Yong Wang and Julia Hodges},
title={A Comparison of Document Clustering Algorithms},
booktitle={Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005)},
year={2005},
pages={186-191},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002557501860191},
isbn={972-8865-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005)
TI - A Comparison of Document Clustering Algorithms
SN - 972-8865-28-7
AU - Wang Y.
AU - Hodges J.
PY - 2005
SP - 186
EP - 191
DO - 10.5220/0002557501860191