data set. This also suggests that the good performance of the bisecting K-means
method reported in Steinbach, Karypis, and Kumar's experiments likewise rests on
the use of large data sets.
4 Conclusions
In this paper, we described a general document clustering system prototype and
discussed three ways to achieve better performance. A brief overview of different
document clustering algorithms, vector weighting functions, and distance measures
was provided. From our experimental results, we draw the following conclusions.
For the HAC clustering method, the average-link inter-cluster distance measure
outperforms both the single-link and complete-link measures. Among weighting
functions in the vector space model, the TFIDF method is better than the binary
and TF methods. The TFIDF method assigns a weight to a feature by combining its
importance within a document with its power to distinguish that document from the
rest of the collection. An important term is often a medium-frequency word rather
than a high-frequency word (too common) or a low-frequency word (too specific). For
large data sets, the bisecting algorithm outperforms all the other methods, but for
small data sets the HAC method performs best. The K-means method performs similarly
to the buckshot method on large data sets, whereas on small data sets the buckshot
method is better than the K-means method. The advantage of the HAC method on small
data sets carries over to the buckshot method, which uses HAC to choose its initial
cluster centers.
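To make the comparison of linkage measures concrete, the following sketch (an illustration, not the system described in this paper) computes the three inter-cluster distances for clusters of Euclidean vectors; the function names and the use of Euclidean distance are our assumptions for the example:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(cluster_a, cluster_b):
    """Single-link: distance between the closest pair of points."""
    return min(euclidean(a, b) for a in cluster_a for b in cluster_b)

def complete_link(cluster_a, cluster_b):
    """Complete-link: distance between the farthest pair of points."""
    return max(euclidean(a, b) for a in cluster_a for b in cluster_b)

def average_link(cluster_a, cluster_b):
    """Average-link: mean distance over all cross-cluster pairs."""
    total = sum(euclidean(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))
```

By construction, the average-link distance always lies between the single-link and complete-link distances, which makes it less sensitive to a single outlier pair than either extreme.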
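The TFIDF idea above can be sketched in a few lines. This is a minimal illustration of the standard tf * log(N/df) scheme, not the exact weighting function used in our experiments; the function name and tokenized-document input format are assumptions for the example:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute TF-IDF weights for each term of each tokenized document.

    TF rewards terms frequent within a document; IDF (log N/df)
    discounts terms common across the whole collection, so terms
    appearing in every document get weight zero.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights
```

A term that occurs in every document receives IDF log(N/N) = 0, while a term frequent in one document but rare elsewhere receives a high weight, which is why medium-frequency words tend to dominate.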
References
1. F. Beil, M. Ester, and X. Xu, “Frequent Term-Based Text Clustering,” Proc. of the 8th
International Conference on Knowledge Discovery and Data Mining, 2002.
2. D. Cutting, D. Karger, J. Pedersen, and J. Tukey, “Scatter/Gather: a Cluster-based
Approach to Browsing Large Document Collections,” Proc. of the 15th ACM SIGIR
Conference, Copenhagen, Denmark, 1992, pp. 318-329.
3. B. C. M. Fung, Hierarchical Document Clustering Using Frequent Itemsets, Master's
thesis, Dept. Computer Science, Simon Fraser University, Canada, 2002.
4. G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, “Introduction to
WordNet: An On-Line Lexical Database,” International Journal of Lexicography, vol. 3, no.
4, 1990, pp. 235-312.
5. A. Ratnaparkhi, “A Maximum Entropy Part-Of-Speech Tagger,” Proc. of the Empirical
Methods in Natural Language Processing Conference, University of Pennsylvania, May
1996, pp. 17-18.
6. J. C. Reynar and A. Ratnaparkhi, “A Maximum Entropy Approach to Identifying Sentence
Boundaries,” Proc. of the Fifth Conference on Applied Natural Language Processing,
Washington, D.C., March 31-April 3, 1997.
7. M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clustering
Techniques,” KDD Workshop on Text Mining, 2000.
8. O. Zamir, Clustering Web Documents: A Phrase-Based Method for Group Search Engine
Results, Ph.D. dissertation, Dept. Computer Science & Engineering, Univ. of Washington,
1999.