Figure 4: F-measure of clustering algorithms (Topic-Set2).
As shown in Figure 3 and Figure 4, the AS index and
the DB index outperform the traditional clustering
algorithms (K-means, group average) on all topic
sets, whereas the CH index and the SD index perform
worse than the traditional clustering algorithms.
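For readers who want to reproduce this kind of comparison, the following is a minimal sketch of one common clustering F-measure: for each gold topic, take the best F1 over all clusters and weight it by topic size. This section does not restate the exact formula behind Figures 3 and 4, so this definition is an assumption, and the name clustering_f_measure is hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-weighted best-match F-measure (assumed definition, see lead-in)."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    # Number of documents shared by each (topic, cluster) pair.
    overlap = Counter(zip(true_labels, cluster_labels))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best_f = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best_f = max(best_f, 2 * precision * recall / (precision + recall))
        # Weight each topic by its share of the collection.
        total += (n_cls / n) * best_f
    return total


perfect = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 1, 1])
mixed = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 1])
```

Under this definition a clustering that reproduces the gold topics exactly scores 1.0, and any mixing of topics within a cluster lowers the score.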
Table 3: Running time of GC with each cluster index.

Cluster Index    Time (s)
GC(DB Index)     2,549
GC(CH Index)     3,612
GC(SD Index)     3,892
GC(AS Index)     15
Table 3 shows the running time of GC with each
cluster index. The AS Index runs far faster than
the other cluster indices.
Consequently, the AS Index offers both the best
performance and the fastest running time for
Greedy Clustering.
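To make the gap in Table 3 concrete, the short sketch below computes how many times longer each index takes than the AS Index. The timings are copied from Table 3, read as seconds. The dictionary and function names are illustrative only.

```python
# Running times from Table 3 (interpreted as seconds).
running_time_s = {"DB": 2549, "CH": 3612, "SD": 3892, "AS": 15}


def slowdown_vs(baseline, times):
    """Return each index's running time as a multiple of the baseline's."""
    return {
        name: t / times[baseline]
        for name, t in times.items()
        if name != baseline
    }


ratios = slowdown_vs("AS", running_time_s)
for name, ratio in ratios.items():
    print(f"GC({name} Index) takes about {ratio:.0f}x as long as GC(AS Index)")
```

On these figures the other three indices take roughly 170 to 260 times as long as the AS Index.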
5 CONCLUSIONS
In this paper, we proposed a greedy algorithm for
document clustering (Greedy Clustering). The main
goal of this paper was to find the optimal objective
function for Greedy Clustering, that is, one that
gives high performance and a fast running time.
To that end, we evaluated several cluster indices as
the objective function for Greedy Clustering. The
experimental results show that, among the cluster
indices tested (DB, CH, SD, AS), the Average
Similarity index is the most suitable for Greedy
Clustering. Moreover, Greedy Clustering with the AS
Index performs 15 to 20% better than traditional
clustering algorithms (K-means, Group Average
Clustering).
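The idea of using a cluster index as the objective of a greedy search can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper. It assumes the AS index is the mean cosine similarity between each document and its cluster centroid, and it assumes a simple greedy move: reassign one document at a time whenever that raises the index. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def average_similarity(docs, labels, k):
    """Assumed AS index: mean cosine similarity of documents to their centroid."""
    dim = len(docs[0])
    # Summed vectors serve as centroids; cosine ignores the scale.
    centroids = [[0.0] * dim for _ in range(k)]
    for vec, lab in zip(docs, labels):
        for i, x in enumerate(vec):
            centroids[lab][i] += x
    sims = [cosine(vec, centroids[lab]) for vec, lab in zip(docs, labels)]
    return sum(sims) / len(docs)


def greedy_clustering(docs, k, max_passes=20):
    """Greedily move single documents while the index keeps improving."""
    labels = [i % k for i in range(len(docs))]  # arbitrary starting assignment
    best = average_similarity(docs, labels, k)
    for _ in range(max_passes):
        improved = False
        for d in range(len(docs)):
            best_label = labels[d]
            for c in range(k):
                if c == best_label:
                    continue
                labels[d] = c  # tentatively move document d to cluster c
                score = average_similarity(docs, labels, k)
                if score > best + 1e-12:
                    best, best_label, improved = score, c, True
            labels[d] = best_label  # keep the best assignment found for d
        if not improved:
            break  # local optimum: no single move raises the index
    return labels, best


# Toy term vectors: two documents near each axis.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, score = greedy_clustering(docs, k=2)
```

Each tentative move recomputes the index over the whole collection, which is why the cost of the index dominates the running time of a search like this.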
However, Greedy Clustering has a weakness: its
running time is long compared with traditional
clustering algorithms because of the cost of
computing the cluster index. We plan to address this
problem by optimizing Greedy Clustering with the
AS Index.
ACKNOWLEDGEMENTS
This research was supported by the Basic Science
Research Program through the National Research
Foundation of Korea (NRF), funded by the Ministry
of Education, Science and Technology
(No. 2011-0004389), and by the second stage of the
Brain Korea 21 Project in 2011.
ICPRAM 2012 - International Conference on Pattern Recognition Applications and Methods