GREEDY APPROACH FOR DOCUMENT CLUSTERING

Lim Choen Choi, Soon Cheol Park

Abstract

This paper proposes a greedy algorithm for document clustering (Greedy Clustering). Several cluster validity indices (DB, CH, SD, AS) are evaluated to find the most appropriate optimization function for Greedy Clustering. The clustering algorithms are tested and compared on the Reuters-21578 collection. Across the experiments, the AS index gives the best clustering performance and the fastest running time among the indices tested. Greedy Clustering with the AS index also performs 15-20% better than traditional clustering algorithms (K-means, Group Average).
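
The abstract does not spell out the greedy step, so the following is only a minimal sketch of the general idea of using a cluster validity index as the optimization function. It assumes documents are already represented as numeric vectors, uses the Davies-Bouldin (DB) index because its definition is standard, and assumes a round-robin initial assignment with single-document moves. The function names (`davies_bouldin`, `greedy_cluster`) and these design choices are illustrative assumptions, not the authors' algorithm, and the AS index is not shown because its definition is not given here.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def davies_bouldin(points, labels, k):
    """Davies-Bouldin index (lower is better); inf for degenerate partitions."""
    clusters = [[p for p, lab in zip(points, labels) if lab == c] for c in range(k)]
    if any(not members for members in clusters):
        return math.inf  # an empty cluster is treated as invalid
    centers = [centroid(members) for members in clusters]
    # Scatter: mean Euclidean distance from each member to its cluster centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in members) / len(members)
        for members, ctr in zip(clusters, centers)
    ]
    total = 0.0
    for i in range(k):
        worst = 0.0
        for j in range(k):
            if i == j:
                continue
            gap = math.dist(centers[i], centers[j])
            if gap == 0.0:
                return math.inf  # coincident centroids: no separation at all
            worst = max(worst, (scatter[i] + scatter[j]) / gap)
        total += worst
    return total / k


def greedy_cluster(points, k, max_passes=20):
    """Greedy local search (assumed scheme): move one document at a time to
    the cluster that most lowers the index, until a full pass moves nothing."""
    labels = [i % k for i in range(len(points))]  # assumed round-robin start
    best = davies_bouldin(points, labels, k)
    for _ in range(max_passes):
        moved = False
        for i in range(len(points)):
            original = labels[i]
            best_label = original
            for c in range(k):
                if c == original:
                    continue
                labels[i] = c
                score = davies_bouldin(points, labels, k)
                if score < best:  # accept only strict improvements
                    best, best_label = score, c
            labels[i] = best_label
            moved = moved or best_label != original
        if not moved:
            break
    return labels, best


# Toy one-dimensional "documents" (hypothetical data, not Reuters-21578).
docs = [[0.0], [1.0], [10.0], [11.0]]
labels, score = greedy_cluster(docs, k=2)
```

Because each accepted move strictly lowers the index, the search stops at a local optimum of the chosen index. Swapping in a different validity index only requires replacing `davies_bouldin` with another scoring function, flipping the comparison if higher values are better.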



Paper Citation


in Harvard Style

Choi L. and Park S. (2012). GREEDY APPROACH FOR DOCUMENT CLUSTERING. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM, ISBN 978-989-8425-99-7, pages 597-600. DOI: 10.5220/0003836605970600


in Bibtex Style

@conference{icpram12,
author={Lim Choen Choi and Soon Cheol Park},
title={GREEDY APPROACH FOR DOCUMENT CLUSTERING},
booktitle={Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM},
year={2012},
pages={597-600},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003836605970600},
isbn={978-989-8425-99-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM
TI - GREEDY APPROACH FOR DOCUMENT CLUSTERING
SN - 978-989-8425-99-7
AU - Choi L.
AU - Park S.
PY - 2012
SP - 597
EP - 600
DO - 10.5220/0003836605970600