GREEDY APPROACH FOR DOCUMENT CLUSTERING
Lim Choen Choi, Soon Cheol Park
2012
Abstract
This paper proposes a greedy algorithm for document clustering (Greedy Clustering). Several cluster validity indices (DB, CH, SD, AS) are evaluated to find the most appropriate optimization function for Greedy Clustering. The clustering algorithms are tested and compared on the Reuters-21578 collection. Across the experiments, the AS index gives the best clustering performance and the shortest running time among the indices tested. Greedy Clustering with the AS index also performs 15-20% better than traditional clustering algorithms (K-means, Group Average).
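The abstract does not spell out the greedy procedure, so the following is only a minimal sketch of how a validity index can serve as the optimization function of a greedy search. It uses the Davies-Bouldin (DB) index, where lower values are better, and assumes Euclidean distance, a round-robin starting assignment, and a "move one document at a time" step. The names `davies_bouldin` and `greedy_clustering` and all of these choices are illustrative assumptions, not the authors' method.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def davies_bouldin(points, labels, k):
    """DB index of a labelling; lower is better, inf marks an invalid one."""
    clusters = [[p for p, l in zip(points, labels) if l == c] for c in range(k)]
    if any(len(c) == 0 for c in clusters):
        return math.inf  # assumption: empty clusters are treated as invalid
    cents = [centroid(c) for c in clusters]
    scatter = [
        sum(dist(p, cents[i]) for p in clusters[i]) / len(clusters[i])
        for i in range(k)
    ]
    total = 0.0
    for i in range(k):
        worst = 0.0
        for j in range(k):
            if j == i:
                continue
            d = dist(cents[i], cents[j])
            if d == 0:
                return math.inf  # coincident centroids: no separation at all
            worst = max(worst, (scatter[i] + scatter[j]) / d)
        total += worst
    return total / k


def greedy_clustering(points, k, max_passes=20):
    """Greedy local search: move a point only if the DB index strictly drops."""
    labels = [i % k for i in range(len(points))]  # assumed round-robin start
    best = davies_bouldin(points, labels, k)
    for _ in range(max_passes):
        improved = False
        for idx in range(len(points)):
            original = labels[idx]
            best_label = original
            for c in range(k):
                if c == original:
                    continue
                labels[idx] = c  # tentatively move the point to cluster c
                score = davies_bouldin(points, labels, k)
                if score < best:
                    best, best_label = score, c
            labels[idx] = best_label  # keep the best move, or revert
            if best_label != original:
                improved = True
        if not improved:
            break  # no single move improves the index any further
    return labels, best


if __name__ == "__main__":
    docs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(greedy_clustering(docs, 2))
```

Because each move is accepted only when the index strictly improves, the search ends at a local optimum of the chosen index, which need not be the global one. Swapping `davies_bouldin` for another index changes the objective without touching the search loop; an index where higher is better would need the comparison flipped. Real document vectors would typically be high-dimensional term weights rather than the 2-D toy points used here.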
Paper Citation
in Harvard Style
Choi L. and Park S. (2012). GREEDY APPROACH FOR DOCUMENT CLUSTERING. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM, ISBN 978-989-8425-99-7, pages 597-600. DOI: 10.5220/0003836605970600
in Bibtex Style
@conference{icpram12,
author={Lim Choen Choi and Soon Cheol Park},
title={GREEDY APPROACH FOR DOCUMENT CLUSTERING},
booktitle={Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM},
year={2012},
pages={597-600},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003836605970600},
isbn={978-989-8425-99-7},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM
TI - GREEDY APPROACH FOR DOCUMENT CLUSTERING
SN - 978-989-8425-99-7
AU - Choi L.
AU - Park S.
PY - 2012
SP - 597
EP - 600
DO - 10.5220/0003836605970600