Kernel Hierarchical Agglomerative Clustering - Comparison of Different Gap Statistics to Estimate the Number of Clusters

Na Li, Nicolas Lefebvre, Régis Lengellé

2014

Abstract

Clustering algorithms, as unsupervised analysis tools, are useful for exploring data structure and have enjoyed great success in many disciplines. For most clustering algorithms, such as k-means, determining the number of clusters is a crucial step and one of the most difficult problems. Hierarchical Agglomerative Clustering (HAC) has the advantage of representing the data as a dendrogram, which allows clusters to be obtained by cutting the dendrogram at some optimal level. In recent years, within the context of HAC, efficient statistics have been proposed to estimate the number of clusters, and the Gap Statistic of Tibshirani et al. has shown promising performance. In this paper, we propose new Gap Statistics to further improve the estimation of the number of clusters. Our work focuses on the kernelized version of the widely used hierarchical clustering algorithm.
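
For reference, the original Gap Statistic of Tibshirani et al. (2001) and the kernel-induced distance used when clustering in feature space can be written as below. This is a sketch in standard notation, not necessarily the paper's own: the symbols $W_k$, $D_r$, $n_r$, $B$, $\mathrm{sd}_k$, the feature map $\phi$ and the kernel $\kappa$ are assumptions introduced here, and the new Gap Statistics proposed in the paper are not reproduced.

```latex
% Within-cluster dispersion for a partition into k clusters C_1, ..., C_k,
% with n_r = |C_r| and d_{ii'} the squared distance between points i and i'.
D_r = \sum_{i, i' \in C_r} d_{ii'}, \qquad
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r

% Gap Statistic: expected log-dispersion under a null reference distribution
% (estimated from B reference data sets) minus the observed log-dispersion.
\mathrm{Gap}_n(k) = E_n^{*}\left[ \log W_k \right] - \log W_k

% Estimated number of clusters, with s_k = \mathrm{sd}_k \sqrt{1 + 1/B}.
\hat{k} = \min \left\{ k : \mathrm{Gap}_n(k) \ge \mathrm{Gap}_n(k+1) - s_{k+1} \right\}

% Kernelized version (assumed notation): squared distance in feature space,
% computed from kernel evaluations only.
d_{ii'} = \lVert \phi(x_i) - \phi(x_{i'}) \rVert^2
        = \kappa(x_i, x_i) - 2 \kappa(x_i, x_{i'}) + \kappa(x_{i'}, x_{i'})
```

With the last identity, $W_k$ can be evaluated from the kernel matrix alone, so a Gap Statistic of this form can in principle be computed for kernel HAC without an explicit feature map.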

References

  1. Aizerman, A., Braverman, E. M., and Rozoner, L. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and remote control, 25:821-837.
  2. Barbará, D. and Jajodia, S. (2002). Applications of data mining in computer security, volume 6. Springer.
  3. Bezdek, J. C., Ehrlich, R., and Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2):191-203.
  4. Caliński, T. and Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics-theory and Methods, 3(1):1-27.
  5. Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273-297.
  6. Driver, H. E. and Kroeber, A. L. (1932). Quantitative expression of cultural relationships. University of California Press.
  7. Filippone, M., Camastra, F., Masulli, F., and Rovetta, S. (2008). A survey of kernel and spectral methods for clustering. Pattern Recognition, 41(1):176-190.
  8. Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8):578-588.
  9. Girolami, M. (2002). Mercer kernel-based clustering in feature space. Neural Networks, IEEE Transactions on, 13(3):780-784.
  10. Gordon, A. D. (1996). Null models in cluster validation. In From data to knowledge, pages 32-44. Springer.
  11. Harman, H. H. (1960). Modern factor analysis.
  12. Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 99th edition.
  13. Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis, volume 344. Wiley-Interscience.
  14. Kim, D.-W., Lee, K. Y., Lee, D., and Lee, K. H. (2005). Evaluation of the performance of clustering algorithms in kernel-induced feature space. Pattern Recognition, 38(4):607-611.
  15. Krzanowski, W. J. and Lai, Y. (1988). A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics, pages 23-34.
  16. MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, page 14. California, USA.
  17. Mercer, J. (1909). Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical transactions of the royal society of London. Series A, containing papers of a mathematical or physical character, 209:415-446.
  18. Milligan, G. W. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159-179.
  19. Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. Neural Networks, IEEE Transactions on, 12(2):181-201.
  20. Murtagh, F. (1983). A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354-359.
  21. Olson, C. F. (1995). Parallel algorithms for hierarchical clustering. Parallel computing, 21(8):1313-1325.
  22. Príncipe, J. C., Liu, W., and Haykin, S. (2011). Kernel Adaptive Filtering: A Comprehensive Introduction, volume 57. John Wiley & Sons.
  23. Qin, J., Lewis, D. P., and Noble, W. S. (2003). Kernel hierarchical gene clustering from microarray expression data. Bioinformatics, 19(16):2097-2104.
  24. Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319.
  25. Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J. (2002). On kernel-target alignment. Advances in neural information processing systems, 14:367.
  26. Sneath, P. H., Sokal, R. R., et al. (1973). Numerical taxonomy. The principles and practice of numerical classification.
  27. Sugar, C. A. and James, G. M. (2003). Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463).
  28. Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411-423.
  29. Vapnik, V. (2000). The Nature of Statistical Learning Theory. Springer.


Paper Citation


in Harvard Style

Li N., Lefebvre N. and Lengellé R. (2014). Kernel Hierarchical Agglomerative Clustering - Comparison of Different Gap Statistics to Estimate the Number of Clusters. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-018-5, pages 255-262. DOI: 10.5220/0004828202550262


in Bibtex Style

@conference{icpram14,
author={Na Li and Nicolas Lefebvre and Régis Lengellé},
title={Kernel Hierarchical Agglomerative Clustering - Comparison of Different Gap Statistics to Estimate the Number of Clusters},
booktitle={Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2014},
pages={255-262},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004828202550262},
isbn={978-989-758-018-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Kernel Hierarchical Agglomerative Clustering - Comparison of Different Gap Statistics to Estimate the Number of Clusters
SN - 978-989-758-018-5
AU - Li N.
AU - Lefebvre N.
AU - Lengellé R.
PY - 2014
SP - 255
EP - 262
DO - 10.5220/0004828202550262