Model Selection and Stability in Spectral Clustering

Zeev Volkovich, Renata Avros

Abstract

We study an open problem in spectral clustering: automatically determining the number of clusters. We generalize the scale-parameter selection method proposed in the Ng-Jordan-Weiss (NJW) algorithm and reveal a connection with distance learning methodology. The values of the scaling parameter estimated by clustering samples drawn from the data are treated as a cluster stability indicator: the number of clusters whose estimates have the most concentrated distribution is accepted as the “true” number of clusters. Numerical experiments demonstrate the high potential of the proposed method.
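
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes NumPy, a Gaussian affinity, the NJW rule of picking the scale that gives the tightest k-means clusters in the spectral embedding, and the standard deviation of the selected scales as the concentration measure (the paper may well use a different one). All function names and parameters (`njw_embedding`, `select_sigma`, `sigma_spread`, `estimate_k`, `n_samples`, `sample_size`) are hypothetical.

```python
import numpy as np

def njw_embedding(X, k, sigma):
    """NJW embedding: top-k eigenvectors of D^{-1/2} A D^{-1/2}, rows scaled to unit length."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-sq / (2.0 * sigma ** 2))       # Gaussian affinity
    np.fill_diagonal(A, 0.0)                   # NJW sets A_ii = 0
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(A.sum(axis=1), 1e-12))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    Y = vecs[:, -k:]                           # k largest eigenvectors
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    return Y / np.maximum(norms, 1e-12)

def kmeans_distortion(Y, k, rng, n_iter=50):
    """Mean squared distance to the nearest center after plain Lloyd iterations."""
    centers = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):            # leave an empty cluster's center unchanged
                centers[j] = Y[labels == j].mean(axis=0)
    d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def select_sigma(X, k, sigmas, rng):
    """NJW-style choice: the sigma whose embedding yields the least distorted clusters."""
    scores = [kmeans_distortion(njw_embedding(X, k, s), k, rng) for s in sigmas]
    return sigmas[int(np.argmin(scores))]

def sigma_spread(X, k, sigmas, n_samples=10, sample_size=60, seed=0):
    """Spread of the sigma selected on random subsamples (assumed concentration measure)."""
    rng = np.random.default_rng(seed)
    chosen = []
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        chosen.append(select_sigma(X[idx], k, sigmas, rng))
    return float(np.std(chosen))               # smaller means more concentrated

def estimate_k(X, k_range, sigmas, **kwargs):
    """Return the k with the most concentrated sigma estimates, plus all spreads."""
    spreads = {k: sigma_spread(X, k, sigmas, **kwargs) for k in k_range}
    return min(spreads, key=spreads.get), spreads
```

Calling `estimate_k(X, range(2, 7), [0.25, 0.5, 1.0, 2.0])` on an `(n, d)` array `X` with at least `sample_size` rows returns the candidate number of clusters with the smallest spread together with the spread for every candidate. A coarse grid of scales produces many ties, so in practice a finer grid and more subsamples would be needed.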

References

  1. Barzily, Z., Volkovich, Z., Akteko-Ozturk, B., and Weber, G.-W. (2009). On a minimal spanning tree approach in the cluster validation problem. Informatica, 20(2):187-202.
  2. Ben-Hur, A., Elisseeff, A., and Guyon, I. (2002). A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing, pages 6-17.
  3. Ben-Hur, A. and Guyon, I. (2003). Detecting stable clusters using principal component analysis. In Brownstein, M. and Khodursky, A., editors, Methods in Molecular Biology, pages 159-182. Humana Press.
  4. Calinski, R. and Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3:1-27.
  5. Celeux, G. and Govaert, G. (1992). A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14:315-332.
  6. Chung, F. R. K. (1997). Spectral Graph Theory. AMS Press, Providence, R.I.
  7. Dasgupta, S. and Ng, V. (2009). Mine the easy, classify the hard: a semi-supervised approach to automatic sentiment classification. In ACL-IJCNLP 2009: Proceedings of the Main Conference, pages 701-709.
  8. Dhillon, I., Kogan, J., and Nicholas, C. (2003). Feature selection and document clustering. In Berry, M., editor, A Comprehensive Survey of Text Mining, pages 73-100. Springer, Berlin Heidelberg New York.
  9. Dhillon, I. S., Guan, Y., and Kulis, B. (2004). Kernel k-means, spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 551-556.
  10. Dhillon, I. S. and Modha, D. S. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143-175. Also appears as IBM Research Report RJ 10147, July 1999.
  11. Ding, C., He, X., and Simon, H. D. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the fifth SIAM international conference on data mining, volume 4, pages 606-610.
  12. Dudoit, S. and Fridlyand, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol., 3(7).
  13. Dunn, J. C. (1974). Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4:95-104.
  14. Filippone, M., Camastra, F., Masulli, F., and Rovetta, S. (2008). A survey of kernel and spectral methods for clustering. Pattern Recognition, 41(1):176-190.
  15. Forgy, E. W. (1965). Cluster analysis of multivariate data - efficiency vs interpretability of classifications. Biometrics, 21(3):768-769.
  16. Fortunato, S. (2010). Community detection in graphs. Phys. Rep., 486(3-5):75-174.
  17. Gordon, A. D. (1994). Identifying genuine clusters in a classification. Computational Statistics and Data Analysis, 18:561-581.
  18. Gordon, A. D. (1999). Classification. Chapman and Hall, CRC, Boca Raton, FL.
  19. Hartigan, J. A. (1985). Statistical theory in clustering. J. Classification, 2:63-76.
  20. Hubert, L. and Schultz, J. (1974). Quadratic assignment as a general data-analysis strategy. Br. J. Math. Statist. Psychol., 76:190-241.
  21. Jain, A. and Dubes, R. (1988). Algorithms for Clustering Data. Englewood Cliffs, Prentice-Hall, New Jersey.
  22. Jain, A. K. and Moreau, J. V. (1987). Bootstrap technique in cluster analysis. Pattern Recognition, 20(5):547-568.
  23. Kulis, B., Basu, S., Dhillon, I., and Mooney, R. J. (2005). Semi-supervised graph clustering: A kernel approach. In Proceedings of the 22nd International Conference on Machine Learning, pages 457-464, Bonn, Germany.
  24. Levine, E. and Domany, E. (2001). Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13:2573-2593.
  25. Liu, X., Yu, S., Moreau, Y., Moor, B. D., Glanzel, W., and Janssens, F. A. L. (2009). Hybrid clustering of text mining and bibliometrics applied to journal sets. In SDM'09, pages 49-60.
  26. Luxburg, U. V. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416.
  27. MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. Berkeley, University of California Press.
  28. McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. Wiley.
  29. Milligan, G. and Cooper, M. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50:159-179.
  30. Mohar, B. (1997). Some applications of Laplace eigenvalues of graphs. G. Hahn and G. Sabidussi (Eds.), Graph Symmetry: Algebraic Methods and Applications, Springer.
  31. Mufti, G. B., Bertrand, P., and Moubarki, E. (2005). Determining the number of groups from measures of cluster validity. In Proceedings of ASMDA 2005, pages 404-414.
  32. Nascimento, M. and Carvalho, A. D. (2011). Spectral methods for graph clustering - a survey. European Journal of Operational Research, 211(2):221-231.
  33. Ng, A. Y., Jordan, M. I., and Weiss, Y. (2001). On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14 (NIPS 2001), pages 849-856.
  34. Roth, V., Lange, V., Braun, M., and Buhmann, J. (2002). A resampling approach to cluster validation. In COMPSTAT, available at http://www.cs.uni-bonn.de/~braunm.
  35. Roth, V., Lange, V., Braun, M., and Buhmann, J. (2004). Stability-based validation of clustering solutions. Neural Computation, 16(6):1299-1323.
  36. Shi, J. and Malik, J. (2000). Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905.
  37. Spielman, D. A. (2012). Spectral graph theory. U. Naumann and O. Schenk (Eds.), Combinatorial Scientific Computing, Chapman & Hall/CRC Computational Science.
  38. Sugar, C. and James, G. (2003). Finding the number of clusters in a data set: An information theoretic approach. J. of the American Statistical Association, 98:750-763.
  39. Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters via the gap statistic. J. Royal Statist. Soc. B, 63(2):411-423.
  40. Toledano-Kitai, D., Avros, R., and Volkovich, Z. (2011). A fractal dimension standpoint to the cluster validation problem. International Journal of Pure and Applied Mathematics, 68(2):233-252.
  41. Volkovich, V., Kogan, J., and Nicholas, C. (2004). k-means initialization by sampling large datasets. In Dhillon, I. and Kogan, J., editors, Proceedings of the Workshop on Clustering High Dimensional Data and its Applications (held in conjunction with SDM 2004), pages 17-22.
  42. Volkovich, Z., Barzily, Z., and Morozensky, L. (2008). A statistical model of cluster stability. Pattern Recognition, 41(7):2174-2188.
  43. Wechsler, H. (2010). Intelligent biometric information management. Intelligent Information Management, 2:499-511.
  44. White, S. and Smyth, P. (2005). A spectral clustering approach to finding communities in graphs. In Proceedings of the fifth SIAM international conference on data mining, volume 119, pages 274-285. Society for Industrial Mathematics.
  45. Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. (2002). Distance metric learning, with application to clustering with side-information. Advances in Neural Information Processing Systems 15 (NIPS 2002), pages 505-512.
  46. Yu, S. X. and Shi, J. (2003). Multiclass spectral clustering. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 1, pages 313-319.
  47. Zelnik-Manor, L. and Perona, P. (2004). Self-tuning spectral clustering. In Advances in Neural Information Processing Systems 17, pages 1601-1608. MIT Press.

Paper Citation


in Harvard Style

Volkovich Z. and Avros R. (2012). Model Selection and Stability in Spectral Clustering. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 25-34. DOI: 10.5220/0004132700250034


in Bibtex Style

@conference{kdir12,
author={Zeev Volkovich and Renata Avros},
title={Model Selection and Stability in Spectral Clustering},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={25-34},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004132700250034},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Model Selection and Stability in Spectral Clustering
SN - 978-989-8565-29-7
AU - Volkovich Z.
AU - Avros R.
PY - 2012
SP - 25
EP - 34
DO - 10.5220/0004132700250034