Paired Indices for Clustering Evaluation - Correction for Agreement by Chance

Maria José Amorim, Margarida G. M. S. Cardoso

2014

Abstract

In this paper we assess the performance of clustering algorithms using indices of paired agreement, which measure the accordance between the clusters obtained and an a priori known structure. Specifically, we propose a method to correct all the indices considered for agreement by chance; the adjusted indices are meant to provide a realistic measure of clustering performance. The proposed method enables the correction of virtually any index, overcoming limitations previously reported in the literature, and provides very precise results. We use simulated datasets under diverse scenarios and discuss the pertinence of our proposal, which is particularly relevant when poorly separated clusters are considered. Finally, we compare the performance of the EM and K-Means algorithms within each of the simulated scenarios and conclude that EM generally yields the best results.
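
As an illustration of what correcting a paired index for chance can look like, the sketch below computes the Rand index over all object pairs and rescales it as (observed - expected) / (maximum - expected). It is a minimal sketch under stated assumptions, not the authors' procedure, which the abstract does not detail: the expected value is estimated here by randomly permuting the cluster labels, the maximum of the index is assumed to be 1, and the names `rand_index` and `chance_corrected` are hypothetical.

```python
import random
from itertools import combinations


def rand_index(a, b):
    """Fraction of object pairs on which two partitions agree
    (both put the pair together, or both keep it apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def chance_corrected(index, truth, found, n_sim=500, seed=0):
    """Rescale an index as (observed - expected) / (1 - expected).

    Assumptions (not taken from the paper): the index has maximum 1,
    and its expected value under chance is estimated by shuffling the
    labels in `found`, which keeps the cluster sizes fixed.
    """
    rng = random.Random(seed)
    observed = index(truth, found)
    shuffled = list(found)
    total = 0.0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        total += index(truth, shuffled)
    expected = total / n_sim
    # Undefined when expected == 1 (e.g. both partitions have one cluster).
    return (observed - expected) / (1.0 - expected)


# Known structure: three groups of ten objects.
truth = [0] * 10 + [1] * 10 + [2] * 10
# Same grouping with different label names: perfect agreement.
relabelled = [2] * 10 + [0] * 10 + [1] * 10
# A partition unrelated to the known structure.
unrelated = [i % 3 for i in range(30)]

raw = rand_index(truth, unrelated)
adjusted = chance_corrected(rand_index, truth, unrelated)
```

In this toy case the raw Rand index of the unrelated partition stays above 0.5 even though the partition carries no information about the known structure, while the rescaled value falls close to zero. Because any pair-counting function can be passed as `index`, the same rescaling applies to other paired indices under the same assumptions.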


Paper Citation


in Harvard Style

Amorim M. and Cardoso M. G. M. S. (2014). Paired Indices for Clustering Evaluation - Correction for Agreement by Chance. In Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-027-7, pages 164-170. DOI: 10.5220/0004868301640170


in Bibtex Style

@conference{iceis14,
author={Maria José Amorim and Margarida G. M. S. Cardoso},
title={Paired Indices for Clustering Evaluation - Correction for Agreement by Chance},
booktitle={Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2014},
pages={164-170},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004868301640170},
isbn={978-989-758-027-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Paired Indices for Clustering Evaluation - Correction for Agreement by Chance
SN - 978-989-758-027-7
AU - Amorim M.
AU - Cardoso M. G. M. S.
PY - 2014
SP - 164
EP - 170
DO - 10.5220/0004868301640170