A Study on the Relationship between Internal and External Validity Indices Applied to Partitioning and Density-based Clustering Algorithms

Caroline Tomasini, Eduardo N. Borges, Karina Machado, Leonardo Emmendorfer

2017

Abstract

Measuring the quality of data partitions is essential to the success of clustering applications. Many validity indices have been proposed in the literature, but choosing the appropriate index for evaluating the results of a particular clustering algorithm remains a challenge. Clustering results can be evaluated using indices based on external or internal criteria. An external criterion requires a previously defined partitioning of the data for comparison with the clustering results, while an internal criterion evaluates clustering results considering only the properties of the data. This paper proposes a method that helps the user select the most suitable internal cluster validity index for the results of partitioning and density-based clustering algorithms. We investigate the relationships between internal and external indices, relating them through linear regression and regression model trees. Each algorithm was run over synthetic datasets generated for this purpose, using different configurations. Experimental results indicate that Silhouette and Gamma are the most suitable indices for evaluating both datasets with the compactness property and datasets with multiple densities.
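
The idea of relating an internal index to an external one through linear regression can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the Rand index as the external criterion, the mean Silhouette width as the internal criterion, and the external index as the regression target. The toy points and the hand-made candidate partitions are hypothetical stand-ins for the clustering runs over synthetic datasets described above.

```python
from itertools import combinations
from math import dist, sqrt


def rand_index(truth, pred):
    """External index: fraction of point pairs on which two partitions agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)


def silhouette(points, labels):
    """Internal index: mean Silhouette width (Euclidean distance).

    Assumes the partition has at least two clusters.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster contributes a width of 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def linear_fit(xs, ys):
    """Least-squares fit of y = slope * x + intercept, plus Pearson's r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / sqrt(sxx * syy)


# Hypothetical toy data: two compact, well-separated groups of four points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
truth = [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical candidate partitions of decreasing quality.
candidates = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # matches the reference partition
    [0, 0, 0, 1, 1, 1, 1, 1],  # one point misassigned
    [0, 0, 1, 1, 1, 1, 1, 1],  # two points misassigned
    [0, 1, 0, 1, 0, 1, 0, 1],  # ignores the spatial structure
]

internal = [silhouette(points, c) for c in candidates]
external = [rand_index(truth, c) for c in candidates]
slope, intercept, r = linear_fit(internal, external)

for sil, ri in zip(internal, external):
    print(f"silhouette={sil:.3f}  rand={ri:.3f}")
print(f"rand = {slope:.3f} * silhouette + {intercept:.3f}  (r = {r:.3f})")
```

On data like this, a strong positive fit would suggest that the internal index tracks the external one and could stand in for it when no reference partition is available. Four partitions of one toy dataset are far too few to support any real conclusion.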

Paper Citation


in Harvard Style

Tomasini C., Borges E. N., Machado K. and Emmendorfer L. (2017). A Study on the Relationship between Internal and External Validity Indices Applied to Partitioning and Density-based Clustering Algorithms. In Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-247-9, pages 89-98. DOI: 10.5220/0006317000890098


in Bibtex Style

@conference{iceis17,
author={Caroline Tomasini and Eduardo N. Borges and Karina Machado and Leonardo Emmendorfer},
title={A Study on the Relationship between Internal and External Validity Indices Applied to Partitioning and Density-based Clustering Algorithms},
booktitle={Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2017},
pages={89-98},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006317000890098},
isbn={978-989-758-247-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - A Study on the Relationship between Internal and External Validity Indices Applied to Partitioning and Density-based Clustering Algorithms
SN - 978-989-758-247-9
AU - Tomasini C.
AU - Borges E. N.
AU - Machado K.
AU - Emmendorfer L.
PY - 2017
SP - 89
EP - 98
DO - 10.5220/0006317000890098