Table 4: Regression evaluation using DBSCAN.

Index   Regression   Correlation   RRSE
J       linear       65.48         75.58
J       M5           67.59         73.73
R       linear       72.59         68.78
R       M5           72.59         68.78
FM      linear       64.29         76.59
FM      M5           64.29         76.59
Analysis of the models indicates that Silhouette and Gamma
were the most suitable internal cluster validity indices for
evaluating the generated datasets clustered with DBSCAN. The
quality measures suggest a moderate correlation between these
internal indices and all three external indices.
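The two quality measures reported in the table can be sketched as follows. This is a minimal illustration that assumes the usual definitions: the Pearson correlation coefficient between actual and predicted values, and the root relative squared error (RRSE) measured against a baseline that always predicts the mean of the actual values. The function names are hypothetical, and the assumption that the tabulated values are these quantities scaled to percentages is ours, not stated in the text.

```python
import math


def pearson_correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_a * var_p)


def rrse(actual, predicted):
    """Root relative squared error: model error relative to the error of
    a baseline that always predicts the mean of the actual values."""
    mean_a = sum(actual) / len(actual)
    model_error = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    baseline_error = sum((a - mean_a) ** 2 for a in actual)
    return math.sqrt(model_error / baseline_error)
```

Under these definitions, an RRSE below 1 (below 100 when expressed as a percentage) means the regression model predicts the external index better than the mean baseline does.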
5 CONCLUSION
In this paper we have investigated the relationships between
internal and external clustering validity indices by learning
a set of regression models. The analysis of these models
allowed us to infer the most suitable internal index for each
clustering algorithm. The experimental results indicate that
Silhouette and Gamma were the most suitable indices for
evaluating the datasets with the compactness property using
k-means and the datasets with multiple densities using DBSCAN.
Finally, our method can be seen as a template for a general
strategy for selecting an internal validity index, in which
the specific clustering or regression algorithms may be
replaced by more effective or efficient ones in specific
scenarios. As future work, we plan to run new experiments
using different clustering algorithms and real datasets.
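The template idea can be sketched in a few lines. This is a simplified illustration and not the experimental setup of the paper: it assumes the internal and external index values have already been computed for each clustered dataset, fits one univariate least-squares line per internal index, and ranks the indices by RRSE on the same data used for fitting. The names `fit_line`, `rank_internal_indices`, and the `fit` parameter are hypothetical; `fit` is the replaceable regression component.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept.
    Returns a prediction function. Assumes x is not constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda value: slope * value + intercept


def rrse(actual, predicted):
    """Root relative squared error against the mean-of-actual baseline."""
    mean_a = sum(actual) / len(actual)
    model_error = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    baseline_error = sum((a - mean_a) ** 2 for a in actual)
    return math.sqrt(model_error / baseline_error)


def rank_internal_indices(internal, external, fit=fit_line):
    """Rank internal indices by how well each one predicts the external index.

    internal: dict mapping an index name to its values, one per dataset
    external: external index values, one per dataset, in the same order
    fit:      replaceable learner, called as fit(x, y) -> predict(value)
    Returns (name, rrse) pairs sorted from best (lowest RRSE) to worst.
    """
    scores = {}
    for name, values in internal.items():
        predict = fit(values, external)
        predicted = [predict(v) for v in values]
        # Evaluated on the fitting data for brevity; a held-out or
        # cross-validated estimate would be less optimistic.
        scores[name] = rrse(external, predicted)
    return sorted(scores.items(), key=lambda item: item[1])
```

Swapping in a different learner only requires passing another `fit` callable with the same interface, which mirrors the replaceable-component idea described above.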
A Study on the Relationship between Internal and External Validity Indices Applied to Partitioning and Density-based Clustering Algorithms