Table 3: Mean ± standard deviation of the best solution over 100 independent runs for DE-Simple Matching, DE-IOF, DE-Eskin, DE-Scaling, and DE-MSM.

| Dataset | DE-Simple Matching | DE-IOF | DE-Eskin | DE-Scaling | DE-MSM | T-test |
|---|---|---|---|---|---|---|
| Breast Cancer | 0.823201 ± 00013254 | 0.7901874 ± 0.000231 | 0.805437 ± 0.006119 | 0.82289 ± 000245 | 0.8472614 ± 0.07811E-05 | Significant |
| Zoo | 0.90132 ± 0.0002621 | 0.884791 ± 0.6119E-04 | 0.899645 ± 0.00332 | 0.908892 ± 0.002583 | 0.9435833 ± 2.52812E-06 | Significant |
| Hepatitis | 0.798517 ± 0.003213 | 0.769026 ± 0.00371 | 0.734618 ± 1.842E-04 | 0.797582 ± 0.0007739 | 0.83306326 ± 7.2235E-05 | Significant |
| Heart Diseases | 0.762825 ± 0.000765 | 0.7356806 ± 2.5723E-05 | 0.6571352 ± 0.00422 | 0.774329 ± 0.000113 | 0.82840165 ± 3.77392E-05 | Significant |
| Dermatology | 0.85060403 ± 0.000113 | 0.7285605 ± 0.00117 | 0.705437 ± 0.0005632 | 0.8505721 ± 0.00017 | 0.86351823 ± 1.4426E-04 | Significant |
| Credit | 0.9392598 ± 0.0006234 | 0.88369739 ± 0.000921 | 0.7401278 ± 3.48192E-04 | 0.940456 ± 0.000253 | 0.91358951 ± 0.000218 | Significant |
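The summary statistics and the significance column of Table 3 can be illustrated with a small sketch. It assumes the per-run accuracy scores of two methods are available as lists. The function names `summarize` and `welch_t`, the sample scores, and the choice of Welch's unequal-variance test with a normal approximation for the p-value are assumptions made for illustration; the text does not state which t-test variant was used.

```python
import math
from statistics import NormalDist, mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-run scores."""
    return mean(scores), stdev(scores)


def welch_t(a, b):
    """Welch's t statistic and an approximate two-sided p-value.

    The p-value uses a normal approximation, which is reasonable only
    for large samples (e.g. 100 runs per method); this is an assumption.
    """
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    standard_error = math.sqrt(sd_a**2 / len(a) + sd_b**2 / len(b))
    t = (mean_a - mean_b) / standard_error
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical per-run scores, NOT taken from the paper.
runs_a = [0.84 + 0.001 * (i % 5) for i in range(100)]
runs_b = [0.82 + 0.001 * (i % 7) for i in range(100)]

m, s = summarize(runs_a)
t_stat, p_value = welch_t(runs_a, runs_b)
print(f"method A: {m:.4f} ± {s:.4f}")
print(f"t = {t_stat:.2f}, significant at 0.05: {p_value < 0.05}")
```

With 100 runs per method the normal approximation is close to the exact t distribution; for smaller samples a proper t distribution would be the safer choice.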
The experimental results showed that the MSM method achieved
statistically significant accuracy gains in 80% of the tested
datasets. We then moved to an evolutionary setting using DE, in
which the similarity measures were used to compute distances and
update centers during the search process. DE improved clustering
performance compared to the non-evolutionary setting, and DE-MSM
achieved statistically significant accuracy gains in 90% of the
tested datasets compared to DE-Simple Matching, DE-IOF, DE-Eskin,
and DE-Scaling. The time and space complexity of the proposed
method was analyzed, and the comparison with the other methods
confirms its effectiveness. For future work, the proposed MSM
and/or DE-MSM methods can be used in a multiobjective data
clustering framework to deal specifically with mixed datasets.
Furthermore, the current work can be extended to data clustering
models with uncertainty.
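The distance computation mentioned above can be sketched for the simplest of the compared measures. This is a minimal illustration of simple matching dissimilarity and nearest-center assignment for categorical records, not the MSM measure or the DE search itself; the function names and the toy records are hypothetical.

```python
def simple_matching_distance(x, y):
    """Fraction of attributes on which two categorical records differ."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


def assign_to_centers(records, centers):
    """Return, for each record, the index of its nearest center."""
    return [
        min(range(len(centers)),
            key=lambda k: simple_matching_distance(record, centers[k]))
        for record in records
    ]


# Toy categorical data (hypothetical, not from the paper's datasets).
records = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "square"),
]
centers = [("red", "small", "round"), ("blue", "large", "square")]

labels = assign_to_centers(records, centers)
print(labels)
```

In an evolutionary setting, an assignment of this kind would be recomputed for each candidate set of centers so that its clustering quality can be scored.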