CBK-Modes: A Correlation-based Algorithm for Categorical Data Clustering

Joel Luis Carbonera, Mara Abel

Abstract

Categorical datasets are often high-dimensional. To handle high dimensionality during clustering, some approaches exploit the fact that clusters usually occur in a subspace of the attributes. Soft subspace clustering approaches assign a different weight to each attribute in each cluster, measuring that attribute's contribution to the formation of the cluster. In this paper, we adopt an approach that uses the correlation among categorical attributes to measure their relevance in clustering tasks. We use this approach to develop CBK-Modes (Correlation-based K-modes), a soft subspace clustering algorithm that extends the basic k-modes algorithm with correlation-based attribute relevance. We conducted experiments on five real-world datasets, comparing our algorithm with five state-of-the-art algorithms using three well-known evaluation metrics: accuracy, F-measure, and adjusted Rand index. The results show that CBK-Modes outperforms the other algorithms evaluated on these metrics.
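
The soft subspace idea described in the abstract can be illustrated with a minimal sketch of one weighted k-modes iteration. This is not the authors' implementation: the abstract does not give the correlation-based weighting formula, so the per-cluster weights here are supplied by hand, and the names `weighted_mismatch`, `assign`, and `update_modes` are hypothetical. The sketch only shows how per-cluster attribute weights change the simple-matching dissimilarity used by basic k-modes.

```python
from collections import Counter


def weighted_mismatch(obj, mode, weights):
    """Sum the weights of the attributes on which obj and mode differ."""
    return sum(w for x, m, w in zip(obj, mode, weights) if x != m)


def assign(objects, modes, weights):
    """Assign each object to the cluster with the nearest mode.

    weights[k] holds the attribute weights of cluster k. In CBK-Modes these
    would come from attribute correlations; here they are an assumed input.
    """
    return [
        min(range(len(modes)),
            key=lambda k: weighted_mismatch(obj, modes[k], weights[k]))
        for obj in objects
    ]


def update_modes(objects, labels, old_modes):
    """Recompute each mode as the most frequent value of every attribute."""
    new_modes = []
    for k, old in enumerate(old_modes):
        members = [o for o, label in zip(objects, labels) if label == k]
        if not members:
            # Keep the previous mode when a cluster loses all its members.
            new_modes.append(old)
            continue
        new_modes.append(
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
        )
    return new_modes


# Toy data: the first two attributes separate the groups, the third is noise.
objects = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q")]
modes = [("a", "x", "p"), ("b", "y", "q")]
# Hand-picked weights that down-weight the noisy third attribute.
weights = [[0.45, 0.45, 0.10], [0.45, 0.45, 0.10]]

labels = assign(objects, modes, weights)
modes = update_modes(objects, labels, modes)
print(labels, modes)
```

In a full algorithm, the assignment, mode update, and weight update would repeat until the labels stop changing; only the weight update would need to be replaced with the correlation-based measure from the paper.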



Paper Citation


in Harvard Style

Carbonera J. and Abel M. (2015). CBK-Modes: A Correlation-based Algorithm for Categorical Data Clustering. In Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-096-3, pages 603-608. DOI: 10.5220/0005367106030608


in Bibtex Style

@conference{iceis15,
author={Joel Luis Carbonera and Mara Abel},
title={CBK-Modes: A Correlation-based Algorithm for Categorical Data Clustering},
booktitle={Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2015},
pages={603-608},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005367106030608},
isbn={978-989-758-096-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - CBK-Modes: A Correlation-based Algorithm for Categorical Data Clustering
SN - 978-989-758-096-3
AU - Carbonera J.
AU - Abel M.
PY - 2015
SP - 603
EP - 608
DO - 10.5220/0005367106030608