A Semi-supervised Learning Framework to Cluster Mixed Data Types

Artur Abdullin, Olfa Nasraoui

Abstract

We propose a semi-supervised framework for clustering data that comes in diverse formats or has mixed-type attributes. Preliminary results on data with mixed numerical and categorical attributes show that the proposed framework gives better clustering results in the categorical domain. This indicates that the seeds obtained from clustering the numerical domain provide additional knowledge to the categorical clustering algorithm. Additional results show that our approach has the potential to outperform both clustering either domain on its own and clustering both domains after converting them to the same target domain.
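
The seeding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithms, so the choice of k-means for the numerical attributes, a seeded k-modes for the categorical attributes, the farthest-first initialisation, and all function names and toy data below are assumptions made for illustration only.

```python
from collections import Counter

def sqdist(a, b):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hamming(a, b):
    """Number of categorical attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Lloyd's k-means with farthest-first initialisation (assumed here
    only to keep the sketch deterministic)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sqdist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def pick_seeds(points, labels, centroids, per_cluster):
    """For each numeric cluster, return the indices of the records
    closest to its centroid; these act as seeds for the other domain."""
    seeds = {}
    for c, centroid in enumerate(centroids):
        idx = [i for i, l in enumerate(labels) if l == c]
        idx.sort(key=lambda i: sqdist(points[i], centroid))
        seeds[c] = idx[:per_cluster]
    return seeds

def mode_of(rows):
    """Attribute-wise most frequent value of a list of categorical records."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]

def seeded_kmodes(rows, seeds, iters=20):
    """k-modes whose initial modes come from the seed records rather
    than from a random draw."""
    modes = {c: mode_of([rows[i] for i in idx]) for c, idx in seeds.items()}
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(modes, key=lambda c: hamming(r, modes[c])) for r in rows]
        for c in modes:
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                modes[c] = mode_of(members)
    return labels

# Hypothetical toy records: the same six objects described in two domains.
numeric = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
categorical = [["red", "small", "round"], ["red", "small", "round"],
               ["red", "large", "round"], ["blue", "large", "square"],
               ["blue", "large", "square"], ["blue", "small", "square"]]

num_labels, centroids = kmeans(numeric, k=2)
seeds = pick_seeds(numeric, num_labels, centroids, per_cluster=2)
cat_labels = seeded_kmodes(categorical, seeds)
print(num_labels, seeds, cat_labels)
```

Only the record indices cross from one domain to the other in this sketch, so neither attribute type has to be converted into the other's representation.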



Paper Citation


in Harvard Style

Abdullin A. and Nasraoui O. (2012). A Semi-supervised Learning Framework to Cluster Mixed Data Types. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 45-54. DOI: 10.5220/0004134300450054


in Bibtex Style

@conference{kdir12,
author={Artur Abdullin and Olfa Nasraoui},
title={A Semi-supervised Learning Framework to Cluster Mixed Data Types},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={45-54},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004134300450054},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - A Semi-supervised Learning Framework to Cluster Mixed Data Types
SN - 978-989-8565-29-7
AU - Abdullin A.
AU - Nasraoui O.
PY - 2012
SP - 45
EP - 54
DO - 10.5220/0004134300450054