Data Clustering Validation using Constraints

João M. N. Duarte, Ana L. N. Fred, F. Jorge F. Duarte

Abstract

Much attention has been given to incorporating constraints into data clustering, mainly expressed as must-link and cannot-link constraints between pairs of domain objects. However, their inclusion in the important clustering validation process has so far been disregarded. In this work, we integrate the use of constraints into clustering validation. We propose three approaches: producing a weighted validity score that combines a traditional validity index with the constraint satisfaction ratio; learning a new distance function or feature space representation that better suits the constraints and using it with a validation index; and a combination of the two. Experimental results on 14 synthetic and real data sets show that including the information provided by the constraints improves the performance of the clustering validation process in selecting the best number of clusters.
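The first approach, a weighted validity score, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the convex combination below, the weight `alpha`, and both function names are assumptions made for illustration only; the validity index value is assumed to be already normalized to [0, 1].

```python
from typing import Sequence, Tuple

Pair = Tuple[int, int]  # indices of two objects in the data set


def constraint_satisfaction_ratio(
    labels: Sequence[int],
    must_link: Sequence[Pair],
    cannot_link: Sequence[Pair],
) -> float:
    """Fraction of pairwise constraints that a partition satisfies."""
    # A must-link constraint holds when both objects share a cluster label.
    satisfied = sum(labels[i] == labels[j] for i, j in must_link)
    # A cannot-link constraint holds when the labels differ.
    satisfied += sum(labels[i] != labels[j] for i, j in cannot_link)
    total = len(must_link) + len(cannot_link)
    # With no constraints, nothing is violated (a convention assumed here).
    return satisfied / total if total else 1.0


def weighted_validity(index_value: float, satisfaction: float, alpha: float = 0.5) -> float:
    """Hypothetical convex combination of a validity index and the ratio."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * index_value + (1.0 - alpha) * satisfaction


labels = [0, 0, 1, 1, 2]
must_link = [(0, 1), (2, 4)]
cannot_link = [(0, 2), (1, 3)]
ratio = constraint_satisfaction_ratio(labels, must_link, cannot_link)  # 0.75
score = weighted_validity(0.6, ratio, alpha=0.5)
```

Under these assumptions, one would compute such a score for each candidate partition (for example, one per number of clusters) and pick the partition with the highest value.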

Paper Citation


in Harvard Style

Duarte, J. M. N., Fred, A. L. N. and Duarte, F. J. F. (2013). Data Clustering Validation using Constraints. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013) ISBN 978-989-8565-75-4, pages 17-27. DOI: 10.5220/0004543800170027


in Bibtex Style

@conference{kdir13,
author={João M. N. Duarte and Ana L. N. Fred and F. Jorge F. Duarte},
title={Data Clustering Validation using Constraints},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013)},
year={2013},
pages={17-27},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004543800170027},
isbn={978-989-8565-75-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013)
TI - Data Clustering Validation using Constraints
SN - 978-989-8565-75-4
AU - Duarte, João M. N.
AU - Fred, Ana L. N.
AU - Duarte, F. Jorge F.
PY - 2013
SP - 17
EP - 27
DO - 10.5220/0004543800170027