CORD: A HYBRID APPROACH FOR EFFICIENT CLUSTERING OF ORDINAL DATA USING FUZZY LOGIC AND SELF-ORGANIZING MAPS

Natascha Hoebel, Stanislav Kreuzer

Abstract

This paper presents CORD, a hybrid clustering system that combines modified versions of three modern clustering approaches so that very large sets of ordinal data can be processed efficiently. The Self-Organizing Maps algorithm for categorical data by Chen and Marques (NCSOM) performs a rough preclustering that determines the number and initial positions of the centroids. The main clustering task uses a k-modes algorithm and its fuzzy set extension, described by Kim et al., which clusters categorical data using fuzzy centroids. Finally, to handle large amounts of data, the BIRCH algorithm described by Zhang et al. for efficient clustering of very large databases (VLDBs) is adapted to ordinal data. BIRCH can serve as a preliminary phase for both Fuzzy Centroids and NCSOM. Both algorithms profit from this symbiosis because their iterative computations can then run on data held entirely in main memory. By combining these approaches, the resulting system can efficiently extract significant information even from very large datasets. The presented reference implementation of the hybrid system shows good results. The aim is to cluster and visually analyze large numbers of user profiles, which should help in understanding Web user behavior and in personalizing advertisements.
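
As a rough illustration of the main clustering stage, the following is a minimal sketch of plain (hard) k-modes in the style of Huang (1997), not the CORD implementation. It assumes that the initial modes are already given (in CORD they would come from the SOM preclustering) and that records are tuples of ordinal ratings. The names `mismatch` and `k_modes` and the sample data are hypothetical. The simple matching dissimilarity used here ignores the order of the values, and the fuzzy centroid extension by Kim et al. is not shown.

```python
from collections import Counter

def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, modes, max_iter=10):
    """Plain (hard) k-modes; `modes` holds the assumed initial centroids."""
    modes = [tuple(m) for m in modes]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest mode.
        labels = [min(range(len(modes)), key=lambda j: mismatch(r, modes[j]))
                  for r in records]
        # Update step: per attribute, take the most frequent value in the cluster.
        new_modes = []
        for j, old in enumerate(modes):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if not members:
                new_modes.append(old)  # an empty cluster keeps its old mode
                continue
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        if new_modes == modes:
            break  # modes are stable, so the assignment will not change
        modes = new_modes
    return labels, modes

# Hypothetical user profiles: three ratings per user on a 1-5 scale.
profiles = [(1, 1, 2), (1, 2, 2), (1, 1, 1), (5, 4, 5), (5, 5, 5), (4, 5, 5)]
labels, modes = k_modes(profiles, modes=[(1, 1, 1), (5, 5, 5)])
```

Because each iteration scans all records, running such a loop on a compact in-memory summary of the data, as the BIRCH phase is meant to provide, keeps the repeated passes cheap.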

References

  1. Aggarwal, G., Feder, T., and Kenthapadi, K. (2006). Achieving anonymity via clustering. In Proc. of the 25th ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 153-162, NY, USA.
  2. Braun-Blanquet, J., Conard, H. S., and Fuller, G. D. (1932). Plant sociology. McGraw-Hill book company. http://www.biodiversitylibrary.org/bibliography/7161.
  3. Burke, R. (2002). Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331-370.
  4. Cheu, E. Y., Kwoh, C. K., and Zhou, Z. (2004). On the two-level hybrid clustering algorithm. Nanyang Technological University.
  5. Chiu, T., Fang, D., Chen, J., Wang, Y., and Jeris, C. (2001). A robust and scalable clustering algorithm for mixed type attributes in large database environment. In Proceedings of the 7th ACM SIGKDD, pages 263-268, NY, USA.
  6. Asuncion, A. and Newman, D. J. (2007). UCI machine learning repository. http://archive.ics.uci.edu/ml/.
  7. Gan, G., Yang, Z., and Wu, J. (2005). A genetic k-modes algorithm for clustering categorical data. In ADMA, pages 195-202.
  8. Gugubarra (2009). Data set user profiles. www.dbis.cs.unifrankfurt.de/downloads/research/data.zip.
  9. Helmer, S. (2007). Measuring the structural similarity of semistructured documents using entropy. In Proc. of the 33rd Int. Conf. on VLDBs, pages 1022-1032.
  10. Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. In Research Issues on Data Mining and Knowledge Discovery, pages 1-8.
  11. Kim, D.-W., Lee, K. H., and Lee, D. (2004). Fuzzy clustering of categorical data using fuzzy centroids. Pattern Recogn. Lett., 25(11):1263-1271.
  12. Kossmann, D., Ramsak, F., and Rost, S. (2002). Shooting stars in the sky: an online algorithm for skyline queries. In Proc. of the 28th Int. Conf. on VLDBs, pages 275-286.
  13. Ng, R. T. and Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proc. of the 20th Int. Conf. on VLDBs, pages 144-155, San Francisco, CA, USA. Morgan Kaufmann Pub. Inc.
  14. Parmar, D., Wu, T., and Blackhurst, J. (2007). Mmr: An algorithm for clustering categorical data using rough set theory.
  15. Podani, J. (2005). Multivariate exploratory analysis of ordinal data in ecology: Pitfalls, problems and solutions. Journal of Vegetation Science, 16(5):497-510.
  16. Yin, X., Han, J., and Yu, P. S. (2007). Crossclus: user-guided multi-relational clustering. Data Min. Knowl. Discov., 15(3):321-348.
  17. Zhang, T., Ramakrishnan, R., and Livny, M. (1996). Birch: An efficient data clustering method for vldbs. In Proc. of the ACM SIGMOD, pages 103-114, Montreal, Canada.
  18. Zicari, R. V., Hoebel, N., Kaufmann, S., and Tolle, K. (2006). The design of gugubarra 2.0: A tool for building and managing profiles of web users. In Proc. of the IEEE/WIC/ACM Int. Conf. on Web Intelligence, pages 317-320, Washington, DC, USA.

Paper Citation


in Harvard Style

Hoebel N. and Kreuzer S. (2010). CORD: A HYBRID APPROACH FOR EFFICIENT CLUSTERING OF ORDINAL DATA USING FUZZY LOGIC AND SELF-ORGANIZING MAPS. In Proceedings of the 6th International Conference on Web Information Systems and Technology - Volume 1: WEBIST, ISBN 978-989-674-025-2, pages 297-306. DOI: 10.5220/0002795402970306


in Bibtex Style

@conference{webist10,
author={Natascha Hoebel and Stanislav Kreuzer},
title={CORD: A HYBRID APPROACH FOR EFFICIENT CLUSTERING OF ORDINAL DATA USING FUZZY LOGIC AND SELF-ORGANIZING MAPS},
booktitle={Proceedings of the 6th International Conference on Web Information Systems and Technology - Volume 1: WEBIST},
year={2010},
pages={297-306},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002795402970306},
isbn={978-989-674-025-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Web Information Systems and Technology - Volume 1: WEBIST
TI - CORD: A HYBRID APPROACH FOR EFFICIENT CLUSTERING OF ORDINAL DATA USING FUZZY LOGIC AND SELF-ORGANIZING MAPS
SN - 978-989-674-025-2
AU - Hoebel N.
AU - Kreuzer S.
PY - 2010
SP - 297
EP - 306
DO - 10.5220/0002795402970306