SIMPLE AND EFFICIENT PROJECTIVE CLUSTERING

Clark F. Olson, Henry J. Lyons

2010

Abstract

We describe a new Monte Carlo algorithm for projective clustering that is both simple and efficient. Like previous Monte Carlo algorithms, we perform trials that sample a small subset of the data points to determine the dimensions in which the points are sufficiently close to form a cluster and then search the rest of the data for data points that are part of the cluster. However, our algorithm differs from previous algorithms in the method by which the dimensions of the cluster are determined and the method for determining the points in the cluster. This allows us to use smaller subsets of the data to determine the cluster dimensions and achieve improved efficiency over previous algorithms. The complexity of our algorithm is O(nd^(1 + log a / log b)), where n is the number of data points, d is the number of dimensions in the space, and a and b are parameters that specify which clusters should be found. To our knowledge, this is the lowest published complexity for an algorithm that is able to place a high lower bound on the probability of success. We present experiments that show that our algorithm outperforms previous algorithms on real and synthetic data.
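
The trial structure described above (sample a small subset of points, infer the dimensions in which they congregate, then sweep the full data set for the remaining cluster members) can be illustrated with a short sketch. The Python below is not the authors' algorithm: the seed size s, width w, and the thresholds min_dims and min_points are illustrative stand-ins for the parameters the paper derives from a and b, and the box test is one simple way to realize such a trial.

import numpy as np

def projective_cluster_trial(X, s=4, w=0.5, min_dims=2, min_points=10, rng=None):
    """One Monte Carlo trial for projective clustering (illustrative sketch only).

    Samples s points, keeps the dimensions in which those points span at most
    width w, and then gathers every data point lying within that width in all
    of the kept dimensions. Parameter names are hypothetical, not the paper's.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape

    # Sample a small seed subset of the data points.
    seed = X[rng.choice(n, size=s, replace=False)]

    # A dimension is kept if the seed points lie within width w of each other.
    lo, hi = seed.min(axis=0), seed.max(axis=0)
    dims = np.where(hi - lo <= w)[0]
    if len(dims) < min_dims:
        return None  # this sample does not suggest a projected cluster

    # Collect all points inside the box [lo, lo + w] in the kept dimensions;
    # the remaining dimensions are ignored (the cluster is projective).
    inside = np.all((X[:, dims] >= lo[dims]) & (X[:, dims] <= lo[dims] + w), axis=1)
    if inside.sum() < min_points:
        return None
    return dims, np.where(inside)[0]

def monte_carlo_projective_clustering(X, trials=200, **kwargs):
    """Run independent trials and keep the largest cluster found."""
    best = None
    for _ in range(trials):
        result = projective_cluster_trial(X, **kwargs)
        if result is not None and (best is None or len(result[1]) > len(best[1])):
            best = result
    return best

In this sketch the number of trials plays the role that the probability analysis in the paper makes precise: more trials raise the chance that at least one sampled subset comes entirely from a true cluster.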

References

  1. Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., and Park, J. S. (1999). Fast algorithms for projected clustering. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 61-72.
  2. Aggarwal, C. C. and Yu, P. S. (2000). Finding generalized projected clusters in high dimensional spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 70-81.
  3. Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 94-105.
  4. Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (2005). Automatic subspace clustering of high dimensional data. Data Mining and Knowledge Discovery, 11:5-33.
  5. Asuncion, A. and Newman, D. (2007). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html.
  6. Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. (1999). When is "nearest neighbor" meaningful? In Proceedings of the 7th International Conference on Database Theory, pages 217-235.
  7. Cheng, C. H., Fu, A. W., and Zhang, Y. (1999). Entropy-based subspace clustering for mining numerical data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 84-93.
  8. Dash, M., Choi, K., Scheuermann, P., and Liu, H. (2002). Feature selection for clustering - a filter solution. In Proceedings of the IEEE International Conference on Data Mining, pages 115-122.
  9. Ding, C., He, X., Zha, H., and Simon, H. D. (2002). Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the IEEE International Conference on Data Mining, pages 147-154.
  10. Goil, S., Nagesh, H., and Choudhary, A. (1999). MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report No. CPDC-TR9906-010, Northwestern University.
  11. Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3):264-323.
  12. Nagesh, H., Goil, S., and Choudhary, A. (2001). Adaptive grids for clustering massive data sets. In Proceedings of the SIAM International Conference on Data Mining.
  13. Parsons, L., Haque, E., and Liu, H. (2004). Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6(1):90-105.
  14. Procopiuc, C. M., Jones, M., Agarwal, P. K., and Murali, T. M. (2002). A Monte Carlo algorithm for fast projective clustering. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 418-427.
  15. Woo, K.-G., Lee, J.-H., and Lee, Y.-J. (2004). FINDIT: A fast and intelligent subspace clustering algorithm using dimension voting. Information and Software Technology, 46(4):255-271.
  16. Yiu, M. L. and Mamoulis, N. (2005). Iterative projected clustering by subspace mining. IEEE Transactions on Knowledge and Data Engineering, 17(2):176-189.


Paper Citation


in Harvard Style

Olson, C. F. and Lyons, H. J. (2010). SIMPLE AND EFFICIENT PROJECTIVE CLUSTERING. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 45-55. DOI: 10.5220/0003068400450055


in Bibtex Style

@conference{kdir10,
author={Clark F. Olson and Henry J. Lyons},
title={SIMPLE AND EFFICIENT PROJECTIVE CLUSTERING},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={45-55},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003068400450055},
isbn={978-989-8425-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - SIMPLE AND EFFICIENT PROJECTIVE CLUSTERING
SN - 978-989-8425-28-7
AU - Olson, C. F.
AU - Lyons, H. J.
PY - 2010
SP - 45
EP - 55
DO - 10.5220/0003068400450055