# Subspace Clustering with Distance-density Function and Entropy in High-dimensional Data

### Jiwu Zhao, Stefan Conrad

#### Abstract

Subspace clustering extends traditional clustering by finding clusters in subspaces of a data set, which makes it better suited to detecting clusters in high-dimensional data. However, most subspace clustering methods require many complicated parameter settings that are often troublesome to determine, which limits their applicability. In our previous work, we developed the subspace clustering method Automatic Subspace Clustering with Distance-Density function (ASCDD), which computes the density distribution directly in high-dimensional data sets using just one parameter. To make this parameter easier to choose, in this article we analyze the relation of neighborhood objects and investigate a new way of determining the parameter's range. Furthermore, we introduce a new method that applies entropy to detect potential subspaces in ASCDD, which reduces the complexity of detecting relevant subspaces.

#### References

- Aggarwal, C. C. and Yu, P. S. (2000). Finding generalized projected clusters in high dimensional spaces. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, SIGMOD '00, pages 70-81. ACM.
- Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD international conference on Management of data, SIGMOD '98, pages 94-105. ACM.
- Chang, J.-W. and Jin, D.-S. (2002). A new cell-based clustering method for large, high-dimensional data in data mining applications. In Proceedings of the 2002 ACM symposium on Applied computing, SAC '02, pages 503-507. ACM.
- Cheng, C.-H., Fu, A. W., and Zhang, Y. (1999). Entropy-based subspace clustering for mining numerical data. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '99, pages 84-93. ACM.
- Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data mining, volume 1996, pages 226-231. AAAI Press.
- Frank, A. and Asuncion, A. (2010). UCI machine learning repository.
- Friedman, J. H. and Meulman, J. J. (2004). Clustering objects on subsets of attributes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 815-849.
- Goil, S., Nagesh, H., and Choudhary, A. (1999). Mafia: Efficient and scalable subspace clustering for very large data sets. Technical Report CPDC-TR-9906-010, Northwestern University.
- Hinneburg, A. and Gabriel, H.-H. (2007). Denclue 2.0: fast clustering based on kernel density estimation. In Proceedings of the 7th international conference on Intelligent data analysis, IDA'07, pages 70-80. Springer-Verlag.
- Hinneburg, A., Hinneburg, E., and Keim, D. A. (1998). An efficient approach to clustering in large multimedia databases with noise. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, pages 58-65. AAAI Press.
- Kriegel, H.-P., Kröger, P., and Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data, 3:1:1-1:58.
- Kröger, P., Kriegel, H.-P., and Kailing, K. (2004). Density-connected subspace clustering for high-dimensional data. In Proc. SIAM Int. Conf. on Data Mining (SDM'04), pages 246-257.
- MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. University of California Press.
- Müller, E., Günnemann, S., Assent, I., and Seidl, T. (2009). Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment, 2(1):1270-1281.
- Parsons, L., Haque, E., and Liu, H. (2004). Subspace clustering for high dimensional data: A review. SIGKDD Explor. Newsl., 6:90-105.
- Procopiuc, C. M., Jones, M., Agarwal, P. K., and Murali, T. M. (2002). A Monte Carlo algorithm for fast projective clustering. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, SIGMOD '02, pages 418-427. ACM.
- Sim, K., Gopalkrishnan, V., Zimek, A., and Cong, G. (2012). A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery, pages 1-66.
- Woo, K.-G., Lee, J.-H., Kim, M.-H., and Lee, Y.-J. (2004). Findit: a fast and intelligent subspace clustering algorithm using dimension voting. Information and Software Technology, 46(4):255-271.
- Zhao, J. and Conrad, S. (2012). Automatic subspace clustering with density function. In International Conference on Data Technologies and Applications, DATA 2012, pages 63-69. SciTePress Digital Library.

#### Paper Citation

#### in Harvard Style

Zhao J. and Conrad S. (2013). **Subspace Clustering with Distance-density Function and Entropy in High-dimensional Data**. In *Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA*, ISBN 978-989-8565-67-9, pages 14-22. DOI: 10.5220/0004486600140022

#### in Bibtex Style

@conference{data13,

author={Jiwu Zhao and Stefan Conrad},

title={Subspace Clustering with Distance-density Function and Entropy in High-dimensional Data},

booktitle={Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA},

year={2013},

pages={14-22},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0004486600140022},

isbn={978-989-8565-67-9},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA

TI - Subspace Clustering with Distance-density Function and Entropy in High-dimensional Data

SN - 978-989-8565-67-9

AU - Zhao J.

AU - Conrad S.

PY - 2013

SP - 14

EP - 22

DO - 10.5220/0004486600140022