ASCDD can also work with the bottom-up principle. We
apply the same synthetic data sets to ASCDD and
SUBCLU in order to compare the performance of the
two algorithms.
The experimental data sets have ten dimensions and
1000 objects. In the first test we place five simple clusters
in different subspaces, with all ten dimensions sharing
the same value range. With properly chosen parameters,
both algorithms yield almost the same results: both
find the five clusters, and their running times are
similar. It is noteworthy that the parameter settings
must be adjusted as the dimensionality of the subspace
increases. Setting ε and minPts for SUBCLU is quite
difficult in high-dimensional subspaces, whereas in
ASCDD the threshold DDT is relatively simple to choose,
because DDT always lies between 0 and 1.
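To make this contrast concrete, the following sketch is our own illustration, not the authors' implementation: the data layout, the Gaussian kernel, the bandwidth, and the helper name normalized_density are all assumptions. It generates data in the spirit of this experiment and thresholds a density estimate that is rescaled to [0, 1], so that a DDT value can always be picked from that fixed interval.

```python
# Illustrative sketch (not the authors' code): a density threshold in (0, 1)
# is easy to tune because the density estimate is rescaled to a fixed range.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in the spirit of the experiment: 1000 objects, 10 dimensions,
# one cluster hidden in an assumed 2-D subspace, the rest uniform noise.
n, d = 1000, 10
data = rng.uniform(0.0, 1.0, size=(n, d))
subspace = [2, 5]                      # hypothetical relevant dimensions
data[:200, subspace] = rng.normal(0.5, 0.03, size=(200, 2))

def normalized_density(points, bandwidth=0.05):
    """Gaussian-kernel density per point, rescaled to [0, 1]."""
    diff = points[:, None, :] - points[None, :, :]
    sq = (diff ** 2).sum(axis=-1)
    dens = np.exp(-sq / (2.0 * bandwidth ** 2)).sum(axis=1)
    return dens / dens.max()           # DDT can now be chosen in (0, 1)

ddt = 0.3                              # assumed threshold value
dense = normalized_density(data[:, subspace]) >= ddt
print(f"{dense.sum()} of {n} objects exceed DDT={ddt} in subspace {subspace}")
```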
In the second experiment, we change the ten-dimensional
data so that the dimensions have various value ranges.
In this case, SUBCLU can no longer work in subspaces
of more than four dimensions, because in high-dimensional
spaces all objects appear sparse, and the strategy of
choosing a minimum neighborhood distance ε becomes less
effective. ASCDD, however, still works excellently in
this situation and discovers the five subspace clusters
exactly.
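This failure mode can be illustrated independently of either algorithm. In the sketch below (again our own illustration; the specific value ranges, the choice of k, and the helper knn_distance are assumptions), the typical neighborhood radius differs by orders of magnitude between subspaces, so no single ε serves them all, whereas a relative threshold such as DDT is unaffected by absolute scale.

```python
# Illustrative sketch (assumption, not from the paper): with heterogeneous
# value ranges, k-NN distances differ so much across subspaces that no
# single eps fits all of them.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
ranges = np.array([1.0, 10.0, 100.0, 1e3, 1e4, 1e5, 1.0, 10.0, 100.0, 1e3])
data = rng.uniform(0.0, 1.0, size=(n, 10)) * ranges   # mixed value ranges

def knn_distance(points, k=5):
    """Distance from each point to its k-th nearest neighbor."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(np.sort(d2, axis=1)[:, k])          # index 0 is the point itself

for dims in ([0, 6], [3, 4, 5]):       # a small-range and a large-range subspace
    dk = knn_distance(data[:, dims])
    print(f"subspace {dims}: typical k-NN distance ~ {np.median(dk):.3g}")
```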
5 CONCLUSIONS
In this paper, we proposed a novel subspace clustering
method (ASCDD) for high-dimensional data sets, based
on our former work (SUGRA). In contrast to traditional
clustering methods, ASCDD is much easier to apply,
requiring just one simple parameter; it provides useful
distribution information and is suitable for different
types of data. The result of ASCDD is accurate, presents
clusters according to their sizes, and does not depend
on the input order. Compared with its predecessor SUGRA,
ASCDD can investigate clusters directly in high-dimensional
subspaces, and moreover, its density function is smoother
than SUGRA's.
From the results obtained so far, ASCDD works well
in most situations. However, the clustering result and
its quality depend on the choice of the parameter DDT.
One extension of the approach is therefore to investigate
a proper range for choosing DDT, which would make the
clustering process more convenient. Another plan for
future work is to optimize the subspace selection and
to reduce the computation time as the number of objects
and dimensions increases.
REFERENCES
Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., and
Park, J. S. (1999). Fast algorithms for projected clus-
tering. In Proceedings of the 1999 ACM SIGMOD in-
ternational conference on Management of data, SIG-
MOD ’99, pages 61–72. ACM.
Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P.
(1998). Automatic subspace clustering of high dimen-
sional data for data mining applications. In Proceed-
ings of the 1998 ACM SIGMOD international confer-
ence on Management of data, SIGMOD ’98, pages
94–105. ACM.
Cheng, C.-H., Fu, A. W., and Zhang, Y. (1999). Entropy-
based subspace clustering for mining numerical data.
In Proceedings of the fifth ACM SIGKDD interna-
tional conference on Knowledge discovery and data
mining, KDD ’99, pages 84–93. ACM.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996).
A density-based algorithm for discovering clusters in
large spatial databases with noise. In KDD, pages
226–231.
Frank, A. and Asuncion, A. (2010). UCI machine learning
repository. [http://archive.ics.uci.edu/ml]. University
of California, Irvine, School of Information and Com-
puter Sciences.
Goil, S., Nagesh, H., and Choudhary, A. (1999). Mafia:
Efficient and scalable subspace clustering for very
large data sets. Technical Report CPDC-TR-9906-
010, Northwestern University.
Hinneburg, A. and Gabriel, H.-H. (2007). Denclue 2.0: fast
clustering based on kernel density estimation. In Pro-
ceedings of the 7th international conference on Intel-
ligent data analysis, IDA’07, pages 70–80. Springer-
Verlag.
Hinneburg, A. and Keim, D. A. (1998).
An efficient approach to clustering in large multime-
dia databases with noise. In Proc. 4th Int. Conf. on
Knowledge Discovery and Data Mining, pages 58–65.
AAAI Press.
Kailing, K., Kriegel, H.-P., and Kröger, P. (2004). Density-
connected subspace clustering for high-dimensional
data. In Proc. SIAM Int. Conf. on Data Mining
(SDM'04), pages 246–257.
Kriegel, H.-P., Kröger, P., and Zimek, A. (2009). Clustering
high-dimensional data: A survey on subspace cluster-
ing, pattern-based clustering, and correlation cluster-
ing. ACM Transactions on Knowledge Discovery from
Data, 3:1:1–1:58.
MacQueen, J. B. (1967). Some methods for classification
and analysis of multivariate observations. In Proc. of
the fifth Berkeley Symposium on Mathematical Statis-
tics and Probability, volume 1, pages 281–297. Uni-
versity of California Press.
Parsons, L., Haque, E., and Liu, H. (2004). Subspace clus-
tering for high dimensional data: A review. SIGKDD
Explor. Newsl., 6:90–105.
Woo, K.-G., Lee, J.-H., Kim, M.-H., and Lee, Y.-J. (2004).
Findit: a fast and intelligent subspace clustering algo-
rithm using dimension voting. Information and Soft-
ware Technology, 46(4):255–271.