Subspace Clustering with Distance-density Function and Entropy in

High-dimensional Data

Jiwu Zhao and Stefan Conrad

Institute of Computer Science, Databases and Information Systems,

Heinrich-Heine University, Universitaetsstr. 1, 40225 Duesseldorf, Germany

Keywords:

Subspace Clustering, Density, High dimensionality, Entropy.

Abstract:

Subspace clustering is an extension of traditional clustering that enables finding clusters in subspaces within a data set, which makes it more suitable for detecting clusters in high-dimensional data sets. However, most subspace clustering methods require many complicated parameter settings that are troublesome to determine, which limits their applicability. In our previous work, we developed the subspace clustering method Automatic Subspace Clustering with Distance-Density function (ASCDD), which computes the density distribution directly in high-dimensional data sets using just one parameter. To facilitate choosing this parameter, in this article we analyze the relation between neighboring objects and investigate a new way of determining the range of the parameter. Furthermore, we introduce a new method that applies entropy to detect potential subspaces in ASCDD, which reduces the complexity of detecting relevant subspaces.

1 INTRODUCTION

We usually need to discover unknown or hidden information in raw data. Clustering techniques help us to discover interesting patterns in data sets. Clustering methods divide the observations into groups (clusters), so that observations in the same cluster are similar, whereas those from different clusters are dissimilar. Clustering is important for data analysis in many fields, including market basket analysis, bioscience, and fraud detection.

Unlike traditional clustering methods, which seek

clusters only in the whole space, subspace cluster-

ing enables clustering in particular projections (sub-

spaces) within a data set. Subspace clustering is usu-

ally applied in high-dimensional data sets.

Many well-known subspace clustering algorithms can find clusters in subspaces of the data set. However, the practical effectiveness of these algorithms is a problem. For instance, it is commonly known that the majority of the algorithms demand many parameter settings for subspace clustering. In addition, determining the input parameters is not simple. Furthermore, varying sensitive parameters often causes very different clustering results.

In our former work, we introduced the density-based subspace clustering algorithm ASCDD (Automatic Subspace Clustering with Distance-Density function) (Zhao and Conrad, 2012). With its density function, the distribution of data is calculated directly in any subspace, and clusters are automatically explored according to the sizes of clusters. The method can be applied to differently scaled data. Moreover, the algorithm uses only one parameter, which simplifies the application process. In this paper, we investigate the range of the parameter, so that it can be set within a proper range. Another important improvement of ASCDD is that we introduce an entropy-based subspace search.

The remainder of this paper is organized as fol-

lows: In section 2, we present related work in the area

of subspace clustering and some ideas from other al-

gorithms. Section 3 describes the subspace cluster-

ing method ASCDD and presents our new ideas about

choosing the parameter and subspace detection with

entropy. Section 4 presents experimental studies for

verifying the proposed method. Finally, section 5 is

the conclusion of the paper.

Zhao J. and Conrad S.
Subspace Clustering with Distance-density Function and Entropy in High-dimensional Data.
DOI: 10.5220/0004486600140022
In Proceedings of the 2nd International Conference on Data Technologies and Applications (DATA-2013), pages 14-22
ISBN: 978-989-8565-67-9
© 2013 SCITEPRESS (Science and Technology Publications, Lda.)

2 RELATED WORK

In recent years, there has been an increasing amount of literature on subspace clustering. Surveys conducted by (Parsons et al., 2004) and (Kriegel et al.,

2009) have divided subspace clustering algorithms

into two groups: top-down and bottom-up. Top-

down methods (e.g. PROCLUS (Aggarwal et al.,

1999), ORCLUS (Aggarwal and Yu, 2000), FINDIT

(Woo et al., 2004), COSA (Friedman and Meulman,

2004)) use multiple iterations for improving the clus-

tering results. Bottom-up methods (e.g. CLIQUE

(Agrawal et al., 1998), ENCLUS (Cheng et al., 1999),

MAFIA (Goil et al., 1999), CBF (Chang and Jin,

2002), DOC (Procopiuc et al., 2002)) first find clusters in low-dimensional subspaces and then expand the search into higher dimensions. Other surveys by (Müller et al., 2009) and (Sim et al., 2012) categorize the basic subspace clustering methods generally into grid-based, clustering-oriented and density-based approaches.

Grid-based subspace clustering algorithms partition the data space into cells with grids and generate subspace clusters by combining dense cells, i.e. cells containing a large number of objects. CLIQUE (Agrawal et al., 1998) is a typical representative of grid-based subspace clustering algorithms. It first detects one-dimensional subspace clusters and then combines them to find high-dimensional subspace clusters. CLIQUE has many extensions. One of them is ENCLUS (Cheng et al., 1999), which measures entropy values to detect potential subspaces with clusters, based on the observation that a subspace with clusters has lower entropy than a subspace without clusters. The entropy calculation requires a density, which is calculated as follows: each dimension is divided into cells, and the density of a cell is the proportion of objects contained in that cell to all objects. After detecting all subspace candidates, the clustering process is similar to CLIQUE.
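The grid-based entropy described above can be illustrated with a minimal sketch. It assumes values already scaled to [0, 1] and an equal-width grid; the function name `grid_entropy` and the parameter `intervals` are hypothetical and not taken from ENCLUS.

```python
import math
from collections import Counter

def grid_entropy(points, intervals):
    """Shannon entropy of the cell densities of an equal-width grid.

    points    -- list of equal-length tuples with values in [0, 1]
    intervals -- number of equal-width intervals per dimension (assumed name)
    """
    # Map every object to the grid cell it falls into.
    counts = Counter(
        tuple(min(int(v * intervals), intervals - 1) for v in p)
        for p in points
    )
    n = len(points)
    # Density of a cell = share of all objects that fall into it.
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Three objects share one cell, one object lies apart.
clustered = [(0.10, 0.10), (0.12, 0.11), (0.11, 0.13), (0.90, 0.90)]
# Every object occupies its own cell.
spread = [(0.1, 0.1), (0.1, 0.9), (0.9, 0.1), (0.9, 0.9)]

print(grid_entropy(clustered, 2), grid_entropy(spread, 2))
```

In this toy data the clustered set yields the lower entropy, which matches the observation that a subspace with clusters has lower entropy than one without.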

A clustering-oriented subspace clustering method assigns objects to k medoids (similar to k-means (MacQueen, 1967)) to form clusters with corresponding subspaces. Representatives of clustering-oriented subspace clustering methods are PROCLUS and its extensions, such as ORCLUS and FINDIT.

Many density-based subspace clustering approaches are based on the technique of DBSCAN (Ester et al., 1996). For example, SUBCLU (Kröger et al., 2004) is an extension of DBSCAN intended for subspace clustering. The density of an object is measured by the number of objects in its ε-neighborhood. A cluster in a relevant subspace satisfies two properties: all objects within a cluster are density-connected with each other, and if an object is density-connected to any object of a cluster, it belongs to the cluster as well.
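As a rough illustration of the ε-neighborhood idea restricted to a subspace, the following sketch collects the objects within distance ε of a given object, measuring distance only on the chosen dimensions. The function name `eps_neighborhood` and its parameters are hypothetical; this is not the SUBCLU implementation.

```python
import math

def eps_neighborhood(points, index, eps, dims):
    """Indices of objects within distance eps of points[index], using only dims."""
    center = points[index]
    return [
        j for j, p in enumerate(points)
        if math.sqrt(sum((p[a] - center[a]) ** 2 for a in dims)) <= eps
    ]

points = [(0.0, 0.0, 5.0), (0.1, 0.0, 9.0), (3.0, 3.0, 5.0)]

# The same object has different neighbors in different subspaces.
print(eps_neighborhood(points, 0, 0.5, (0, 1)))  # dimensions 0 and 1
print(eps_neighborhood(points, 0, 0.5, (2,)))    # dimension 2 only
```

The object itself is included in its own neighborhood here, and the density of an object is then simply the length of the returned list.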

Other density-based clustering techniques, such as DENCLUE (Hinneburg et al., 1998) and DENCLUE 2.0 (Hinneburg and Gabriel, 2007), use a Gaussian kernel function as the abstract density function and apply a hill climbing method to detect cluster centers. It is unnecessary to estimate the number or positions of clusters, because clustering is based on the density of each point. However, the estimation of parameters such as mean and variance in DENCLUE, or the iteration threshold and the percentage of the largest posteriors in DENCLUE 2.0, is still necessary.

Almost all the mentioned subspace clustering methods suffer from the serious limitation that appropriate parameter values must be determined. Examples are the numbers of clusters and subspaces in top-down methods, and the density, grid interval and cluster size in bottom-up methods. These parameters influence the iterations or the clustering results, but they are difficult to determine. In order to make the subspace clustering task more practical, it is necessary to simplify the parameters.

With the motivation of facilitating the determination of parameters, the subspace clustering method ASCDD (Automatic Subspace Clustering with Distance-Density function) was introduced in our previous work. ASCDD can be applied directly in any subspace to search for clusters. Based on the density values calculated with its density function, the centers of clusters can be found easily. The idea of using a density function is inspired by DENCLUE. However, the definitions of the density functions are different: ASCDD's density function can be applied directly to any subspace. A cluster in ASCDD is explored by expanding the neighbors of an object with high density. Nevertheless, the definition of and the search for “density-connected” neighbors are totally different from DBSCAN. The clustering process in ASCDD needs just one parameter called DDT (distance-density threshold), which determines whether two objects are neighbors (belong to the same cluster). Since choosing a proper DDT is important for ASCDD, in this paper we thoroughly investigate the relation between the setting of the parameter DDT and the clustering results, and develop a way to set the range of DDT.

Although ASCDD can be applied to any subspace directly, an effective way of choosing the right subspaces with potential clusters is still required, instead of searching every subspace. Our solution is to apply entropy to detect the potential subspaces and thereby reduce the complexity of the subspace search. Unlike ENCLUS, ASCDD's entropy is not calculated by applying grids, but with the help of ASCDD's density function. The “interesting subspaces” in ENCLUS are the subspaces whose entropy is below a threshold ω and whose interest gain exceeds a threshold ε. However, the difficulty is to choose proper parameters for an unfamiliar data set. In order to apply entropy more simply, we use another technique to locate significant subspaces. The entropy extension makes ASCDD more efficient by detecting clusters directly in subspace candidates. More details are given in the following sections.

3 AUTOMATIC SUBSPACE CLUSTERING WITH DISTANCE-DENSITY FUNCTION (ASCDD)

Generally, a data set can be considered as a pair $(A, O)$, where $A = \{a_1, a_2, \cdots\}$ is the set of all attributes (dimensions) and $O = \{o_1, o_2, \cdots\}$ is the set of all objects. $o_i^{a}$ denotes the value of an object $o_i$ in dimension $a$.

A subspace cluster $S$ is also a data set and can be defined as follows:

$$S = (\tilde{A}, \tilde{O})$$

where the subspace $\tilde{A} \subseteq A$ and $\tilde{O} \subseteq O$, and $S$ must satisfy a particular condition $C$, which is defined differently in each subspace clustering algorithm. However, a general principle of $C$ is that objects in the same cluster are similar, while objects from different clusters are dissimilar. $S^{\tilde{A}}$ indicates all subspace clusters that refer to $\tilde{A}$.

Suppose $S_1$ and $S_2$ are two subspace clusters, where $S_1 = (\tilde{A}_1, \tilde{O}_1)$ and $S_2 = (\tilde{A}_2, \tilde{O}_2)$. The intersection of two subspace clusters is defined as follows: $S_1 \cap S_2 = (\tilde{A}_1 \cup \tilde{A}_2, \tilde{O}_1 \cap \tilde{O}_2)$.

The subspaces, objects and subspace clusters have the following relations:

• If $\tilde{A}_1 \neq \tilde{A}_2 \lor \tilde{O}_1 \neq \tilde{O}_2 \implies S_1 \neq S_2$: subspace clusters that have different subspaces or objects are considered different ones.

• If $\tilde{A}_1 \supseteq \tilde{A}_2 \land \tilde{O}_1 = \tilde{O}_2$ or $\tilde{A}_1 = \tilde{A}_2 \land \tilde{O}_1 \supseteq \tilde{O}_2 \implies S_1 > S_2$. So if $S_1 > S_2 > \cdots > S_n$, normally only the largest subspace cluster $S_1$ is considered as a clustering result.

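The Automatic Subspace Clustering with Distance-Density function (ASCDD) is based on its density function. The following definitions are important for the density function. The distance-density of objects $o_i$ and $o_j$ with regard to subspace $\tilde{A}$ is defined as follows:

$$d^{\tilde{A}}_{o_i,o_j} = \frac{1}{(r^{\tilde{A}}_{o_i,o_j} \cdot |O| + 1)} \quad (1)$$

where $r^{\tilde{A}}_{o_i,o_j}$ is the normalized Euclidean distance, which is calculated as follows: $r^{\tilde{A}}_{o_i,o_j} = \sqrt{\sum_{\forall a \in \tilde{A}} (\bar{o}_i^{a} - \bar{o}_j^{a})^2}$. The normalization of an object $o_i$ in one dimension $a$ is defined as $\bar{o}_i^{a} = \frac{o_i^{a} - \min(o^{a})}{\max(o^{a}) - \min(o^{a})}$, so $\bar{o}_i^{a} \in [0, 1]$.

The density of an object $o_i$ relating to all objects in subspace $\tilde{A}$ is defined as follows:

$$D^{\tilde{A}}_{o_i} = \sum_{\forall o_j \in O} d^{\tilde{A}}_{o_i,o_j} = \sum_{\forall o_j \in O} \left( \frac{1}{r^{\tilde{A}}_{o_i,o_j} \cdot |O| + 1} \right) \quad (2)$$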
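The two definitions above can be sketched in a few lines of Python. The sketch assumes that the distance-density has the form $1/(r \cdot |O| + 1)$, which is one reading of Equation 1, and that the sum in Equation 2 includes the object itself. The function names are hypothetical.

```python
import math

def normalize(objects):
    """Min-max normalize every dimension to [0, 1]."""
    dims = range(len(objects[0]))
    lo = [min(o[a] for o in objects) for a in dims]
    hi = [max(o[a] for o in objects) for a in dims]
    return [
        tuple((o[a] - lo[a]) / (hi[a] - lo[a]) if hi[a] > lo[a] else 0.0
              for a in dims)
        for o in objects
    ]

def distance_density(norm, i, j, subspace):
    """Assumed form of Equation 1: 1 / (r * |O| + 1)."""
    r = math.sqrt(sum((norm[i][a] - norm[j][a]) ** 2 for a in subspace))
    return 1.0 / (r * len(norm) + 1.0)

def density(norm, i, subspace):
    """Equation 2: sum of distance-densities to all objects (self included)."""
    return sum(distance_density(norm, i, j, subspace) for j in range(len(norm)))

# Three objects close together and one far away, in one dimension.
objects = [(0.0,), (1.0,), (2.0,), (10.0,)]
norm = normalize(objects)
densities = [density(norm, i, (0,)) for i in range(len(norm))]
print(densities)
```

In this example the middle object of the group of three has the highest density and the isolated object the lowest, which is the behavior the density function is meant to capture.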

The density function of ASCDD can be considered as a distribution function, which describes the distribution smoothly. The characteristics of clusters are clearly visible in the density: the cluster center has a higher density than objects at the edge, and therefore the position and size of the clusters can be identified easily. Another important feature is that the algorithm can be executed in any subspace, which is simple and convenient for clustering in a particular subspace. Figure 1 shows an example of the density for a one-dimensional subspace. The peaks are possible centers of clusters, which are the key targets of our study.

Figure 1: An example of density function. (Axes: position of the objects, from 0 to 1; density.)

3.1 Distance-density Threshold

Clustering is the next step after the density values are calculated. The objects in a cluster are considered “connected” or “neighbors”. A threshold for choosing proper neighbors, called DDT (Distance-Density Threshold), is introduced in ASCDD and is important to the clustering step. The neighbors of an object $o_i$ are defined as follows:

$$Neighbor(o_i) = \{o_j \mid d^{\tilde{A}}_{o_i,o_j} > DDT\} \quad (3)$$

DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications

An object and its neighbors are considered to be in the same cluster.

Choosing a proper DDT is important, because DDT affects the size of clusters. Only the neighbors whose distance-density to the center object is higher than DDT meet the condition. It is apparent that the larger DDT is chosen, the fewer neighbors will be selected. Since the distance-density $d$ has a value between 0 and 1, the parameter DDT also has to be chosen within the range (0, 1). However, an improper DDT (too small or too big) can cause all objects to belong to one cluster, or no cluster to be found at all. So a proper value for DDT should be found in (0, 1).

We notice that the two values $T_{min} = \min_{\forall i}\left(\max_{\forall j \neq i}(d^{\tilde{A}}_{o_i,o_j})\right)$ and $T_{max} = \max_{\forall i}\left(\max_{\forall j \neq i}(d^{\tilde{A}}_{o_i,o_j})\right)$ are important for the determination of DDT. $\max_{\forall j \neq i}(d^{\tilde{A}}_{o_i,o_j})$ is the maximum distance-density of $o_i$ with regard to the subspace $\tilde{A}$. Each object $o_i$ has its maximum distance-density value with another object $o_j$ in $\tilde{A}$. Obviously, $o_j$ has the minimum Euclidean distance to $o_i$. $T_{min}$ is the smallest maximum distance-density of all objects, and $T_{max}$ is the largest distance-density of all objects. If $DDT \geq T_{max}$, there will be no clustering result, because no object has a neighbor. If $DDT < T_{min}$, all objects will be grouped into one cluster, since all objects are connected through the neighborhood. Obviously, DDT should be set between $T_{min}$ and $T_{max}$ to get a clustering result, so DDT is defined as follows:

$$DDT = q \cdot T_{min} + (1 - q) \cdot T_{max}, \quad 0 < q < 1 \quad (4)$$

Figure 2 illustrates an example of the values $T_{min}$ and $T_{max}$. We notice that DDT should be near $T_{min}$ to get a complete result. In Figure 2, $o_{min}$ is the object with distance-density $T_{min}$. If DDT is close to $T_{min}$, many objects with distance-density values bigger than the minimum have the chance to be clustered in the next step. Conversely, if DDT is close to $T_{max}$, the number of selected objects will be much smaller. So by setting q close to 1, ASCDD can get a relatively complete result in most cases.

Figure 2: An example of $T_{min}$ and $T_{max}$.

Note that $T_{min}$ and $T_{max}$ differ according to the subspace $\tilde{A}$, so DDT normally also has different values in different subspaces.
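The choice of DDT through q (Equation 4) can be sketched as follows. The sketch takes a precomputed square matrix of pairwise distance-densities and excludes each object from its own maximum and from its own neighbor set, which is an assumption made here so that the self-value 1 does not dominate. All function names are hypothetical.

```python
def ddt_range(d):
    """Return (T_min, T_max) for a square matrix d of pairwise distance-densities."""
    n = len(d)
    # Largest distance-density of each object to any other object.
    best = [max(d[i][j] for j in range(n) if j != i) for i in range(n)]
    return min(best), max(best)

def ddt(d, q):
    """Equation 4: DDT = q * T_min + (1 - q) * T_max, with 0 < q < 1."""
    t_min, t_max = ddt_range(d)
    return q * t_min + (1 - q) * t_max

def neighbors(d, i, threshold):
    """Equation 3: objects whose distance-density to object i exceeds the threshold."""
    return [j for j in range(len(d)) if j != i and d[i][j] > threshold]

# Toy matrix: objects 0 and 1 are close, object 2 is farther away.
d = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
threshold = ddt(d, 0.96)
print(ddt_range(d), threshold, neighbors(d, 0, threshold))
```

With q = 0.96 the threshold lies just above $T_{min}$, so objects 0 and 1 remain neighbors while object 2, whose best distance-density equals $T_{min}$, has none.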

3.2 Applying Entropy for Finding Potential Subspace

Another issue is choosing the potential subspaces with clusters. Our solution is to apply entropy to detect subspaces. The authors of ENCLUS (Cheng et al., 1999) introduced a method of applying entropy for subspace clustering. However, ASCDD calculates and applies entropy in subspace clustering in a different way.

Entropy is a measure of the amount of uncertainty regarding a random variable. For a discrete random variable $X$ with $n$ possible outcomes $\{x_i \mid i = 1, \cdots, n\}$, the Shannon entropy is defined as follows: $H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$, where $p(\cdot)$ is the probability mass function. Obviously $H(X) \geq 0$. Entropy has an important property: variables with more uncertainty have higher entropy than variables with less uncertainty. For the clustering purpose, we can say that a subspace with clusters has a lower entropy than a subspace without clusters.

The entropy reaches its maximum if all outcomes are equally likely:

$$H(p_1, \cdots, p_n) \leq H\left(\frac{1}{n}, \cdots, \frac{1}{n}\right) = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n$$

Sometimes the normalized entropy is much more convenient, because it has the range [0, 1] for any n. The normalized entropy is defined as follows:

$$E(X) = \frac{H(X)}{\log n} = -\sum_{i=1}^{n} p(x_i) \log p(x_i) / \log n$$
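The normalized entropy can be checked numerically with a short sketch. The function name `normalized_entropy` is hypothetical, terms with zero probability are skipped (using the usual convention $0 \log 0 = 0$), and at least two outcomes are assumed.

```python
import math

def normalized_entropy(probabilities):
    """E(X) = H(X) / log n for a probability mass function given as a list (n >= 2)."""
    n = len(probabilities)
    h = -sum(p * math.log(p) for p in probabilities if p > 0)
    return h / math.log(n)

uniform = [0.25, 0.25, 0.25, 0.25]   # all outcomes equally likely
skewed = [0.7, 0.1, 0.1, 0.1]        # one outcome dominates
certain = [1.0, 0.0, 0.0, 0.0]       # no uncertainty at all

print(normalized_entropy(uniform), normalized_entropy(skewed), normalized_entropy(certain))
```

The uniform distribution reaches the maximum value 1, the skewed one lies strictly between 0 and 1, and the distribution without uncertainty gives 0.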

Unlike ENCLUS, we define the probability of an object $o_i$ in $\tilde{A}$ as

$$p(o_i) = \frac{D^{\tilde{A}}_{o_i}}{\sum_{\forall j} D^{\tilde{A}}_{o_j}}$$

Obviously $0 \leq p(o_i) \leq 1$ and $\sum_{\forall i} p(o_i) = 1$. An object with high density also has a big value $p(o_i)$, which corresponds to the properties of a probability mass function.

SubspaceClusteringwithDistance-densityFunctionandEntropyinHigh-dimensionalData

We apply the normalized entropy $E(\tilde{A})$ in ASCDD in order to facilitate the measurement and comparison of entropy values for any subspace. As introduced above, $E(\tilde{A})$ is defined as follows:

$$E(\tilde{A}) = -\sum_{i=1}^{n} p(o_i) \log p(o_i) / \log n$$

$E(\tilde{A})$ is applicable to any subspace $\tilde{A}$. A small $E(\tilde{A})$ value indicates that the density values of the objects in $\tilde{A}$ are distributed unevenly, which means there is a better chance of detecting clusters in $\tilde{A}$. A big $E(\tilde{A})$ shows that the objects are distributed more uniformly. The maximum value of $E(\tilde{A})$ is 1. However, uniformly distributed objects do not have exactly the same density values in ASCDD, because the densities of objects in the middle are slightly bigger than the densities at the edge. The difference is not large, so in this situation $E(\tilde{A})$ is smaller than 1 but very close to 1.
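The density-based entropy of a subspace can be sketched as follows. The sketch assumes already normalized data, the distance-density form $1/(r \cdot |O| + 1)$, and $n = |O|$ in the normalization. The function name `subspace_entropy` is hypothetical.

```python
import math

def subspace_entropy(norm, subspace):
    """Normalized entropy of the density-based probabilities p(o_i) in a subspace."""
    n = len(norm)
    densities = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            r = math.sqrt(sum((norm[i][a] - norm[j][a]) ** 2 for a in subspace))
            total += 1.0 / (r * n + 1.0)   # assumed distance-density form
        densities.append(total)
    s = sum(densities)
    probs = [x / s for x in densities]     # p(o_i) = D_i / sum of all D
    return -sum(p * math.log(p) for p in probs) / math.log(n)

# Dimension 0: a tight group plus two scattered objects.
# Dimension 1: evenly spaced objects.
norm = [
    (0.00, 0.0), (0.01, 0.2), (0.02, 0.4),
    (0.03, 0.6), (0.50, 0.8), (1.00, 1.0),
]
e_grouped = subspace_entropy(norm, (0,))
e_even = subspace_entropy(norm, (1,))
print(e_grouped, e_even)
```

In this toy data the evenly spaced dimension gives an entropy below but very close to 1, while the dimension with a dense group gives a smaller value.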

The entropy values of a low-dimensional subspace and a higher-dimensional subspace are related, which helps us to speculate about the potential subspaces. If the entropy of a subspace $\tilde{A}$ and the entropy of a higher-dimensional subspace $\tilde{A} \cup \{a_i\}$ satisfy $E(\tilde{A} \cup \{a_i\}) < E(\tilde{A})$, then the subspace $\tilde{A} \cup \{a_i\}$ has more clearly separated clusters than $\tilde{A}$. Conversely, if $E(\tilde{A} \cup \{a_i\}) > E(\tilde{A})$, it is likely that the objects in the subspace $\tilde{A} \cup \{a_i\}$ are distributed more uniformly than those in the subspace $\tilde{A}$.

Our aim is to explore potential subspaces through the properties of entropy in order to reduce the complexity. Exploring a high-dimensional potential subspace from a low-dimensional subspace uses this principle: if $E(\tilde{A} \cup \{a_i\})$ is not bigger than the entropy of any subspace of $\tilde{A} \cup \{a_i\}$, we say the dimension $a_i$ can be integrated into subspace $\tilde{A}$, which is described as follows:

$$E(\tilde{A} \cup \{a_i\}) \leq \min(\{E(X) \mid \forall X \subset \tilde{A} \cup \{a_i\}\})$$

The process of searching for potential subspaces starts from a one-dimensional subspace with low entropy. For instance, suppose $a_1$ is a subspace candidate. If the entropy $E(\{a_1, a_2\}) < \min(E(a_1), E(a_2))$, then the subspace candidate expands from $a_1$ to the new subspace $\{a_1, a_2\}$. Suppose a subspace candidate $\tilde{A}$ satisfies the condition $\forall a_i, E(\tilde{A}) < E(\tilde{A} \cup \{a_i\})$; then $\tilde{A}$ has reached its maximum dimension. The expansion stops when the subspace candidate reaches its maximum dimension.

Figure 3 is a simple example of subspace clustering. It is not straightforward to cluster directly in the three-dimensional space, but if the objects are projected into any two-dimensional subspace, the clustering process is more effective, because in each two-dimensional subspace one cluster is much tighter than the other two clusters. Obviously, the two-dimensional subspaces $\{x, y\}$, $\{y, z\}$, $\{x, z\}$ are subspace candidates. This result can also be verified through the subspace searching method. The entropy values of the different subspaces have the following relations:

$$E(x, y),\ E(y, z),\ E(x, z) < E(x),\ E(y),\ E(z) < E(x, y, z)$$

The entropy of each two-dimensional subspace is smaller than the entropy of any one-dimensional or the three-dimensional subspace. So each two-dimensional subspace reaches the maximum dimension. The subspace searching process starts from one dimension and stops at the two-dimensional subspaces, whereas the three-dimensional space is not considered because it has a bigger entropy value than the two-dimensional subspaces.

Figure 3: An example of three dimensional subspace clustering. (Legend: cluster 1, cluster 2, cluster 3.)

3.3 Algorithm

The clustering process of ASCDD consists of two

steps. The ﬁrst step is searching the potential sub-

spaces and the second step is exploring clusters from

the potential subspaces.

We use a greedy strategy to search for the potential subspaces, which is shown in Algorithm 1. The search starts from a one-dimensional subspace with low entropy. A higher-dimensional subspace is considered a subspace candidate only if it has a lower entropy than all its subspaces.
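The greedy search of Algorithm 1 can be sketched in Python as follows. The sketch follows the listing literally: attributes are sorted by their one-dimensional entropy, and a candidate is extended by a later attribute whenever the joint entropy is lower than both the candidate's entropy and the attribute's entropy. The `entropy` argument is a hypothetical callable that returns the entropy of a set of attributes; here it is backed by a hand-made lookup table, not by real data.

```python
def search_subspaces(attributes, entropy):
    """Greedy subspace search; entropy(frozenset_of_attributes) returns a float."""
    order = sorted(attributes, key=lambda a: entropy(frozenset([a])))
    candidates = []
    for i, a in enumerate(order):
        c = frozenset([a])
        for b in order[i + 1:]:
            # Extend the candidate only if the joint entropy is lower than both parts.
            if entropy(c | {b}) < min(entropy(c), entropy(frozenset([b]))):
                c = c | {b}
        if c not in candidates:
            candidates.append(c)
    return candidates

# Made-up entropy values: only the pair {a, b} has a lower joint entropy.
table = {
    frozenset("a"): 0.70,
    frozenset("b"): 0.80,
    frozenset("c"): 0.97,
    frozenset("ab"): 0.60,
}

def toy_entropy(subspace):
    # Any subspace missing from the table is treated as nearly uniform.
    return table.get(frozenset(subspace), 0.99)

result = search_subspaces(["c", "a", "b"], toy_entropy)
print(result)
```

In this toy setting the candidate that starts at attribute a grows to {a, b}, while the candidates starting at b and c stay one-dimensional.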

Algorithm 2 illustrates the clustering process of ASCDD. The clustering process for a subspace candidate $\tilde{A}$ is divided into four steps.

I. $\forall i$, calculate $D^{\tilde{A}}_{o_i}$.

II. Take the starting object $o_i$ that has the maximum density of the current set of objects $O_{current}$.

III. Find all neighbors of $o_i$ and set them as a cluster $S$, then expand $S$ by finding new neighbors of the objects in $S$ until no new neighbor is found.

IV. Remove the objects in $S$ from $O_{current}$, and repeat from step II until no more new cluster is found.

Algorithm 1: Searching subspace.
Input: $(A, O)$
Output: Subspace Candidate Set: $SCS$
    sort attributes ascending by $E(a_i)$: $E(a_i) \leq E(a_j)$ when $i < j$
    $SCS = \emptyset$
    for $i = 1$ to $|A|$ do
        $C = \{a_i\}$
        for $j = i + 1$ to $|A|$ do
            $minEntropy = \min(E(C), E(a_j))$
            if $E(C \cup \{a_j\}) < minEntropy$ then
                $C = C \cup \{a_j\}$
        $SCS = SCS \cup \{C\}$

Algorithm 2: Clustering.
Input: $(A, O)$, SubspaceCandidateSet
Output: Set of all clusters $\mathcal{S}$
    $\mathcal{S} = \emptyset$
    foreach $\tilde{A} \in$ SubspaceCandidateSet do
        $O_{current} = O$
        $\forall i$, calculate $D^{\tilde{A}}_{o_i}$
        while $O_{current} \neq \emptyset$ do
            $o_i$ has $\max(D^{\tilde{A}}_{o_i})$, $\forall o_i \in O_{current}$
            $\tilde{O} = Neighbor(o_i)$
            Iteration: $\forall o_j \in \tilde{O}$, $Neighbor(o_j) \subseteq \tilde{O}$
            $O_{current} = O_{current} - \tilde{O}$
            $S = (\tilde{A}, \tilde{O})$, $\mathcal{S} = \mathcal{S} \cup S$
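Steps I to IV for a single subspace candidate can be sketched as follows. The sketch works on a precomputed square matrix of pairwise distance-densities and a fixed threshold. Treating groups with fewer than `min_size` objects as unclustered is an assumption made here, based on the remark that objects without neighbors produce no cluster. The function name `cluster` and the parameter `min_size` are hypothetical.

```python
def cluster(d, threshold, min_size=2):
    """Density-ordered neighborhood expansion on a distance-density matrix d."""
    n = len(d)
    dens = [sum(row) for row in d]          # step I: density of every object
    current = set(range(n))
    clusters = []
    while current:
        # Step II: start from the remaining object with the highest density.
        start = max(current, key=lambda i: dens[i])
        members = {start}
        frontier = [start]
        # Step III: expand through neighbors until no new neighbor is found.
        while frontier:
            i = frontier.pop()
            for j in current:
                if j not in members and d[i][j] > threshold:
                    members.add(j)
                    frontier.append(j)
        # Step IV: remove the processed objects and continue.
        current -= members
        if len(members) >= min_size:        # assumption: singletons are no clusters
            clusters.append(sorted(members))
    return clusters

# Objects 0 and 2 are only connected through object 1; objects 3 and 4 form a pair.
d = [
    [1.0, 0.9, 0.6, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.6, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
print(cluster(d, 0.7))
print(cluster(d, 0.95))
```

With threshold 0.7 the expansion links objects 0 and 2 through object 1 although they are not direct neighbors. With threshold 0.95 no object has a neighbor, so no cluster is returned.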

ASCDD can find clusters of arbitrary (convex or concave) shape by extending the neighborhood. For example, a cluster with a concave form is found as follows: ASCDD finds the object with the highest density in the cluster as the center object. Then the process of searching for and adding new neighbors to this cluster connects the objects together until the cluster reaches its original concave shape. The object with the highest density in a cluster is chosen as the “center” object. However, this “center” object is possibly not the geometric center of the cluster. Figure 4 shows an example of two-dimensional clustered objects (marked with different colors) and the corresponding density values of the objects. In this example, some center objects are at the edges of the clusters. The objects in one cluster are all extensions of the neighborhoods of its center object.

Figure 4: Two-dimensional clustered objects & Three-dimensional view of objects density.

The clusters are detected one by one in the order of the density values of their center objects (from the highest density to the lowest density), which does not depend on the input order of the objects. Therefore it is not necessary to estimate the number of clusters in ASCDD.

The time complexity of ASCDD depends on the number of objects $|O|$, the number of dimensions $|A|$ and the number of subspace candidates $|SCS|$. The run-time of the density calculation is $O(|O|^2)$, and the run-time of searching subspaces depends on the subspace candidates and can be between $O(|SCS|)$ and $O(2^{|A|})$.

4 EMPIRICAL EXPERIMENTS

A set of experiments was performed to observe the ef-

fectiveness and efﬁciency of ASCDD, particularly, fo-

cusing on the accuracy and run-time of clustering for

large quantities of data on high-dimensional spaces

and the ability for searching subspaces. All experi-

ments were carried out on a PC with 800MHz dual-

core processor, 4GB RAM, Linux operating system

and Java environment.

4.1 Synthetic Data

Firstly, we use synthetic data as experimental data

in order to make the experiment controllable and to

measure the accuracy easily. The data sets consist

of 10000 objects and 100 dimensions. 20 simulated

clusters are hidden in 10 different subspaces. The

clusters have different forms, e.g. convex and concave forms. The subspaces without clusters are filled with random objects.

We compare the results of ASCDD with different settings of the parameter DDT. As discussed above, DDT depends on q, which is defined in Equation 4, so the problem of determining DDT is transformed into choosing a q ∈ (0, 1). The two extreme situations q = 0 and q = 1 are excluded, because they lead to no clustered objects and to all objects belonging to one cluster, respectively. When q is close to 1, almost all cluster centers are taken into account by ASCDD, and the clustering result is more complete than the results with a small q. When q approaches 0, some small clusters disappear and the big clusters shrink to small ones. However, the computation time is reduced with a small q. Generally speaking, varying q between 1 and 0 trades off the level of detail of the clusters against run-time. Figure 5 presents the run-time with four arbitrarily chosen q values. It is worth mentioning that the clustering results do not change much within a small range of q. In order to acquire complete clustering results, we choose q = 0.96 in the following experiments, where 0.96 is just a discretionary choice close to 1. Nevertheless, q can also be another value in (0.95, 0.98), because the clustering results are almost equal.

Figure 5: Run-time with different q. (Axes: number of objects; time in seconds. Series: q=0.99, q=0.6, q=0.4, q=0.1.)

Since ENCLUS is one of the best-known subspace clustering methods that apply entropy, we compare ASCDD with ENCLUS in the next experiment with regard to potential subspaces and clustering results. ASCDD starts searching with the subspace with the lowest entropy and expands the subspace into higher dimensions by calculating and comparing the entropy values. Finally, all expected subspaces are obtained correctly. We apply ENCLUS with the number of units set to 285 in order to keep 35 objects on average in each cell, as the authors suggest. ENCLUS uses “entropy < ω” and “interest gain > ε” as the thresholds for detecting subspace candidates. However, choosing proper values for these two parameters is a challenge. We choose ω = 8.5 and ε = 1 as described in the article. ENCLUS does not find all the subspace candidates that ASCDD finds: it finds a part of the expected subspaces and many non-expected subspaces, where no clusters exist. Even when the two parameters of ENCLUS are varied in different combinations, the resulting subspace candidates are still mixed with non-expected subspaces.

Next we compare the clustering results of ASCDD and ENCLUS. In this step ASCDD finds the defined clusters in both convex and concave forms with high precision. ENCLUS uses a grid-based method that first searches for clusters through the grids in one-dimensional subspaces and then combines the clusters in higher-dimensional subspaces to search for more clusters. In this experiment, ENCLUS correctly identifies only some simple convex clusters. Some concave clusters are merged into one cluster and some are split into small clusters. Unlike ENCLUS, which has to search each low-dimensional subspace of a subspace candidate, ASCDD can focus directly on the subspace candidates when searching for clusters.

The efficiency evaluation of ASCDD and ENCLUS is illustrated in Figure 6. This evaluation is based on subsets of the synthetic data set. ASCDD and ENCLUS use the same parameter settings as in the previous experiment.

Figure 6: Run-time compared with ENCLUS. (a) Run-time in seconds over the number of objects; (b) run-time in seconds over the number of dimensions. Series in both panels: ASCDD, ENCLUS.

ASCDD scales very well with increasing dimensionality. As we can see, the run-time of ASCDD increases linearly as the number of dimensions grows. The reason is that ASCDD first searches only for the subspace candidates, and the clustering process is executed directly on the high-dimensional subspace candidates. ASCDD has almost the same run-time for clustering within a subspace, no matter whether it has a high or a low dimension.

Table 1: Results of ASCDD and ENCLUS on “Gas Sensor Array Drift”.

Cluster | ASCDD accuracy | ASCDD subspace | ENCLUS accuracy | ENCLUS subspace
1 | 68% | 76, 113, 17, 4, 79, 70, 14, 68, 121, 57, 15, 6, 7, 53, 118, 12, 54, 62, 127 | 41% | 113, 4, 79, 70, 68, 57, 15, 54, 7, 14, 53, 118, 83, 14, 73
2 | 67% | 15, 6, 78, 49, 7, 12, 55, 63 | 55% | 20, 6, 78, 30, 19, 7, 66, 23, 11, 50, 93
3 | 39% | 47, 24, 107, 111, 88, 97, 99, 105 | 31% | 88, 40, 26, 113, 105, 95, 33, 28, 16
4 | 68% | 44, 108, 39, 47, 24, 103, 111, 88, 97, 99, 105 | 52% | 111, 23, 108, 75, 39, 94, 47, 85
5 | 34% | 112, 56, 120, 122, 98, 16, 35, 106, 43, 80, 36, 108, 24, 107, 88, 97, 99, 105 | 19% | 112, 43, 106, 16, 80, 24, 74, 87, 86, 98, 19, 108, 58
6 | 88% | 65, 9, 76, 4, 79, 70, 14, 68, 15, 6, 78, 7, 12, 39, 47, 103 | 59% | 65, 83, 4, 68, 70, 6, 81, 14, 7, 103, 79

With an increasing number of objects, the run-time of ASCDD grows quadratically and is longer than that of ENCLUS in this situation. The reason is that the calculation of the density of one object in ASCDD involves all objects, whereas ENCLUS works similarly to CLIQUE, which partitions the objects into grid cells and is not sensitive to the number of objects. Although the scalability of ASCDD with respect to the number of objects is not linear, this complexity ensures a complete clustering result. Of course, the run-time with regard to the number of objects also depends on the parameter setting, because choosing a DDT that yields many objects in the clustering result takes more time than a DDT that involves fewer objects.

ENCLUS finds almost the same low-dimensional subspace candidates, but ENCLUS is slower than ASCDD for high-dimensional subspaces, because ENCLUS clusters only from low- to high-dimensional subspaces, which takes more time than clustering directly in high-dimensional subspaces as ASCDD does.

4.2 Real Data

The data set “Gas Sensor Array Drift” was obtained from the UC Irvine Machine Learning Repository (Frank and Asuncion, 2010). It contains the measurements of 16 chemical sensors utilized in simulations for drift compensation in discriminating six gas types (Ammonia, Acetaldehyde, Acetone, Ethylene, Ethanol, and Toluene) at various concentrations. The data is prepared for the chemo-sensor research and artificial intelligence communities to develop strategies to cope with sensor/concept drift. The data set contains 13910 measurements in 128 dimensions with six clusters (the six gas types). We applied ASCDD and ENCLUS to the data without cluster labels and then compared the results with the cluster labels. The clusters are located in different subspaces, which means that particular subspaces are specialized in detecting particular gas types. Table 1 shows some examples of the clustering result and the accuracies for the data of months one and two. The accuracy of a cluster is defined as the ratio of the number of correctly clustered objects to the number of objects in that cluster.
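
The accuracy definition can be written as a short sketch. It assumes that an object counts as correctly clustered when it carries the most frequent true label (gas type) of its cluster; this reading, the function name cluster_accuracy, and the example labels are assumptions made for illustration and are not values from Table 1.

```python
from collections import Counter


def cluster_accuracy(true_labels):
    """Return (majority_label, accuracy) for one detected cluster.

    true_labels holds the ground-truth label of every object in the cluster.
    Assumption: "correctly clustered" means carrying the majority label.
    """
    if not true_labels:
        raise ValueError("cluster must contain at least one object")
    label, count = Counter(true_labels).most_common(1)[0]
    return label, count / len(true_labels)


# Hypothetical cluster: nine Ethanol measurements and one Acetone measurement.
example = ["Ethanol"] * 9 + ["Acetone"]
print(cluster_accuracy(example))
```

For the hypothetical cluster above, nine of ten objects share the majority label, so the accuracy is 0.9.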

This clustering process takes 1440 seconds with ASCDD and 4410 seconds with ENCLUS. Compared with ENCLUS, ASCDD is more efficient on high-dimensional subspaces and is able to detect the clusters directly in these subspaces with higher precision.

5 CONCLUSIONS

Unlike traditional clustering methods, ASCDD is suitable for complex data of arbitrary form. It provides useful distribution information and can be applied easily, since clustering requires just one simple parameter, the DDT. The clusters are detected according to their densities, so the result does not depend on the input order. The results of ASCDD in our experiments show high accuracy.

In this paper we improve the methods of subspace detection and parameter determination in the subspace clustering method ASCDD for high-dimensional data sets. By applying entropy, ASCDD is able to detect high-dimensional subspace candidates easily, where a subspace with low entropy is considered a potential subspace. We develop a way to detect subspace candidates that reach their maximum dimensionality. ASCDD can then find clusters directly within the located subspace candidates. Since the clustering result and its quality depend on the choice of the parameter DDT, we investigate the DDT and introduce a method for choosing it. The DDT can be chosen to favor either complete clustering results or a short run-time. One direction of our future work is reducing the calculation time for very large numbers of objects.

REFERENCES

Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., and

Park, J. S. (1999). Fast algorithms for projected clus-

SubspaceClusteringwithDistance-densityFunctionandEntropyinHigh-dimensionalData

tering. In Proceedings of the 1999 ACM SIGMOD in-

ternational conference on Management of data, SIG-

MOD ’99, pages 61–72. ACM.

Aggarwal, C. C. and Yu, P. S. (2000). Finding general-

ized projected clusters in high dimensional spaces. In

Proceedings of the 2000 ACM SIGMOD international

conference on Management of data, SIGMOD ’00,

pages 70–81. ACM.

Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P.

(1998). Automatic subspace clustering of high dimen-

sional data for data mining applications. In Proceed-

ings of the 1998 ACM SIGMOD international confer-

ence on Management of data, SIGMOD ’98, pages

94–105. ACM.

Chang, J.-W. and Jin, D.-S. (2002). A new cell-based clus-

tering method for large, high-dimensional data in data

mining applications. In Proceedings of the 2002 ACM

symposium on Applied computing, SAC ’02, pages

503–507. ACM.

Cheng, C.-H., Fu, A. W., and Zhang, Y. (1999). Entropy-based subspace clustering for mining numerical data. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’99, pages 84–93. ACM.

Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996).

A density-based algorithm for discovering clusters in

large spatial databases with noise. In Proceedings of

the 2nd International Conference on Knowledge Dis-

covery and Data mining, volume 1996, pages 226–

231. AAAI Press.

Frank, A. and Asuncion, A. (2010). UCI machine learning

repository.

Friedman, J. H. and Meulman, J. J. (2004). Clustering ob-

jects on subsets of attributes. Journal of the Royal Sta-

tistical Society: Series B (Statistical Methodology),

pages 815–849.

Goil, S., Nagesh, H., and Choudhary, A. (1999). Mafia: Efficient and scalable subspace clustering for very large data sets. Technical Report CPDC-TR-9906-010, Northwestern University.

Hinneburg, A. and Gabriel, H.-H. (2007). Denclue 2.0: fast

clustering based on kernel density estimation. In Pro-

ceedings of the 7th international conference on Intel-

ligent data analysis, IDA’07, pages 70–80. Springer-

Verlag.

Hinneburg, A., Hinneburg, E., and Keim, D. A. (1998). An efficient approach to clustering in large multimedia databases with noise. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, pages 58–65. AAAI Press.

Kriegel, H.-P., Kröger, P., and Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data, 3:1:1–1:58.

Kröger, P., Kriegel, H.-P., and Kailing, K. (2004). Density-connected subspace clustering for high-dimensional data. In Proc. SIAM Int. Conf. on Data Mining (SDM’04), pages 246–257.

MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press.

Müller, E., Günnemann, S., Assent, I., and Seidl, T. (2009). Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment, 2(1):1270–1281.

Parsons, L., Haque, E., and Liu, H. (2004). Subspace clus-

tering for high dimensional data: A review. SIGKDD

Explor. Newsl., 6:90–105.

Procopiuc, C. M., Jones, M., Agarwal, P. K., and Murali,

T. M. (2002). A monte carlo algorithm for fast pro-

jective clustering. In Proceedings of the 2002 ACM

SIGMOD international conference on Management of

data, SIGMOD ’02, pages 418–427. ACM.

Sim, K., Gopalkrishnan, V., Zimek, A., and Cong, G.

(2012). A survey on enhanced subspace clustering.

Data Mining and Knowledge Discovery, pages 1–66.

Woo, K.-G., Lee, J.-H., Kim, M.-H., and Lee, Y.-J. (2004).

Findit: a fast and intelligent subspace clustering algo-

rithm using dimension voting. Information and Soft-

ware Technology, 46(4):255–271.

Zhao, J. and Conrad, S. (2012). Automatic subspace clustering with density function. In International Conference on Data Technologies and Applications, DATA 2012, pages 63–69. SciTePress Digital Library.

DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications