was used to determine the cluster dimensions and locations, and a 2100-point testing set (disjoint from the training set) was used for classification. In this experiment, 74.8% of the points were classified correctly. This demonstrates that the algorithm is able to determine the cluster properties from a small set of examples and apply them to previously unseen points. The performance is lower than in the unsupervised experiments owing to the smaller set of points used to determine the clusters.
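For concreteness, the classification step can be illustrated with a minimal sketch, assuming each discovered cluster is summarized by its relevant dimensions and a per-dimension bounding interval; the function names and the cluster representation below are illustrative assumptions, not taken from the paper.

import numpy as np

def assign_point(point, clusters, default=-1):
    """Assign a test point to the first cluster whose projective bounds it satisfies.

    Each cluster here is assumed to be a dict with:
      'dims'     - indices of the cluster's relevant dimensions
      'lo', 'hi' - per-dimension bounds on those dimensions
    Returns the cluster index, or `default` if the point fits no cluster (outlier).
    """
    for idx, c in enumerate(clusters):
        proj = point[c['dims']]
        if np.all(proj >= c['lo']) and np.all(proj <= c['hi']):
            return idx
    return default

def classification_accuracy(test_points, test_labels, clusters):
    """Fraction of test points assigned to the cluster matching their true label."""
    predictions = [assign_point(p, clusters) for p in test_points]
    return np.mean(np.array(predictions) == np.array(test_labels))

In this sketch a test point is accepted by a cluster when its projection onto the cluster's relevant dimensions falls inside the learned bounds; points accepted by no cluster are treated as outliers.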
6 CONCLUSIONS
We have presented a new algorithm called SEPC for locating projective clusters using a Monte Carlo method. The algorithm is straightforward to implement and has low complexity (linear in the number of data points and low-order polynomial in the number of dimensions). In addition, the algorithm does not require the number of clusters or the number of cluster dimensions as input and makes no assumptions about the distribution of cluster points, other than that the clusters have bounded diameter. The algorithm is widely applicable to projective clustering problems and can find both disjoint and non-disjoint clusters. The performance of the SEPC algorithm surpasses previously reported results on both synthetic and real data.
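To make the structure of such a Monte Carlo trial concrete, the following is a minimal sketch assuming a per-dimension width bound and a small sampled discriminating set; the function name, parameters (width, disc_size), and the exact box construction are illustrative assumptions and do not reproduce the paper's algorithm or its quality function.

import numpy as np

def sepc_style_trial(data, width, disc_size, rng):
    """One illustrative Monte Carlo trial for projective clustering.

    Samples a seed point and a small discriminating set, keeps the dimensions
    on which all discriminating points stay within `width` of the seed, and
    gathers the points that fall inside the resulting hyper-box.
    """
    n, d = data.shape
    seed = data[rng.integers(n)]
    disc = data[rng.choice(n, size=disc_size, replace=False)]

    # Relevant dimensions: the discriminating set lies within `width` of the seed.
    dims = np.where(np.all(np.abs(disc - seed) <= width, axis=0))[0]

    # Cluster support: points inside the width-bounded box on those dimensions.
    in_box = np.all(np.abs(data[:, dims] - seed[dims]) <= width, axis=1)
    return dims, np.where(in_box)[0]

Each trial is linear in the number of data points, consistent with the complexity noted above; a full algorithm would repeat such trials and retain the best candidate cluster under a quality criterion balancing support against the number of relevant dimensions.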