by developing a novel algorithm called CBK-modes (Correlation-based K-modes), which extends the basic k-modes algorithm by adopting the attribute weighting approach proposed by (Carbonera and Abel, 2014a). The source code of the CBK-modes algorithm is available at http://www.inf.ufrgs.br/~jlcarbonera/?page_id=87. The performance of this algorithm was compared against that of five algorithms available in the literature, considering five real data sets. The results show that the performance of CBK-modes is comparable to that of the state-of-the-art algorithms considered in the evaluation and, in general, better than that of the other algorithms. The experimental analysis also suggests that the correlation-based approach for attribute weighting is a sufficient condition for improving the performance of clustering algorithms.
In Section 2 we discuss some related works.
Section 3 presents the formal notation that will be
used throughout the paper. Section 4 presents the
correlation-based attribute weighting proposed by
(Carbonera and Abel, 2014a). Section 5 presents the
CBK-modes algorithm. Experimental results are pre-
sented in Section 6. Finally, Section 7 presents our
concluding remarks.
2 RELATED WORKS
In the last few years, several algorithms have been
proposed for dealing with categorical data clustering.
In this work, our focus is on the so-called soft subspace clustering approaches to categorical data clustering, such as (Chan et al., 2004; Bai et al., 2011;
Cao et al., 2013; Jing et al., 2007; Carbonera and
Abel, 2014b).
According to (Jing et al., 2007), in subspace clus-
tering, objects are grouped into clusters considering
subsets of the original set of dimensions (or attributes)
of the data set. Soft subspace clustering can be viewed
as a specific case of subspace clustering. Approaches of this type estimate the contribution of each attribute to each specific cluster. The contribution of a dimension is measured by a weight that is estimated and assigned to that dimension during the clustering process. Thus, the resulting clustering is performed in the full-dimensional, though skewed, data space.
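To make this idea concrete, the sketch below (our own illustration, not taken from any of the cited works) shows how a cluster-specific weight vector can skew a simple matching dissimilarity between a categorical object and a cluster mode; the function name, the data layout, and the exponent beta are illustrative assumptions.

```python
from typing import Sequence

def weighted_matching_dissimilarity(x: Sequence[str],
                                    mode: Sequence[str],
                                    weights: Sequence[float],
                                    beta: float = 2.0) -> float:
    """Dissimilarity between a categorical object x and a cluster mode.

    Each attribute contributes 1 when its value differs from the mode's
    value and 0 otherwise; the contribution is scaled by the cluster-specific
    attribute weight raised to the exponent beta, as is common in soft
    subspace clustering formulations (illustrative sketch only).
    """
    return sum((w ** beta) * (0.0 if xi == mi else 1.0)
               for xi, mi, w in zip(x, mode, weights))

# Illustrative use: the second attribute dominates the comparison because
# it carries most of the weight for this (hypothetical) cluster.
x = ("red", "round", "small")
mode = ("red", "square", "small")
weights = (0.1, 0.8, 0.1)
print(weighted_matching_dissimilarity(x, mode, weights))  # ~0.64 (only the mismatching attribute contributes)
```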
The approach proposed in (Chan et al., 2004), for
example, computes each weight according to the aver-
age distance of data objects from the mode of a clus-
ter. Thus, a larger weight is assigned to an attribute
that has a smaller sum of the within-cluster distances and a smaller weight is assigned to an attribute that has a larger sum of the within-cluster distances.
The approach proposed in (Bai et al., 2011) assumes that the contribution of a given attribute to a given cluster is proportional to the frequency of the categorical value of the mode of the cluster for that attribute.
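As a rough illustration of the two weighting heuristics just described, the sketch below derives weights that are inversely related to each attribute's within-cluster mismatch count (in the spirit of (Chan et al., 2004)) or proportional to the within-cluster frequency of the mode's value (in the spirit of (Bai et al., 2011)); the normalization and smoothing choices are our own simplifying assumptions, not the exact formulations of the cited papers.

```python
from collections import Counter
from typing import List, Sequence

def distance_based_weights(cluster: List[Sequence[str]],
                           mode: Sequence[str]) -> List[float]:
    """Sketch of the Chan-et-al.-style idea: attributes with a smaller sum of
    within-cluster mismatches against the mode receive a larger weight."""
    m = len(mode)
    sums = [sum(1 for obj in cluster if obj[j] != mode[j]) for j in range(m)]
    inv = [1.0 / (s + 1.0) for s in sums]          # +1 avoids division by zero
    total = sum(inv)
    return [v / total for v in inv]                # normalize to sum to 1

def frequency_based_weights(cluster: List[Sequence[str]],
                            mode: Sequence[str]) -> List[float]:
    """Sketch of the Bai-et-al.-style idea: the weight of an attribute grows
    with the within-cluster frequency of the mode's value for that attribute."""
    n = len(cluster)
    m = len(mode)
    freqs = [Counter(obj[j] for obj in cluster)[mode[j]] / n for j in range(m)]
    total = sum(freqs)
    return [f / total for f in freqs]

cluster = [("red", "round", "small"),
           ("red", "square", "small"),
           ("blue", "round", "small")]
mode = ("red", "round", "small")
print(distance_based_weights(cluster, mode))   # third attribute (no mismatches) gets the largest weight
print(frequency_based_weights(cluster, mode))  # third attribute (mode value in every object) gets the largest weight
```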
In (Cao et al., 2013), the authors apply the notion of complement entropy to develop an approach for attribute weighting. The complement entropy reflects the uncertainty of an object set with respect to an attribute (or attribute set), in such a way that the bigger the complement entropy value is, the higher the uncertainty. In (Jing et al., 2007), the authors propose an approach for attribute weighting based on the notion of entropy, which is a measure of the uncertainty of a given random variable. This approach minimizes the within-cluster dispersion and maximizes the negative weight entropy in order to stimulate more dimensions to contribute to the identification of a cluster. In (Carbonera and Abel, 2014b), the authors propose to measure the relevance of categorical attributes in the clustering process through the entropy-based relevance index (ERI). The ERI of some attribute $a_h$ (given by $ERI(a_h)$) is inversely proportional to the average of the uncertainty that is projected to the attribute $a_h$ by the modes of all attributes in the dataset.
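The entropy-based heuristics discussed above can be sketched in a similar way. The snippet below uses one common definition of complement entropy (a Gini-like sum of p(1 - p) over value frequencies) and an EWKM-style exponential weighting controlled by a parameter gamma; both are hedged illustrations of the general ideas rather than verbatim reproductions of the formulations in (Cao et al., 2013) and (Jing et al., 2007).

```python
import math
from collections import Counter
from typing import List, Sequence

def complement_entropy(cluster: List[Sequence[str]], attr_index: int) -> float:
    """Complement entropy of one attribute over a set of objects (sketch):
    sums p * (1 - p) over the attribute's value frequencies, so more evenly
    spread values yield a higher uncertainty."""
    n = len(cluster)
    counts = Counter(obj[attr_index] for obj in cluster)
    return sum((c / n) * (1.0 - c / n) for c in counts.values())

def entropy_style_weights(dispersions: Sequence[float],
                          gamma: float = 1.0) -> List[float]:
    """EWKM-style weighting (sketch): attributes with a smaller within-cluster
    dispersion receive exponentially larger weights, normalized so that the
    weights of one cluster sum to 1; gamma is an assumed user-set constant."""
    exps = [math.exp(-d / gamma) for d in dispersions]
    total = sum(exps)
    return [e / total for e in exps]

cluster = [("red", "round"), ("red", "square"), ("blue", "round")]
print(complement_entropy(cluster, 0))          # ~0.444 for the first attribute
print(entropy_style_weights([0.2, 1.5, 0.7]))  # largest weight on the least dispersed attribute
```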
3 NOTATION
In this section, we will introduce the notation, adopted
from (Carbonera and Abel, 2014a), which will be
used throughout the paper:
• $U = \{x_1, x_2, ..., x_n\}$ is a non-empty set of $n$ data objects, called a universe.
• $A = \{a_1, a_2, ..., a_m\}$ is a non-empty set of $m$ categorical attributes.
• $dom(a_i) = \{a_i^{(1)}, a_i^{(2)}, ..., a_i^{(l_i)}\}$ describes the domain of values of the attribute $a_i \in A$, where $l_i$ is the number of categorical values that $a_i$ can assume in $U$. Notice that $dom(a_i)$ is finite and unordered, e.g., for any $1 \leq p \leq q \leq l_i$, either $a_i^{(p)} = a_i^{(q)}$ or $a_i^{(p)} \neq a_i^{(q)}$.
• $V$ is the union of attribute domains, i.e., $V = \bigcup_{j=1}^{m} dom(a_j)$.
• $C = \{c_1, c_2, ..., c_k\}$ is a set of $k$ disjoint partitions of $U$, such that $U = \bigcup_{i=1}^{k} c_i$.
• Each $x_i \in U$ is an $m$-tuple, such that $x_i = (x_{i1}, x_{i2}, ..., x_{im})$, where $x_{iq} \in dom(a_q)$ for $1 \leq i \leq n$ and $1 \leq q \leq m$.
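To ground this notation, the following minimal sketch shows one possible in-memory representation of such a categorical data set; the concrete attribute names and values are purely illustrative.

```python
from typing import Dict, List, Set, Tuple

# Universe U: n data objects, each an m-tuple of categorical values.
U: List[Tuple[str, ...]] = [
    ("red", "round", "small"),
    ("red", "square", "small"),
    ("blue", "round", "large"),
]

# Attribute set A: m categorical attributes.
A: List[str] = ["color", "shape", "size"]

# dom(a_i): the finite, unordered set of values that attribute a_i assumes in U.
dom: Dict[str, Set[str]] = {a: {x[i] for x in U} for i, a in enumerate(A)}

# V: the union of all attribute domains.
V: Set[str] = set().union(*dom.values())

# C: k disjoint clusters whose union is U (an arbitrary illustrative
# 2-partition, indexed by object position in U).
C: List[List[int]] = [[0, 1], [2]]

assert sorted(i for c in C for i in c) == list(range(len(U)))  # C covers U
```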