by developing a novel algorithm called CBK-modes (Correlation-based K-modes), which extends the basic k-modes algorithm by adopting the attribute weighting approach proposed by (Carbonera and Abel, 2014a). The source code of the CBK-modes algorithm is available at http://www.inf.ufrgs.br/~jlcarbonera/?page_id=87. The performance of this algorithm was compared against that of five algorithms available in the literature, considering five real data sets. The results show that the performance of CBK-modes is comparable to that of the state-of-the-art algorithms considered in the evaluation and, in general, better than that of the other algorithms. The experimental analysis also suggests that the correlation-based approach for attribute weighting is a sufficient condition for improving the performance of clustering algorithms.
In Section 2 we discuss some related works.
Section 3 presents the formal notation that will be
used throughout the paper. Section 4 presents the
correlation-based attribute weighting proposed by
(Carbonera and Abel, 2014a). Section 5 presents the
CBK-modes algorithm. Experimental results are pre-
sented in Section 6. Finally, Section 7 presents our
concluding remarks.
2 RELATED WORKS
In the last few years, several algorithms have been
proposed for dealing with categorical data clustering.
In this work, our focus is on the so-called soft subspace clustering approaches to categorical data clustering, such as (Chan et al., 2004; Bai et al., 2011;
Cao et al., 2013; Jing et al., 2007; Carbonera and
Abel, 2014b).
According to (Jing et al., 2007), in subspace clus-
tering, objects are grouped into clusters considering
subsets of the original set of dimensions (or attributes)
of the data set. Soft subspace clustering can be viewed
as a specific case of subspace clustering. Approaches of this type estimate the contribution of each attribute to each specific cluster. The contribution of a dimension is measured by a weight that is estimated and assigned to that dimension during the clustering process. Thus, the resulting clustering is performed in the full-dimensional, though skewed, data space.
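To make this idea concrete, the sketch below (our own illustration, not taken from any of the cited works) shows how a cluster-specific weight vector can skew a simple matching dissimilarity between a categorical object and a cluster mode; the function name, the data layout, and the exponent beta are illustrative assumptions.

```python
from typing import Sequence

def weighted_matching_dissimilarity(x: Sequence[str],
                                    mode: Sequence[str],
                                    weights: Sequence[float],
                                    beta: float = 2.0) -> float:
    """Dissimilarity between a categorical object x and a cluster mode.

    Each attribute contributes 1 when its value differs from the mode's
    value and 0 otherwise; the contribution is scaled by the cluster-specific
    attribute weight raised to the exponent beta, as is common in soft
    subspace clustering formulations (illustrative sketch only).
    """
    return sum((w ** beta) * (0.0 if xi == mi else 1.0)
               for xi, mi, w in zip(x, mode, weights))

# Illustrative use: the second attribute dominates the comparison because
# it carries most of the weight for this (hypothetical) cluster.
x = ("red", "round", "small")
mode = ("red", "square", "small")
weights = (0.1, 0.8, 0.1)
print(weighted_matching_dissimilarity(x, mode, weights))  # ~0.64 (only the mismatching attribute contributes)
```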
The approach proposed in (Chan et al., 2004), for
example, computes each weight according to the aver-
age distance of data objects from the mode of a clus-
ter. Thus, a larger weight is assigned to an attribute
that has a smaller sum of the within-cluster distances and a smaller weight is assigned to an attribute that has a larger sum of the within-cluster distances.
The approach proposed in (Bai et al., 2011) assumes that the contribution of a given attribute to a given cluster is proportional to the frequency of the categorical value of the mode of the cluster for that attribute.
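As a rough illustration of the two weighting heuristics just described, the sketch below derives weights that are inversely related to each attribute's within-cluster mismatch count (in the spirit of (Chan et al., 2004)) or proportional to the within-cluster frequency of the mode's value (in the spirit of (Bai et al., 2011)); the normalization and smoothing choices are our own simplifying assumptions, not the exact formulations of the cited papers.

```python
from collections import Counter
from typing import List, Sequence

def distance_based_weights(cluster: List[Sequence[str]],
                           mode: Sequence[str]) -> List[float]:
    """Sketch of the Chan-et-al.-style idea: attributes with a smaller sum of
    within-cluster mismatches against the mode receive a larger weight."""
    m = len(mode)
    sums = [sum(1 for obj in cluster if obj[j] != mode[j]) for j in range(m)]
    inv = [1.0 / (s + 1.0) for s in sums]          # +1 avoids division by zero
    total = sum(inv)
    return [v / total for v in inv]                # normalize to sum to 1

def frequency_based_weights(cluster: List[Sequence[str]],
                            mode: Sequence[str]) -> List[float]:
    """Sketch of the Bai-et-al.-style idea: the weight of an attribute grows
    with the within-cluster frequency of the mode's value for that attribute."""
    n = len(cluster)
    m = len(mode)
    freqs = [Counter(obj[j] for obj in cluster)[mode[j]] / n for j in range(m)]
    total = sum(freqs)
    return [f / total for f in freqs]

cluster = [("red", "round", "small"),
           ("red", "square", "small"),
           ("blue", "round", "small")]
mode = ("red", "round", "small")
print(distance_based_weights(cluster, mode))   # third attribute (no mismatches) gets the largest weight
print(frequency_based_weights(cluster, mode))  # third attribute (mode value in every object) gets the largest weight
```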
In (Cao et al., 2013), the authors apply the notion of complement entropy to develop an approach for attribute weighting. The complement entropy reflects the uncertainty of an object set with respect to an attribute (or attribute set), in such a way that the bigger the complement entropy value is, the higher the uncertainty. In (Jing et al., 2007), the authors propose an approach for attribute weighting based on the notion of entropy, which is a measure of the uncertainty of a given random variable. This approach minimizes the within-cluster dispersion and maximizes the negative weight entropy in order to stimulate more dimensions to contribute to the identification of a cluster. In (Carbonera and Abel, 2014b), the authors propose to measure the relevance of categorical attributes in the clustering process through the entropy-based relevance index (ERI). The ERI of some attribute $a_h$ (given by $ERI(a_h)$) is inversely proportional to the average of the uncertainty that is projected to the attribute $a_h$ by the modes of all attributes in the dataset.
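The entropy-based heuristics discussed above can be sketched in a similar way. The snippet below uses one common definition of complement entropy (a Gini-like sum of p(1 - p) over value frequencies) and an EWKM-style exponential weighting controlled by a parameter gamma; both are hedged illustrations of the general ideas rather than verbatim reproductions of the formulations in (Cao et al., 2013) and (Jing et al., 2007).

```python
import math
from collections import Counter
from typing import List, Sequence

def complement_entropy(cluster: List[Sequence[str]], attr_index: int) -> float:
    """Complement entropy of one attribute over a set of objects (sketch):
    sums p * (1 - p) over the attribute's value frequencies, so more evenly
    spread values yield a higher uncertainty."""
    n = len(cluster)
    counts = Counter(obj[attr_index] for obj in cluster)
    return sum((c / n) * (1.0 - c / n) for c in counts.values())

def entropy_style_weights(dispersions: Sequence[float],
                          gamma: float = 1.0) -> List[float]:
    """EWKM-style weighting (sketch): attributes with a smaller within-cluster
    dispersion receive exponentially larger weights, normalized so that the
    weights of one cluster sum to 1; gamma is an assumed user-set constant."""
    exps = [math.exp(-d / gamma) for d in dispersions]
    total = sum(exps)
    return [e / total for e in exps]

cluster = [("red", "round"), ("red", "square"), ("blue", "round")]
print(complement_entropy(cluster, 0))          # ~0.444 for the first attribute
print(entropy_style_weights([0.2, 1.5, 0.7]))  # largest weight on the least dispersed attribute
```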
3 NOTATION
In this section, we will introduce the notation, adopted
from (Carbonera and Abel, 2014a), which will be
used throughout the paper:
• $U = \{x_1, x_2, ..., x_n\}$ is a non-empty set of $n$ data objects, called a universe.
• $A = \{a_1, a_2, ..., a_m\}$ is a non-empty set of $m$ categorical attributes.
• $dom(a_i) = \{a_i^{(1)}, a_i^{(2)}, ..., a_i^{(l_i)}\}$ describes the domain of values of the attribute $a_i \in A$, where $l_i$ is the number of categorical values that $a_i$ can assume in $U$. Notice that $dom(a_i)$ is finite and unordered, e.g., for any $1 \leq p \leq q \leq l_i$, either $a_i^{(p)} = a_i^{(q)}$ or $a_i^{(p)} \neq a_i^{(q)}$.
• $V$ is the union of attribute domains, i.e., $V = \bigcup_{j=1}^{m} dom(a_j)$.
• $C = \{c_1, c_2, ..., c_k\}$ is a set of $k$ disjoint partitions of $U$, such that $U = \bigcup_{i=1}^{k} c_i$.
• Each $x_i \in U$ is an $m$-tuple, such that $x_i = (x_{i1}, x_{i2}, ..., x_{im})$, where $x_{iq} \in dom(a_q)$ for $1 \leq i \leq n$ and $1 \leq q \leq m$.
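To ground this notation, the following minimal sketch shows one possible in-memory representation of such a categorical data set; the concrete attribute names and values are purely illustrative.

```python
from typing import Dict, List, Set, Tuple

# Universe U: n data objects, each an m-tuple of categorical values.
U: List[Tuple[str, ...]] = [
    ("red", "round", "small"),
    ("red", "square", "small"),
    ("blue", "round", "large"),
]

# Attribute set A: m categorical attributes.
A: List[str] = ["color", "shape", "size"]

# dom(a_i): the finite, unordered set of values that attribute a_i assumes in U.
dom: Dict[str, Set[str]] = {a: {x[i] for x in U} for i, a in enumerate(A)}

# V: the union of all attribute domains.
V: Set[str] = set().union(*dom.values())

# C: k disjoint clusters whose union is U (an arbitrary illustrative
# 2-partition, indexed by object position in U).
C: List[List[int]] = [[0, 1], [2]]

assert sorted(i for c in C for i in c) == list(range(len(U)))  # C covers U
```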