CLUSTERING WITH GRANULAR INFORMATION PROCESSING
Urszula Kużelewska
Faculty of Computer Science, Technical University of Bialystok, Wiejska 45a, 15-521 Bialystok, Poland
Keywords: Knowledge discovery, Data mining, Information granulation, Granular computing, Clustering, Hyperboxes.
Abstract: Clustering is a part of the data mining domain. Its task is to detect groups of similar objects on the basis of an established similarity criterion. Granular computing (GrC) includes methods from various areas whose aim is to support humans in better understanding the analyzed problem and the generated results. Granular computing techniques create and/or process data portions called granules, identified with regard to similar description, functionality or behavior. An interesting characteristic of granular computation is that it offers a multi-perspective view of data, depending on the required resolution level. Data granules identified on different levels of resolution form a hierarchical structure expressing relations between the objects of data.
The method proposed in this article creates data granules by clustering data in the form of hyperboxes. The results are compared with clustering of point-type data with regard to complexity, quality and interpretability.
1 INTRODUCTION
Granular computing (GrC) is a new multidisciplinary theory that has developed rapidly in recent years. The most common definitions of GrC (Yao, 2006), (Zadeh, 2001) include a postulate of computing with information granules, that is, collections of objects that exhibit similarity in terms of their properties or functional appearance. Although the term is new, the ideas and concepts of GrC have been used in many fields under different names: information hiding in programming, granularity in artificial intelligence, divide and conquer in theoretical computer science, interval computing, cluster analysis, fuzzy and rough set theories, neutrosophic computing, quotient space theory, belief functions, machine learning, databases, and many others. According to a more universal definition, granular computing may be considered as a label for a new field of multi-disciplinary study, dealing with theories, methodologies, techniques and tools that make use of granules in the process of problem solving (Yao, 2006).
A distinguishing aspect of GrC is its multi-perspective standpoint on data. Multi-perspective means diverse levels of resolution, depending on the saliency of features or the grade of detail of the studied problem. Data granules identified on different levels of resolution form a hierarchical structure expressing relations between the objects of data. Such a structure can be used to facilitate investigation and helps in understanding complex systems. Understanding of the analyzed problem and of the attained results are the main concerns of human-oriented systems. There are also definitions of granular computing that additionally concentrate on systems supporting human beings (Bargiela and Pedrycz, 2002)-(Bargiela and Pedrycz, 2006). According to the definitions mentioned above, such a methodology allows one to ignore irrelevant details and concentrate on the essential features of the systems, making them more understandable. In (Bargiela and Pedrycz, 2001) an approach to data granulation based on approximating data by multi-dimensional hyperboxes is presented. The hyperboxes represent data granules formed from the data points, focusing on maximization of the density of information present in the data. Among other benefits, this improves computational performance. The algorithm is described in the following sections.
Clustering is a part of the data mining domain performing exploratory analysis of data. Its aim is to determine natural clusters, that is, groups of objects more similar to one another than to the objects from other clusters (Jain et al., 1999). The criterion of similarity depends on the clustering algorithm
and the data type. The most common similarity measure is the distance between points, for example, the Euclidean metric for continuous attributes. There is no universal method to assess clustering results. One of the approaches is to measure the quality of partitioning by special indicants (validity indices). The most common measures are: Davies-Bouldin's (DB), Dunn's (Halkidi and Batistakis, 2001), the Silhouette Index (SI) (Kaufman and Rousseeuw, 1990) and CDbw (Halkidi and Vazirgiannis, 2002). Clustering algorithms have wide applications in pattern recognition, image processing, statistical data analysis and knowledge discovery. Following the definitions quoted above, where a granule is determined as a set of objects, one can consider the groups identified by clustering algorithms as data granules. According to that definition, a granule can contain other granules as well as be part of another granule. This makes it possible to employ clustering algorithms to create granulation structures of data.
The article proposes an approach to information granulation by clustering data that are in the form of hyperboxes. Hyperboxes are created in the first step of the algorithm and are then clustered by the SOSIG method (Stepaniuk and Kużelewska, 2008). This solution is effective with regard to time complexity and the interpretability of the generated groups of data. The paper is organized as follows: the next section, Section 2, describes the proposed approach; Section 3 reports the collected data sets as well as the executed experiments. The last section concludes the article.
2 GRANULAR CLUSTERING BY SOSIG
The proposed method of data granulation is composed of two phases. The first phase prepares the data objects in the form of granules (hyperboxes), whereas the second detects similar groups of the granules. The final result of granulation is a three-level structure, where the main granulation is defined by clusters of granules, the following level consists of the granules that are components of the top-level clusters, and the bottom, third level consists of point-type objects.
The method of hyperbox creation is designed to reduce the complexity of the description of real-world systems. The improved generality of the information granules is attained by sacrificing some of the numerical precision of the point data (Bargiela and Pedrycz, 2001). The hyperboxes (referred to as I) are multi-dimensional structures described by a pair of values a and b for every dimension. The point a_i represents the minimal and b_i the maximal value of the granule in the i-th dimension, thus the width of the i-th dimensional edge equals |b_i - a_i|. Creation of hyperboxes is based on maximization of the "information density" of granules (the algorithm is described in detail in (Bargiela and Pedrycz, 2006)). Information density can be expressed by Equation 1:

    σ = card(I) / φ(width(I))    (1)
Maximization of σ is a problem of balancing the shortest possible dimensions against the greatest cardinality of the formed granule I. In the experiments presented in the following section, the cardinality of the granule I is taken as the number of point-type objects belonging to the granule. Belonging means that the values of the point attributes are between or equal to the minimal and maximal values of the hyperbox attributes. For that reason the cardinality has to be re-calculated every time a new, larger granule is formed from a combination of two granules. In the multi-dimensional case, the function of the hyperbox widths given in Equation 2 is applied:

    φ(u) = exp(K · (max_i(u_i) - min_j(u_j))),  i, j = 1, ..., n    (2)
where u = (u_1, u_2, ..., u_n) and u_i = width([a_i, b_i]) for i = 1, ..., n. The points a_i and b_i denote respectively the minimal and maximal value in the i-th dimension. The constant K originally equals 2; in the experiments, however, different values of K, given as a parameter, were used. The computational complexity of this algorithm is O(N^3). However, in every step of the method the size of the data is decreased by 1, which in practice significantly reduces the overall complexity. The data granulation algorithm assumes processing of hyperboxes as well as point-type data. To make this possible, the new data are characterized by 2·n values in comparison with the original data: the first n attributes describe the minimal, whereas the following n describe the maximal values for every dimension. To assure topological "compatibility" of point-type data and hyperboxes, the dimensionality of the data is doubled initially.
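To make the density-driven merging concrete, a minimal Python sketch of this granulation phase is given below. It is an illustration under stated assumptions, not the original implementation: the pair search is a naive greedy scan, the stopping rule (a target number of granules) is an assumption, since the original stopping condition is not repeated here, and the function names (phi, information_density, granulate) are hypothetical.

    import numpy as np

    def phi(widths, K=2.0):
        # phi(u) = exp(K * (max_i(u_i) - min_j(u_j)))  (Equation 2)
        return np.exp(K * (widths.max() - widths.min()))

    def information_density(a, b, points, K=2.0):
        # sigma = card(I) / phi(width(I))  (Equation 1); card(I) counts the
        # point-type objects lying between the box minima a and maxima b,
        # re-calculated for every candidate merge as the text requires.
        card = np.sum(np.all((points >= a) & (points <= b), axis=1))
        return card / phi(b - a, K)

    def granulate(points, target, K=2.0):
        # Points enter as degenerate boxes [x, x]; each step applies the merge
        # that maximizes sigma, so the data size shrinks by one granule per
        # iteration (hence the roughly O(N^3) cost quoted above).
        boxes = [(x.copy(), x.copy()) for x in points]
        while len(boxes) > target:
            best = None
            for i in range(len(boxes)):
                for j in range(i + 1, len(boxes)):
                    a = np.minimum(boxes[i][0], boxes[j][0])
                    b = np.maximum(boxes[i][1], boxes[j][1])
                    s = information_density(a, b, points, K)
                    if best is None or s > best[0]:
                        best = (s, i, j, a, b)
            _, i, j, a, b = best
            boxes[i] = (a, b)
            del boxes[j]
        # Doubled 2n-dimensional representation handed to the clustering phase:
        # first the n minima, then the n maxima.
        return np.array([np.concatenate([a, b]) for a, b in boxes])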
2.1 Self-Organizing System for Information Granulation
The SOSIG (Self-Organizing System for Information Granulation) algorithm is a system designed for detecting granules present in data. The granulation is performed by clustering, and the clusters can be identified on different levels of resolution. The prototype of the algorithm is the method described in (Wierzchoń and Kużelewska, 2006). In SOSIG, however, the granulation property and the ability to cope with different attribute types were introduced. This entailed
fundamental changes in its implementation. In the following description of the algorithm, new terms and symbols more compatible with GrC theory are used, in contrast to the description from (Wierzchoń and Kużelewska, 2006). SOSIG creates a network structure of connected objects (in general points, but in the presented solution hyperboxes) forming clusters. The organization of the system, including the points as well as the connections, is constructed on the basis of relationships between the input data, without any external supervision. The structure points are representatives of the input data, that is, an individual object from the structure stands for one or more objects from the input set. In effect, the number of representatives is much smaller than the input data, without loss of information.
For a convenient and compact notation, let us assume the input data are defined as an information system IS = (U, A) (Pawlak, 1991), where U = {x_1, ..., x_n} is a set of objects and A = {a_1, ..., a_k} is a set of attributes. The result generated by SOSIG is also described by an information system IS' = (Y, A ∪ {a_gr}), where the last attribute a_gr : Y → {1, ..., nc} denotes the label of the generated cluster, card(Y) ≤ card(U) and ∀x ∈ U ∃y ∈ Y (δ(x, y) < NR). The parameter NR in general defines the region of object interactions and is described later. The steps of the main (learning) part of SOSIG are shown in Algorithm 1, whereas the initial phase is separated into Algorithm 2. Further classification of new as well as training points can be performed using the so-created network, as presented in Algorithm 6.
The parameter NR (Neighborhood Radius) appearing in the above descriptions defines the neighborhood of objects from IS'. It directly influences the level of granulation of the input set. The initial value of NR is proportional to the maximum of the nearest neighbor distances in the input set (see Equation 3):

    NR_init = max({min({δ(x_i, x_j) : x_j ∈ U ∧ x_j ≠ x_i}) : x_i ∈ U})    (3)
The following values of NR are calculated from the current state of the network (see Equation 4):

    NR = rg · (Σ_{y_i ∈ Y} min({δ(y_i, y_j) : y_j ∈ Y ∧ y_j ≠ y_i})) / card(Y)    (4)

where rg ∈ [0, 1] is a resolution of granulation parameter. The value of rg is proportional to the level of granulation, that is, the top, first level of the granulated structure is characterized by lower values of rg and the following levels (second, third and so on) are characterized by higher values of the parameter. The rg values are not fixed for a given level of the hierarchy, but related
Algorithm 1: Construction of an information system with a set of representative objects.

Data: IS = (U, A) - an information system, where U = {x_1, ..., x_n} is a set of objects and A = {a_1, ..., a_k} is a set of attributes; {δ_a : a ∈ A} - a set of distance functions of the form δ_a : V_a × V_a → [0, ∞), where V_a is the set of values of attribute a ∈ A, and a global distance function δ : U × U → [0, ∞) defined by δ(x, y) = fusion(δ_{a_1}(a_1(x), a_1(y)), ..., δ_{a_k}(a_k(x), a_k(y))); size_net ∈ {0, 1, ..., card(U)} - initial size of the network; rg ∈ [0, 1] - resolution of granulation.
Result: IS' = (Y, A ∪ {a_gr}) - an information system, where the last attribute a_gr : Y → {1, ..., nc} denotes the label of the generated granule, card(Y) ≤ card(U) and ∀x ∈ U ∃y ∈ Y δ(x, y) < NR.
begin
    [NR_init, Y] ← initialize(U, A, size_net);
    for y_i, y_j ∈ Y, i ≠ j do                      /* form clusters */
        if δ(y_i, y_j) < NR_init then connect(y_i, y_j);
    NR ← NR_init;
    while ¬stopIterations(Y) do
        for y ∈ Y do
            Δ(y) ← (δ(y, x))_{x ∈ U};               /* calculate distances from input data */
            s_l(y) ← NR − min Δ(y);                 /* similarity level of the object */
        delete(U, A, Y);                            /* remove redundant network objects */
        for y_i, y_j ∈ Y, i ≠ j do                  /* reconnect objects */
            if δ(y_i, y_j) < NR then connect(y_i, y_j);
            a_gr(y_i) ← 0; a_gr(y_j) ← 0;
        grLabel ← 1;
        for y_i ∈ Y do                              /* label objects */
            if a_gr(y_i) = 0 then
                a_gr(y_i) ← grLabel;
                for y_j ∈ Y, j ≠ i do
                    if connected(y_i, y_j) then a_gr(y_j) ← grLabel;
                grLabel ← grLabel + 1;
        for y_i ∈ Y do                              /* calculate the nearest neighbor network objects */
            δ_NN(y_i) ← min({δ(y_i, y_j) : y_j ∈ Y ∧ j ≠ i});
        NR ← rg · (Σ_{y ∈ Y} δ_NN(y)) / card(Y);    /* new value of NR */
        if ¬stopIterations(Y) then                  /* test stopping condition */
            joinNotRepresented(U, Y, NR, Δ);
            adjust(Y, U, A, NR);
to the individual data set. However, the value rg = 0.5 appears most often in empirical tests for the most separated and compact (termed "natural") clustering.
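The two radius formulas translate into a few lines of code. The sketch below is illustrative only: it assumes precomputed square matrices of pairwise distances, and the function names are hypothetical.

    import numpy as np

    def nearest_neighbor_distances(dist):
        # Nearest-neighbor distance of every object, given a square matrix of
        # pairwise distances (the diagonal is masked out).
        d = dist.astype(float).copy()
        np.fill_diagonal(d, np.inf)
        return d.min(axis=1)

    def nr_init(dist_U):
        # Equation 3: the greatest nearest-neighbor distance in the input set U.
        return nearest_neighbor_distances(dist_U).max()

    def nr_update(dist_Y, rg):
        # Equation 4: rg times the mean nearest-neighbor distance in the current
        # network Y; lower rg gives a coarser granulation, higher rg a finer one.
        return rg * nearest_neighbor_distances(dist_Y).mean()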
After the initial phase (such as normalization of the data and calculation of the initial value of NR), the iterated steps of the algorithm follow. First, the system objects are assessed.
Algorithm 2: Initial steps of the SOSIG algorithm.

Data: Set of input objects U = {x_1, ..., x_n}; size_net ∈ {0, 1, ..., card(U)} - initial size of the network.
Result: Set of initial network objects Y = {y_1, ..., y_size_net}, where Y ⊆ U; NR_init - initial value of the neighborhood radius threshold.
begin
    δ_maxNN ← 0;
    for x_i ∈ U do
        /* calculate the nearest neighbor distances of the data */
        δ_NN(x_i) ← min({δ(x_i, x_j) : x_j ∈ U ∧ x_j ≠ x_i});
        /* find the greatest value of the nearest neighbor distances */
        if δ_NN(x_i) > δ_maxNN then δ_maxNN ← δ_NN(x_i);
    NR_init ← δ_maxNN;
    /* select the representative objects */
    for netObj ← 1 to size_net do
        i ← rand(1, ..., card(U));
        y_netObj ← x_i;
The measure of their usefulness is a similarity level expressed by Equation 5:

    s_l(y) = NR − min({δ(y, x) : x ∈ U})    (5)

The s_l defines a degree of similarity between the examined representative point y and the most similar point x from the training data. Only training points from the neighborhood of y (defined by NR) are considered. The similarity level corresponds to the distance between the points: closer points are more similar to each other than points located at a farther distance. After calculation of the similarity levels, their values are normalized. In the presented experiments, to calculate the distance between objects (hyperboxes), the distance measure expressed by Equation 6 was used:
    d(I_A, I_B) = (‖a_B − a_A‖ + ‖b_B − b_A‖) / 2    (6)

where ‖a_B − a_A‖ and ‖b_B − b_A‖ denote the sums of the differences of respectively the minimal (a) and maximal (b) values of the granules I_A and I_B in every dimension. The equation was introduced in (Bargiela and Pedrycz, 2001).
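Both formulas are short enough to state in code. The sketch below assumes the doubled 2n-dimensional encoding described earlier (minima first, then maxima) and reads the norm in Equation 6 as a per-dimension sum of absolute differences; that reading, as well as the function names, are assumptions for illustration.

    import numpy as np

    def hyperbox_distance(box_A, box_B, n):
        # Equation 6: d(I_A, I_B) = (||a_B - a_A|| + ||b_B - b_A||) / 2, where
        # the first n entries of a box are the minima a, the last n the maxima b.
        a_A, b_A = box_A[:n], box_A[n:]
        a_B, b_B = box_B[:n], box_B[n:]
        return (np.abs(a_B - a_A).sum() + np.abs(b_B - b_A).sum()) / 2.0

    def similarity_level(y, U, NR, delta):
        # Equation 5: s_l(y) = NR - min over x in U of delta(y, x); a network
        # object close to some input object obtains a high similarity level.
        return NR - min(delta(y, x) for x in U)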
To control the size of the network there is a removal step, presented in Algorithm 3. Useless objects are removed; this affects redundant objects from
Algorithm 3: Detailed steps of the SOSIG function delete.

Data: Set of input objects U, network set Y, set of attributes A = {a_1, ..., a_k}, NR - threshold of the neighborhood radius.
Result: Y \ C, where C is a set of redundant network objects.
begin
    C ← ∅;                               /* initially the set is empty */
    for x ∈ U do
        for y ∈ Y do
            /* add to the set the objects representing the input element x */
            if δ(y, x) < NR then C ← C ∪ {y};
        δ_NN(x, y_NN) ← min({δ(x, y) : y ∈ Y});
        C ← C \ {y_NN};                  /* remove from the set the best object representing x */
    Y ← Y \ C;                           /* remove from the network the objects from the set C */
the network of representatives. Points that have the same input object (from U) in their neighborhood are determined as redundant. The best points stay in the network, as do those that are not redundant for other input data. This process controls the size of the network and prevents the formation of excessively dense clusters. It results in a compression phenomenon.
The remaining objects in the network are reconnected and labeled. A granule is determined by the edges between the objects in the structure. Components of the same granule (group) have an equal label. Afterwards, a new value of the NR parameter is calculated (see Equation 4). When the stopping criterion is met, the algorithm stops after connection reconstruction and labeling. Otherwise, the following steps are carried out. As a stopping criterion, a stable state of the network is considered, that is, a state of small fluctuations of the network size and of the value of NR.
The last step is to apply a procedure of adjusting all network objects (see Algorithm 5). In this step, candidate objects that are copies of the original ones are created. The number of copies is not large; it is fixed in the algorithm at 5. The values of the attributes of the candidate objects are slightly modified (depending on the similarity level of the objects and the type of the attributes). This procedure allows the network objects in the attained solution to be adjusted to the examined problem.
In the algorithm there is also a step of introducing an object from the input data into the system. It concerns an object which has not been identified yet. This operation (presented in Algorithm 4) avoids leaving
uncovered (not represented by network objects) areas in the training set.
Further classification of new as well as training points can be performed using the so-created structure (see Algorithm 6). To assign a label to the considered object, it is necessary to determine the neighboring objects from the network structure. The neighborhood of the point is defined by the final value of NR (the last calculated value) of SOSIG. The predominant value of the labels is given to the examined object.
It must be underlined that the SOSIG algorithm does not require the number of clusters to be given. Partitioning is generated automatically, which eliminates the inconvenient step of assessing and selecting the best result from a set of potential clusterings.
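The voting rule can be stated in a few lines; the following sketch assumes a list of network objects Y with parallel labels a_gr and a distance callable delta, all hypothetical names chosen for illustration.

    from collections import Counter

    def classify(x, Y, a_gr, NR, delta):
        # Count the labels of all network objects within the final NR of x and
        # return the predominant one; None flags an object with no neighbors.
        votes = Counter(a_gr[i] for i, y in enumerate(Y) if delta(x, y) < NR)
        return votes.most_common(1)[0][0] if votes else None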
The SOSIG algorithm is also described in (Stepaniuk and Kużelewska, 2008).
Algorithm 4: Detailed steps of the SOSIG function joinNotRepresented.

Data: Set of input objects U = {x_1, ..., x_n}, network set Y, Δ - matrix of distances between input and network objects, NR - threshold of the neighborhood radius.
Result: Y ∪ {x} (with the condition ¬∃y ∈ Y δ(y, x) < NR).
begin
    for x ∈ U do
        /* find an arbitrary object from the training set not yet represented by any network element */
        add ← 1;
        for y ∈ Y do
            if δ(y, x) < NR then
                add ← 0;
                break;
        if add = 1 then
            Y ← Y ∪ {x};
            break;
3 EXPERIMENTS
The article proposes a method for the detection of groups containing similar objects. The method clusters data that contain data points as well as hyperboxes. The experiments focus on comparing the results of clustering in two approaches: when the data are in point form and in granulated form. The following comparisons were performed: the number of detected groups, the values of the SOSIG parameters, the time of the detection process and the values of validity indices. The interpretability of the created clusterings was also taken into consideration.
Algorithm 5: Detailed steps of the SOSIG function adjust.

Data: Set of input objects U, network set Y, set of attributes A = {a_1, ..., a_k}, NR - threshold of the neighborhood radius.
Result: Y ∪ Z, where Z is a set of adjusted network objects.
begin
    Z ← ∅;
    for y ∈ Y do
        for candidate ← 1 to noCandidates do
            z_candidate ← y;
            for a ∈ A do
                sign ← rand({−1, 1});
                delta ← sign;
                if a(z_candidate) is binary then
                    /* modification of binary values of attributes */
                    randVal ← rand({0, 1});
                    delta ← delta · randVal;
                    a(z_candidate) ← delta;
                else
                    /* modification of continuous values of attributes */
                    randVal ← rand([0, 1]);
                    /* the scale of the change depends on the value of s_l */
                    delta ← delta · randVal · (1.5 − s_l(y));
                    a(z_candidate) ← a(z_candidate) + delta;
            for x ∈ U do
                if δ(z_candidate, x) < NR then
                    /* only useful clones are joined to the network */
                    Z ← Z ∪ {z_candidate};
                    break;
3.1 Description of the Datasets
Several data sets were used in the experiments, shown in Table 1. The sets vary with regard to the number of objects, the dimensionality and the number of groups present. The column groups number in Table 1 contains the number of groups present in the data according to subjective human perception based on the separation and compactness of the groups. However, the irises set is real data delivered with an a priori class attribute. For that reason
Algorithm 6: Clustering of new objects in the SOSIG algorithm.

Data: IS = (U, A) - an information system, where U = {x_1, ..., x_n} is a set of objects and A = {a_1, ..., a_k} is a set of attributes; IS' = (Y, A ∪ {a_gr}), where the last attribute a_gr : Y → {1, ..., ng} stands for the label of the generated granule and card(Y) ≤ card(U); NR - threshold of the neighborhood radius.
Result: Information system IS_gr = (U, A ∪ {a_gr}) clustered into ng clusters (granules), where the last attribute a_gr : U → {1, ..., ng} stands for the label of the generated granule.
begin
    for x ∈ U do
        for granule ← 1 to ng do
            grLabels[granule] ← 0;
        /* calculate the distances between x and the network objects */
        for y ∈ Y do
            if δ(x, y) < NR then
                label ← a_gr(y);
                grLabels[label] ← grLabels[label] + 1;
        /* the predominant label is selected */
        a_gr(x) ← arg max(grLabels);
Table 1: Description of data sets.

    data set     dim   objects number   hyperboxes number   groups number
    norm2D2gr     2         200                51                 2
    sph2D6gr      2         300                70                 6
    irises        4         150                94                 3
    sph10D4gr    10         200                13                 4
the value of the groups number for this set is related to the number of classes in the decision attribute.
3.2 Results of Experiments
Clustering by the SOSIG algorithm was performed on both point-type and granulated data. The numbers of point-type and granular objects are compared in Table 1. It can be noticed that in all cases the number of hyperboxes is significantly smaller than the number of points; however, the dimension is doubled for granulated data.
The number of groups identified by SOSIG in the data sets described above is presented in Table 2. When
Table 2: Results of clustering of point-type and granulated data with respect to the number of identified groups.

    data set     point-type data   granulated data
    norm2D2gr          2                  2
    sph2D6gr           6                  6
    irises            2, 3                4
    sph10D4gr          4                  4
the result consists of groups of highly varied sizes, only the number of main groups is presented.
The granulation of the irises set contains two levels (low and high resolution), which is visible in all the following tables. There are 2 clusters in the result when granulation is performed on the low resolution level, whereas at the high resolution one large cluster is split in two smaller ones and there are additionally 5 significantly smaller groups. The results considering both levels of granulation are shown in the same cell of the tables, where the first value corresponds to the low and the second to the high level of resolution. The granulation of the irises hyperboxes is composed of only one clustering level, with 4 main and 6 additional smaller groups.
For the remaining data sets, the numbers of groups correspond to each other for both types of processed data.
Table 3: SOSIG results with respect to the generated number of representatives and the NR value when clustering point-type and granulated data.

    data set     point data            granulated data
                 reps.     NR val.     reps.   NR val.   K val.
    norm2D2gr     80        0.11        21      0.25      10
    sph2D6gr     123        0.07        33      0.09      10
    irises       92, 112    0.22, 0.1   88      0.13      15
    sph10D4gr     46        0.21         6      0.18       6
The number of representatives generated by SOSIG and the NR value for point-type as well as hyperbox data are shown in Table 3. Additionally, for granulated data the K parameter is presented. K was selected so as to obtain a small number of hyperboxes while preventing points from different clusters from being combined into one granule. For the irises point-type data set, two levels of granulation were considered, composed of 2 and 3 groups. The number of representatives is smaller in the case of clustering granulated data. The values of NR differ between the two cases; however, there are comparable NR values for the clustering of the sph2D6gr, irises (the high-level clustering version for point-type data) and sph10D4gr sets.
Table 4: Average time (in seconds) of clustering granulated and point-type data.

    data set     point-type data, t_pd   granulated data, t_gd   t_pd/t_gd
    norm2D2gr          0.36                     0.04                 9
    sph2D6gr           0.93                     0.08                11.63
    irises             0.87, 0.80               0.79                 1.01
    sph10D4gr          0.27                     0.01                38.57
The results presented in Table 4 concern the time (in seconds) of a single run of the SOSIG algorithm. This is the average time over 50 runs of the algorithm, calculated for clustering the original data as well as the hyperboxes. The last column of the table contains the quotient of the two values. It can be seen that the processing of granulated data is significantly faster, up to about 40 times, than processing the original point-type objects. The acceleration is most visible when the number of objects in the data is large.
To compare the results of clustering one can use validity indices designed to detect the most compact and separable partitioning. The validity indices are not universal; however, they are the most popular tool for the assessment of clustering results (Halkidi and Batistakis, 2001). Simultaneous comparison of several of them can give a quite objective result.
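For readers who wish to repeat such a comparison, two of the four indices used here are readily available in scikit-learn; Dunn's index and CDbw have no scikit-learn implementation and are omitted from this illustrative sketch. Published index definitions vary, so the absolute values obtained this way need not match those reported in Table 5.

    from sklearn.metrics import silhouette_score, davies_bouldin_score

    def compare_partitions(X_points, labels_points, X_boxes, labels_boxes):
        # Side-by-side index values for the point-type and granulated partitions.
        for name, X, y in (("point-type", X_points, labels_points),
                           ("granulated", X_boxes, labels_boxes)):
            print(name,
                  "SI = %.2f" % silhouette_score(X, y),
                  "DB = %.2f" % davies_bouldin_score(X, y))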
Evaluation of the groupings of granulated data in comparison to the clusterings of point-type objects was performed in the experiments, and the results are shown in Table 5. The following indices were taken into consideration: Davies-Bouldin's (DB), Dunn's, Silhouette (SI) and CDbw. The clusterings of norm2D2gr, sph2D6gr and sph10D4gr in the form of hyperboxes are characterized by better values of 3 out of 4 indices. For the irises set the values are better for the point-type clustering.
Table 6 contains a detailed description of the groups (granules) detected in clustering the irises hyperbox data. The final result is composed of 10 clusters; however, due to the considerable differences in their sizes, the description focuses on the main 3 granules. The a priori decision attribute is composed of 3 classes: Iris-setosa (I-S), Iris-versicolor (I-Ve) and Iris-virginica (I-Vi). The set is described by 4 attributes: sepal-length (SL), sepal-width (SW), petal-length (PL) and petal-width (PW). The granule gr_1 contains 13 smaller granules (hyperboxes), all of which belong to the class Iris-setosa. Another granule (gr_3) has a comparable size (15 objects) and contains only objects from the Iris-versicolor class. The largest granule, gr_2, consists of 36 hyperboxes. It is not homogeneous with respect to the class attribute: 31% of its objects come from the Iris-versicolor class and 69% from Iris-virginica.
Table 5: Results of clustering (values of the indices) of point-type and granulated data.

    data set     indices   point-type data   granulated data
    norm2D2gr    DB             0.06              0.08
                 Dunn's         0.16              0.74
                 SI             0.53              0.76
                 CDbw          19.15             45.34
    sph2D6gr     DB             0.03              0.01
                 Dunn's         0.51              1.38
                 SI             0.75              0.85
                 CDbw          36.54             23.58
    irises       DB             0.14, 0.12        0.2
                 Dunn's         0.39, 0.19        0.25
                 SI             0.63, 0.53        0.3
                 CDbw          74.84, 2.91        1.61
    sph10D4gr    DB             0.01              0.0001
                 Dunn's         7.83              9.29
                 SI             0.93              0.96
                 CDbw         453.23             41.04
Attention should be paid to the attributes resulting from the doubling of dimensions. These features are related to the minimal and maximal values of the original attributes. As a consequence, an additional feature appears: the difference between the maximal and minimal values of particular variables. The average differences are presented in Table 6 in the column diff_Avg. The granule gr_1 is characterized by the widest range of all attributes, the granule gr_2 contains flowers with the smallest size of petals and sepals. Finally, the granule gr_3 is composed of irises with narrow and long petals and sepals.
Table 7 presents granules from the second level of the data relationship hierarchy. The granules are hyperboxes identified in the first phase of the granulation. For the table, the 3 largest hyperboxes (denoted as gr_ij) from every granule of the main level were selected. The second-level granules from the top-level granules gr_1 and gr_3 have larger sizes, and the ranges of their attribute values are greater, in contrast to the granules belonging to gr_2. This shows that the granules gr_1 and gr_3 are more compact and have greater regions of even information density. It can be noticed that the hyperboxes are homogeneous with regard to the class attribute.
4 CONCLUSIONS
The article presents a modified clustering method as an approach to data granulation. The algorithm is two-
Table 6: Main level of the irises data hierarchy composed of the clustering result of the hyperboxes set. The table contains the 3 main granules.

    id/size   class distr.   attr.   min val.   max val.   diff_Avg
    gr_1/13   100% I-S       SL      4.4-5.4    4.8-5.5      0.25
                             SW      3.0-3.7    3.1-3.9      0.15
                             PL      1.0-1.5    1.5-1.9      0.29
                             PW      0.1-0.4    0.1-0.5      0.12
    gr_2/36   31% I-Ve       SL      5.6-7.1    5.6-7.1      0.04
              69% I-Vi       SW      2.5-3.4    2.5-3.4      0.03
                             PL      4.3-6.0    4.4-6.0      0.07
                             PW      1.4-2.5    1.4-2.5      0.02
    gr_3/15   100% I-Ve      SL      5.2-6.1    5.2-6.2      0.11
                             SW      2.3-2.9    2.3-3.0      0.08
                             PL      3.5-4.7    3.6-4.7      0.17
                             PW      1.0-1.4    1.1-1.5      0.08
Table 7: Second level of the irises data hierarchy composed of hyperboxes (only selected objects are presented).

    main granule   gr. id   size   class distr.   attr.   min values   diff_Avg
    gr_1           gr_11     15    100% I-S       SL         5.0         0.5
                                                  SW         3.4         0.3
                                                  PL         1.3         0.4
                                                  PW         0.2         0.2
                   gr_12     15    100% I-S       SL         4.6         0.5
                                                  SW         3.3         0.3
                                                  PL         1.0         0.7
                                                  PW         0.2         0.3
                   gr_13      9    100% I-S       SL         4.8         0.2
                                                  SW         3.0         0.2
                                                  PL         1.2         0.4
                                                  PW         0.1         0.2
    gr_2           gr_21      5    100% I-Ve      SL         6.4         0.3
                                                  SW         2.9         0.2
                                                  PL         4.3         0.4
                                                  PW         1.3         0.2
                   gr_22      4    100% I-Vi      SL         6.4         0.1
                                                  SW         3.0         0.2
                                                  PL         5.1         0.4
                                                  PW         1.8         0.2
                   gr_23      4    100% I-Vi      SL         5.9         0.3
                                                  SW         2.8         0.2
                                                  PL         4.8         0.3
                                                  PW         1.8         0.0
    gr_3           gr_31     14    100% I-Ve      SL         5.6         0.5
                                                  SW         2.7         0.3
                                                  PL         3.9         0.8
                                                  PW         1.2         0.3
                   gr_32      8    100% I-Ve      SL         5.7         0.5
                                                  SW         2.6         0.3
                                                  PL         3.5         0.8
                                                  PW         1.0         0.3
                   gr_33      6    100% I-Vi      SL         5.4         0.3
                                                  SW         2.8         0.2
                                                  PL         4.1         0.4
                                                  PW         1.3         0.2
phased: the first phase prepares the input point-type data as multi-dimensional granules in the form of hyperboxes. The hyperboxes are based on maximizing the information density in the data. The next phase is clustering the granules by the SOSIG algorithm. The clustering process can be performed at different resolutions of the data; the clustering of hyperboxes was executed without changing the resolution. A three-level structure of the data was constructed by joining the original points (the bottom, third level) into hyperboxes (the second level), whereas the top level contains the division of the hyperboxes into clusters. The partitioning at the top level of the hyperbox granulation (clustering) is composed of the same number of groups as the partitioning of the point-type data. The quality of the created clusters is comparable as well, since the values of the quality indices are similar.
The process of hyperbox creation is a type of aggregation operation; therefore, the main benefit of the presented method is the shortening of the time of cluster creation in comparison to processing point-type data. It is especially effective when the data contain a large number of objects. Hyperboxes also determine an additional level of relationships existing in the data. Finally, the description of the granules is more comprehensible, since the hyperboxes contain the minimal and maximal values of the attributes.
ACKNOWLEDGEMENTS
This work was supported by Rector’s of Technical
University of Bialystok Grant No. S/WI/5/08.
The experiments were performed on the computer
cluster at Faculty of Computer Science, Bialystok
Technical University.
REFERENCES
Jain, A. K., Murty, M. N. and Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys 31(3), 264-323.
Bargiela, A. and Pedrycz, W. (2001). Classification and clustering of granular data. In IFSA World Congress and 20th NAFIPS International Conference, vol. 3, 1696-1701.
Bargiela, A. and Pedrycz, W. (2002). Granular Computing: an Introduction. Kluwer Academic Publishers, Boston.
Bargiela, A. and Pedrycz, W. (2006). Granular analysis of traffic data for turning movements estimation. Int. J. of Enterprise Information Systems, vol. 2-2, 13-27.
Halkidi, M. and Batistakis, Y. (2001). On clustering validation techniques. Journal of Intelligent Information Systems 17(2/3), 107-145.
Halkidi, M. and Vazirgiannis, M. (2002). Clustering validity assessment using multi representatives. In Proceedings of SETN Conference.
Kaufman, L. and Rousseeuw, P. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley.
Pawlak, Z. (1991). Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht.
Stepaniuk, J. and Kużelewska, U. (2008). Information granulation: A medical case study. In Transactions on Rough Sets, vol. 5390/2008, 96-113. Springer.
Wierzchoń, S. and Kużelewska, U. (2006). Evaluation of clusters quality in artificial immune clustering system - saris. In Biometrics, Computer Security Systems and Artificial Intelligence Applications, 323-331. Springer-Verlag.
Yao, Y. (2006). Granular computing for data mining. In Proceedings of SPIE Conference on Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security, 1-12.
Zadeh, L. A. (2001). A new direction in AI: Toward a computational theory of perceptions. AI Magazine 22(1), 73-84.