CLUSTERING WITH GRANULAR INFORMATION PROCESSING
Urszula Kużelewska
Faculty of Computer Science, Technical University of Bialystok, Wiejska 45a, 15-521 Bialystok, Poland
Keywords: Knowledge discovery, Data mining, Information granulation, Granular computing, Clustering, Hyperboxes.
Abstract: Clustering is a part of the data mining domain. Its task is to detect groups of similar objects on the basis of an established similarity criterion. Granular computing (GrC) includes methods from various areas whose aim is to support humans in better understanding the analyzed problem and the generated results. Granular computing techniques create and/or process data portions called granules, identified with regard to similar description, functionality or behavior. An interesting characteristic of granular computation is that it offers a multi-perspective view of data, depending on the required resolution level. Data granules identified on different levels of resolution form a hierarchical structure expressing relations between the objects of data.
The method proposed in this article creates data granules by clustering data in the form of hyperboxes. The results are compared with clustering of point-type data with regard to complexity, quality and interpretability.
1 INTRODUCTION
Granular computing (GrC) is a new multidisciplinary theory that has developed rapidly in recent years. The most common definitions of GrC (Yao, 2006), (Zadeh, 2001) include a postulate of computing with information granules, that is, collections of objects that exhibit similarity in terms of their properties or functional appearance. Although the term is new, the ideas and concepts of GrC have been used in many fields under different names: information hiding in programming, granularity in artificial intelligence, divide and conquer in theoretical computer science, interval computing, cluster analysis, fuzzy and rough set theories, neutrosophic computing, quotient space theory, belief functions, machine learning, databases, and many others. According to a more universal definition, granular computing may be considered as a label for a new field of multi-disciplinary study, dealing with theories, methodologies, techniques and tools that make use of granules in the process of problem solving (Yao, 2006).
A distinguishing aspect of GrC is its multi-perspective standpoint on data. Multi-perspective means diverse levels of resolution, depending on the saliency of features or the grade of detail of the studied problem. Data granules identified on different levels of resolution form a hierarchical structure expressing relations between the objects of data. Such a structure can be used to facilitate investigation and helps in understanding complex systems. Understanding of the analyzed problem and of the attained results are the main concerns of human-oriented systems. There are also definitions of granular computing that additionally concentrate on systems supporting human beings (Bargiela and Pedrycz, 2002)-(Bargiela and Pedrycz, 2006). According to the definitions mentioned above, such a methodology allows one to ignore irrelevant details and concentrate on the essential features of the systems, making them more understandable. In (Bargiela and Pedrycz, 2001) an approach to data granulation based on approximating data by multi-dimensional hyperboxes is presented. The hyperboxes represent data granules formed from the data points, focusing on maximization of the density of information present in the data. Among other benefits, this improves computational performance. The algorithm is described in the following sections.
Clustering is a part of the data mining domain performing exploratory analysis of data. Its aim is to determine natural clusters, that is, groups of objects more similar to one another than to the objects from other clusters (Jain et al., 1999). The criterion of similarity depends on the clustering algorithm
and the data type. The most common similarity measure is the distance between points, for example, the Euclidean metric for continuous attributes. There is no universal method to assess clustering results. One of the approaches is to measure the quality of partitioning by special indicants (validity indices). The most common measures are: Davies-Bouldin's (DB), Dunn's (Halkidi and Batistakis, 2001), the Silhouette Index (SI) (Kaufman and Rousseeuw, 1990) and CDbw (Halkidi and Vazirgiannis, 2002). Clustering algorithms have wide applications in pattern recognition, image processing, statistical data analysis and knowledge discovery. Following the definitions quoted above, where a granule is determined as a set of objects, one can consider the groups identified by clustering algorithms as data granules. According to that definition, a granule can contain other granules as well as be part of another granule. This makes it possible to employ clustering algorithms to create granulation structures of data.
The article proposes an approach to information granulation by clustering data that are in the form of hyperboxes. Hyperboxes are created in the first step of the algorithm and are then clustered by the SOSIG method (Stepaniuk and Kużelewska, 2008). This solution is effective with regard to time complexity and the interpretability of the generated groups of data. The paper is organized as follows: the next section, Section 2, describes the proposed approach; Section 3 reports the collected data sets as well as the executed experiments. The last section concludes the article.
2 GRANULAR CLUSTERING BY SOSIG
The proposed method of data granulation is composed of two phases. The first phase prepares the data objects in the form of granules (hyperboxes), whereas the second detects similar groups of the granules. The final result of granulation is a three-level structure, where the main granulation is defined by clusters of granules, the following level consists of the granules that are components of the top-level clusters, and the bottom, third level consists of point-type objects.
The method of hyperbox creation is designed to reduce the complexity of the description of real-world systems. The improved generality of the information granules is attained by sacrificing some of the numerical precision of the point data (Bargiela and Pedrycz, 2001). The hyperboxes (referred to as I) are multi-dimensional structures described by a pair of values a and b for every dimension. The point a_i represents the minimal and b_i the maximal value of the granule in the i-th dimension, thus the width of the i-th dimensional edge equals |b_i - a_i|. Creation of hyperboxes is based on maximization of the "information density" of granules (the algorithm is described in detail in (Bargiela and Pedrycz, 2006)). Information density can be expressed by Equation 1:

    σ = card(I) / φ(width(I))    (1)
Maximization of σ is a problem of balancing the shortest possible dimensions against the greatest cardinality of the formed granule I. In the experiments presented in the following section, the cardinality of the granule I is taken as the number of point-type objects belonging to the granule. Belonging means that the values of the point attributes are between or equal to the minimal and maximal values of the hyperbox attributes. For that reason the cardinality has to be re-calculated every time a new, larger granule is formed from a combination of two granules. In the multi-dimensional case, the function of the hyperbox widths given in Equation 2 is applied:

    φ(u) = exp(K · (max_i(u_i) - min_j(u_j))),  i, j = 1, ..., n    (2)
where u = (u_1, u_2, ..., u_n) and u_i = width([a_i, b_i]) for i = 1, ..., n. The points a_i and b_i denote respectively the minimal and maximal value in the i-th dimension. The constant K originally equals 2; in the experiments, however, different values of K, given as a parameter, were used. The computational complexity of this algorithm is O(N^3). However, in every step of the method the size of the data is decreased by 1, which in practice significantly reduces the overall complexity. The data granulation algorithm assumes processing of hyperboxes as well as point-type data. To make this possible, the new data are characterized by 2·n values in comparison with the original data: the first n attributes describe the minimal, whereas the following n describe the maximal values for every dimension. To assure topological "compatibility" of point-type data and hyperboxes, the dimensionality of the data is doubled initially.
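To make the density-driven merging concrete, a minimal Python sketch of this granulation phase is given below. It is an illustration under stated assumptions, not the original implementation: the pair search is a naive greedy scan, the stopping rule (a target number of granules) is an assumption, since the original stopping condition is not repeated here, and the function names (phi, information_density, granulate) are hypothetical.

    import numpy as np

    def phi(widths, K=2.0):
        # phi(u) = exp(K * (max_i(u_i) - min_j(u_j)))  (Equation 2)
        return np.exp(K * (widths.max() - widths.min()))

    def information_density(a, b, points, K=2.0):
        # sigma = card(I) / phi(width(I))  (Equation 1); card(I) counts the
        # point-type objects lying between the box minima a and maxima b,
        # re-calculated for every candidate merge as the text requires.
        card = np.sum(np.all((points >= a) & (points <= b), axis=1))
        return card / phi(b - a, K)

    def granulate(points, target, K=2.0):
        # Points enter as degenerate boxes [x, x]; each step applies the merge
        # that maximizes sigma, so the data size shrinks by one granule per
        # iteration (hence the roughly O(N^3) cost quoted above).
        boxes = [(x.copy(), x.copy()) for x in points]
        while len(boxes) > target:
            best = None
            for i in range(len(boxes)):
                for j in range(i + 1, len(boxes)):
                    a = np.minimum(boxes[i][0], boxes[j][0])
                    b = np.maximum(boxes[i][1], boxes[j][1])
                    s = information_density(a, b, points, K)
                    if best is None or s > best[0]:
                        best = (s, i, j, a, b)
            _, i, j, a, b = best
            boxes[i] = (a, b)
            del boxes[j]
        # Doubled 2n-dimensional representation handed to the clustering phase:
        # first the n minima, then the n maxima.
        return np.array([np.concatenate([a, b]) for a, b in boxes])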
2.1 Self-Organizing System for Information Granulation
The SOSIG (Self-Organizing System for Information Granulation) algorithm is a system designed for detecting granules present in data. The granulation is performed by clustering, and the clusters can be identified on different levels of resolution. The prototype of the algorithm is the method described in (Wierzchoń and Kużelewska, 2006). In SOSIG, however, the granulation property and the ability to cope with different attribute types were introduced. This entailed
fundamental changes in its implementation. In the following description of the algorithm, new terms and symbols more compatible with GrC theory are used, in contrast to the description from (Wierzchoń and Kużelewska, 2006). SOSIG creates a network structure of connected objects (in general points, but in the presented solution hyperboxes) forming clusters. The organization of the system, including the points as well as the connections, is constructed on the basis of relationships between the input data, without any external supervision. The structure points are representatives of the input data, that is, an individual object from the structure stands for one or more objects from the input set. In effect, the number of representatives is much smaller than the input data, without loss of information.
For a convenient and compact notation, let us assume the input data are defined as an information system IS = (U, A) (Pawlak, 1991), where U = {x_1, ..., x_n} is a set of objects and A = {a_1, ..., a_k} is a set of attributes. The result generated by SOSIG is also described by an information system IS' = (Y, A ∪ {a_gr}), where the last attribute a_gr : Y → {1, ..., nc} denotes the label of the generated cluster, card(Y) ≤ card(U) and ∀x ∈ U ∃y ∈ Y (δ(x, y) < NR). The parameter NR in general defines the region of object interactions and is described later. The steps of the main (learning) part of SOSIG are shown in Algorithm 1, whereas the initial phase is separated into Algorithm 2. Further classification of new as well as training points can be performed using the so-created network, as presented in Algorithm 6.
The parameter NR (Neighborhood Radius) appearing in the above descriptions defines the neighborhood of objects from IS'. It directly influences the level of granulation of the input set. The initial value of NR is proportional to the maximum of the nearest neighbor distances in the input set (see Equation 3):

    NR_init = max({min({δ(x_i, x_j) : x_j ∈ U ∧ x_j ≠ x_i}) : x_i ∈ U})    (3)
The following values of NR are calculated from the current state of the network (see Equation 4):

    NR = rg · (Σ_{y_i ∈ Y} min({δ(y_i, y_j) : y_j ∈ Y ∧ y_j ≠ y_i})) / card(Y)    (4)

where rg ∈ [0, 1] is a resolution of granulation parameter. The value of rg is proportional to the level of granulation, that is, the top, first level of the granulated structure is characterized by lower values of rg and the following levels (second, third and so on) are characterized by higher values of the parameter. The rg values are not fixed for a given level of the hierarchy, but related
Algorithm 1: Construction of an information system with a set of representative objects.

Data: IS = (U, A) - an information system, where U = {x_1, ..., x_n} is a set of objects and A = {a_1, ..., a_k} is a set of attributes; {δ_a : a ∈ A} - a set of distance functions of the form δ_a : V_a × V_a → [0, ∞), where V_a is the set of values of attribute a ∈ A, and a global distance function δ : U × U → [0, ∞) defined by δ(x, y) = fusion(δ_{a_1}(a_1(x), a_1(y)), ..., δ_{a_k}(a_k(x), a_k(y))); size_net ∈ {0, 1, ..., card(U)} - initial size of the network; rg ∈ [0, 1] - resolution of granulation.
Result: IS' = (Y, A ∪ {a_gr}) - an information system, where the last attribute a_gr : Y → {1, ..., nc} denotes the label of the generated granule, card(Y) ≤ card(U) and ∀x ∈ U ∃y ∈ Y δ(x, y) < NR.
begin
    [NR_init, Y] ← initialize(U, A, size_net);
    for y_i, y_j ∈ Y, i ≠ j do                      /* form clusters */
        if δ(y_i, y_j) < NR_init then connect(y_i, y_j);
    NR ← NR_init;
    while ¬stopIterations(Y) do
        for y ∈ Y do
            Δ(y) ← (δ(y, x))_{x ∈ U};               /* calculate distances from input data */
            s_l(y) ← NR − min Δ(y);                 /* similarity level of the object */
        delete(U, A, Y);                            /* remove redundant network objects */
        for y_i, y_j ∈ Y, i ≠ j do                  /* reconnect objects */
            if δ(y_i, y_j) < NR then connect(y_i, y_j);
            a_gr(y_i) ← 0; a_gr(y_j) ← 0;
        grLabel ← 1;
        for y_i ∈ Y do                              /* label objects */
            if a_gr(y_i) = 0 then
                a_gr(y_i) ← grLabel;
                for y_j ∈ Y, j ≠ i do
                    if connected(y_i, y_j) then a_gr(y_j) ← grLabel;
                grLabel ← grLabel + 1;
        for y_i ∈ Y do                              /* calculate the nearest neighbor network objects */
            δ_NN(y_i) ← min({δ(y_i, y_j) : y_j ∈ Y ∧ j ≠ i});
        NR ← rg · (Σ_{y ∈ Y} δ_NN(y)) / card(Y);    /* new value of NR */
        if ¬stopIterations(Y) then                  /* test stopping condition */
            joinNotRepresented(U, Y, NR, Δ);
            adjust(Y, U, A, NR);
to the individual data set. However, the value rg = 0.5 appears most often in empirical tests for the most separated and compact (termed "natural") clustering.
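The two radius formulas translate into a few lines of code. The sketch below is illustrative only: it assumes precomputed square matrices of pairwise distances, and the function names are hypothetical.

    import numpy as np

    def nearest_neighbor_distances(dist):
        # Nearest-neighbor distance of every object, given a square matrix of
        # pairwise distances (the diagonal is masked out).
        d = dist.astype(float).copy()
        np.fill_diagonal(d, np.inf)
        return d.min(axis=1)

    def nr_init(dist_U):
        # Equation 3: the greatest nearest-neighbor distance in the input set U.
        return nearest_neighbor_distances(dist_U).max()

    def nr_update(dist_Y, rg):
        # Equation 4: rg times the mean nearest-neighbor distance in the current
        # network Y; lower rg gives a coarser granulation, higher rg a finer one.
        return rg * nearest_neighbor_distances(dist_Y).mean()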
After the initial phase (such as normalization of the data and calculation of the initial value of NR), the iterated steps of the algorithm follow. First, the system objects are assessed.
Algorithm 2: Initial steps of the SOSIG algorithm.

Data: Set of input objects U = {x_1, ..., x_n}; size_net ∈ {0, 1, ..., card(U)} - initial size of the network.
Result: Set of initial network objects Y = {y_1, ..., y_size_net}, where Y ⊆ U; NR_init - initial value of the neighborhood radius threshold.
begin
    δ_maxNN ← 0;
    for x_i ∈ U do
        /* calculate the nearest neighbor distances of the data */
        δ_NN(x_i) ← min({δ(x_i, x_j) : x_j ∈ U ∧ x_j ≠ x_i});
        /* find the greatest value of the nearest neighbor distances */
        if δ_NN(x_i) > δ_maxNN then δ_maxNN ← δ_NN(x_i);
    NR_init ← δ_maxNN;
    /* select the representative objects */
    for netObj ← 1 to size_net do
        i ← rand(1, ..., card(U));
        y_netObj ← x_i;
The measure of their usefulness is a similarity level expressed by Equation 5:

    s_l(y) = NR − min({δ(y, x) : x ∈ U})    (5)

The s_l defines a degree of similarity between the examined representative point y and the most similar point x from the training data. Only training points from the neighborhood of y (defined by NR) are considered. The similarity level corresponds to the distance between the points: closer points are more similar to each other than points located at a farther distance. After calculation of the similarity levels, their values are normalized. In the presented experiments, to calculate the distance between objects (hyperboxes), the distance measure expressed by Equation 6 was used:
    d(I_A, I_B) = (‖a_B − a_A‖ + ‖b_B − b_A‖) / 2    (6)

where ‖a_B − a_A‖ and ‖b_B − b_A‖ denote the sums of the differences of respectively the minimal (a) and maximal (b) values of the granules I_A and I_B in every dimension. The equation was introduced in (Bargiela and Pedrycz, 2001).
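Both formulas are short enough to state in code. The sketch below assumes the doubled 2n-dimensional encoding described earlier (minima first, then maxima) and reads the norm in Equation 6 as a per-dimension sum of absolute differences; that reading, as well as the function names, are assumptions for illustration.

    import numpy as np

    def hyperbox_distance(box_A, box_B, n):
        # Equation 6: d(I_A, I_B) = (||a_B - a_A|| + ||b_B - b_A||) / 2, where
        # the first n entries of a box are the minima a, the last n the maxima b.
        a_A, b_A = box_A[:n], box_A[n:]
        a_B, b_B = box_B[:n], box_B[n:]
        return (np.abs(a_B - a_A).sum() + np.abs(b_B - b_A).sum()) / 2.0

    def similarity_level(y, U, NR, delta):
        # Equation 5: s_l(y) = NR - min over x in U of delta(y, x); a network
        # object close to some input object obtains a high similarity level.
        return NR - min(delta(y, x) for x in U)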
To control the size of the network there is a removal step, presented in Algorithm 3. Useless objects are removed; this affects redundant objects from
Algorithm 3: Detailed steps of the SOSIG function delete.

Data: Set of input objects U, network set Y, set of attributes A = {a_1, ..., a_k}, NR - threshold of the neighborhood radius.
Result: Y \ C, where C is a set of redundant network objects.
begin
    C ← ∅;                               /* initially the set is empty */
    for x ∈ U do
        for y ∈ Y do
            /* add to the set the objects representing the input element x */
            if δ(y, x) < NR then C ← C ∪ {y};
        δ_NN(x, y_NN) ← min({δ(x, y) : y ∈ Y});
        C ← C \ {y_NN};                  /* remove from the set the best object representing x */
    Y ← Y \ C;                           /* remove from the network the objects from the set C */
the network of representatives. Points that have the same input object (from U) in their neighborhood are determined as redundant. The best points stay in the network, as do those that are not redundant for other input data. This process controls the size of the network and prevents the formation of excessively dense clusters. It results in a compression phenomenon.
The remaining objects in the network are reconnected and labeled. A granule is determined by the edges between the objects in the structure. Components of the same granule (group) have an equal label. Afterwards, a new value of the NR parameter is calculated (see Equation 4). When the stopping criterion is met, the algorithm stops after connection reconstruction and labeling. Otherwise, the following steps are carried out. As a stopping criterion, a stable state of the network is considered, that is, a state of small fluctuations of the network size and of the value of NR.
The last step is to apply a procedure of adjusting all network objects (see Algorithm 5). In this step, candidate objects that are copies of the original ones are created. The number of copies is not large; it is fixed in the algorithm at 5. The values of the attributes of the candidate objects are slightly modified (depending on the similarity level of the objects and the type of the attributes). This procedure allows the network objects in the attained solution to be adjusted to the examined problem.
In the algorithm there is also a step of introducing an object from the input data into the system. It concerns an object which has not been identified yet. This operation (presented in Algorithm 4) avoids leaving
uncovered (not represented by network objects) areas in the training set.
Further classification of new as well as training points can be performed using the so-created structure (see Algorithm 6). To assign a label to the considered object, it is necessary to determine the neighboring objects from the network structure. The neighborhood of the point is defined by the final value of NR (the last calculated value) of SOSIG. The predominant value of the labels is given to the examined object.
It must be underlined that the SOSIG algorithm does not require the number of clusters to be given. Partitioning is generated automatically, which eliminates the inconvenient step of assessing and selecting the best result from a set of potential clusterings.
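The voting rule can be stated in a few lines; the following sketch assumes a list of network objects Y with parallel labels a_gr and a distance callable delta, all hypothetical names chosen for illustration.

    from collections import Counter

    def classify(x, Y, a_gr, NR, delta):
        # Count the labels of all network objects within the final NR of x and
        # return the predominant one; None flags an object with no neighbors.
        votes = Counter(a_gr[i] for i, y in enumerate(Y) if delta(x, y) < NR)
        return votes.most_common(1)[0][0] if votes else None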
The SOSIG algorithm is also described in (Stepaniuk and Kużelewska, 2008).
Algorithm 4: Detailed steps of the SOSIG function joinNotRepresented.

Data: Set of input objects U = {x_1, ..., x_n}, network set Y, Δ - matrix of distances between input and network objects, NR - threshold of the neighborhood radius.
Result: Y ∪ {x} (with the condition ¬∃y ∈ Y δ(y, x) < NR).
begin
    for x ∈ U do
        /* find an arbitrary object from the training set not yet represented by any network element */
        add ← 1;
        for y ∈ Y do
            if δ(y, x) < NR then
                add ← 0;
                break;
        if add = 1 then
            Y ← Y ∪ {x};
            break;
3 EXPERIMENTS
The article proposes a method for the detection of groups containing similar objects. The method clusters data that contain data points as well as hyperboxes. The experiments focus on comparing the results of clustering in two approaches: when the data are in point form and in granulated form. The following comparisons were performed: the number of detected groups, the values of the SOSIG parameters, the time of the detection process and the values of validity indices. The interpretability of the created clusterings was also taken into consideration.
Algorithm 5: Detailed steps of the SOSIG function adjust.

Data: Set of input objects U, network set Y, set of attributes A = {a_1, ..., a_k}, NR - threshold of the neighborhood radius.
Result: Y ∪ Z, where Z is a set of adjusted network objects.
begin
    Z ← ∅;
    for y ∈ Y do
        for candidate ← 1 to noCandidates do
            z_candidate ← y;
            for a ∈ A do
                sign ← rand({−1, 1});
                delta ← sign;
                if a(z_candidate) is binary then
                    /* modification of binary values of attributes */
                    randVal ← rand({0, 1});
                    delta ← delta · randVal;
                    a(z_candidate) ← delta;
                else
                    /* modification of continuous values of attributes */
                    randVal ← rand([0, 1]);
                    /* the scale of the change depends on the value of s_l */
                    delta ← delta · randVal · (1.5 − s_l(y));
                    a(z_candidate) ← a(z_candidate) + delta;
            for x ∈ U do
                if δ(z_candidate, x) < NR then
                    /* only useful clones are joined to the network */
                    Z ← Z ∪ {z_candidate};
                    break;
3.1 Description of the Datasets
Several data sets were used in the experiments, shown in Table 1. The sets vary with regard to the number of objects, the dimensionality and the number of groups present. The column groups number in Table 1 contains the number of groups present in the data according to subjective human perception based on the separation and compactness of the groups. However, the irises set is real data delivered with an a priori class attribute. For that reason
Algorithm 6: Clustering of new objects in the SOSIG algorithm.

Data: IS = (U, A) - an information system, where U = {x_1, ..., x_n} is a set of objects and A = {a_1, ..., a_k} is a set of attributes; IS' = (Y, A ∪ {a_gr}), where the last attribute a_gr : Y → {1, ..., ng} stands for the label of the generated granule and card(Y) ≤ card(U); NR - threshold of the neighborhood radius.
Result: Information system IS_gr = (U, A ∪ {a_gr}) clustered into ng clusters (granules), where the last attribute a_gr : U → {1, ..., ng} stands for the label of the generated granule.
begin
    for x ∈ U do
        for granule ← 1 to ng do
            grLabels[granule] ← 0;
        /* calculate the distances between x and the network objects */
        for y ∈ Y do
            if δ(x, y) < NR then
                label ← a_gr(y);
                grLabels[label] ← grLabels[label] + 1;
        /* the predominant label is selected */
        a_gr(x) ← arg max(grLabels);
Table 1: Description of data sets.

    data set     dim   objects number   hyperboxes number   groups number
    norm2D2gr     2         200                51                 2
    sph2D6gr      2         300                70                 6
    irises        4         150                94                 3
    sph10D4gr    10         200                13                 4
the value of the groups number for this set is related to the number of classes in the decision attribute.
3.2 Results of Experiments
Clustering by the SOSIG algorithm was performed on both point-type and granulated data. The numbers of point-type and granular objects are compared in Table 1. It can be noticed that in all cases the number of hyperboxes is significantly smaller than the number of points; however, the dimension is doubled for granulated data.
The number of groups identified by SOSIG in the data sets described above is presented in Table 2. When
Table 2: Results of clustering of point-type and granulated data with respect to the number of identified groups.

    data set     point-type data   granulated data
    norm2D2gr          2                  2
    sph2D6gr           6                  6
    irises            2, 3                4
    sph10D4gr          4                  4
the result consists of groups of highly varied sizes, only the number of main groups is presented.
The granulation of the irises set contains two levels (low and high resolution), which is visible in all the following tables. There are 2 clusters in the result when granulation is performed on the low resolution level, whereas at the high resolution one large cluster is split in two smaller ones and there are additionally 5 significantly smaller groups. The results considering both levels of granulation are shown in the same cell of the tables, where the first value corresponds to the low and the second to the high level of resolution. The granulation of the irises hyperboxes is composed of only one clustering level, with 4 main and 6 additional smaller groups.
For the remaining data sets, the numbers of groups correspond to each other for both types of processed data.
Table 3: SOSIG results with respect to the generated number of representatives and the NR value when clustering point-type and granulated data.

    data set     point data            granulated data
                 reps.     NR val.     reps.   NR val.   K val.
    norm2D2gr     80        0.11        21      0.25      10
    sph2D6gr     123        0.07        33      0.09      10
    irises       92, 112    0.22, 0.1   88      0.13      15
    sph10D4gr     46        0.21         6      0.18       6
The number of representatives generated by SOSIG and the NR value for point-type as well as hyperbox data are shown in Table 3. Additionally, for granulated data the K parameter is presented. K was selected so as to obtain a small number of hyperboxes while preventing points from different clusters from being combined into one granule. For the irises point-type data set, two levels of granulation were considered, composed of 2 and 3 groups. The number of representatives is smaller in the case of clustering granulated data. The values of NR differ between the two cases; however, there are comparable NR values for the clustering of the sph2D6gr, irises (the high-level clustering version for point-type data) and sph10D4gr sets.
Table 4: Average time (in seconds) of clustering granulated and point-type data.

    data set     point-type data, t_pd   granulated data, t_gd   t_pd/t_gd
    norm2D2gr          0.36                     0.04                 9
    sph2D6gr           0.93                     0.08                11.63
    irises             0.87, 0.80               0.79                 1.01
    sph10D4gr          0.27                     0.01                38.57
The results presented in Table 4 concern the time (in seconds) of a single run of the SOSIG algorithm. This is the average time over 50 runs of the algorithm, calculated for clustering the original data as well as the hyperboxes. The last column of the table contains the quotient of the two values. It can be seen that the processing of granulated data is significantly faster, up to about 40 times, than processing the original point-type objects. The acceleration is most visible when the number of objects in the data is large.
To compare the results of clustering one can use validity indices designed to detect the most compact and separable partitioning. The validity indices are not universal; however, they are the most popular tool for the assessment of clustering results (Halkidi and Batistakis, 2001). Simultaneous comparison of several of them can give a quite objective result.
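For readers who wish to repeat such a comparison, two of the four indices used here are readily available in scikit-learn; Dunn's index and CDbw have no scikit-learn implementation and are omitted from this illustrative sketch. Published index definitions vary, so the absolute values obtained this way need not match those reported in Table 5.

    from sklearn.metrics import silhouette_score, davies_bouldin_score

    def compare_partitions(X_points, labels_points, X_boxes, labels_boxes):
        # Side-by-side index values for the point-type and granulated partitions.
        for name, X, y in (("point-type", X_points, labels_points),
                           ("granulated", X_boxes, labels_boxes)):
            print(name,
                  "SI = %.2f" % silhouette_score(X, y),
                  "DB = %.2f" % davies_bouldin_score(X, y))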
Evaluation of the groupings of granulated data in comparison to the clusterings of point-type objects was performed in the experiments, and the results are shown in Table 5. The following indices were taken into consideration: Davies-Bouldin's (DB), Dunn's, Silhouette (SI) and CDbw. The clusterings of norm2D2gr, sph2D6gr and sph10D4gr in the form of hyperboxes are characterized by better values of 3 out of 4 indices. For the irises set the values are better for the point-type clustering.
Table 6 contains a detailed description of the groups (granules) detected in clustering the irises hyperbox data. The final result is composed of 10 clusters; however, due to the considerable differences in their sizes, the description focuses on the main 3 granules. The a priori decision attribute is composed of 3 classes: Iris-setosa (I-S), Iris-versicolor (I-Ve) and Iris-virginica (I-Vi). The set is described by 4 attributes: sepal-length (SL), sepal-width (SW), petal-length (PL) and petal-width (PW). The granule gr_1 contains 13 smaller granules (hyperboxes), all of which belong to the class Iris-setosa. Another granule (gr_3) has a comparable size (15 objects) and contains only objects from the Iris-versicolor class. The largest granule, gr_2, consists of 36 hyperboxes. It is not homogeneous with respect to the class attribute: 31% of its objects come from the Iris-versicolor class and 69% from Iris-virginica.
Table 5: Results of clustering (values of the indices) of point-type and granulated data.

    data set     indices   point-type data   granulated data
    norm2D2gr    DB             0.06              0.08
                 Dunn's         0.16              0.74
                 SI             0.53              0.76
                 CDbw          19.15             45.34
    sph2D6gr     DB             0.03              0.01
                 Dunn's         0.51              1.38
                 SI             0.75              0.85
                 CDbw          36.54             23.58
    irises       DB             0.14, 0.12        0.2
                 Dunn's         0.39, 0.19        0.25
                 SI             0.63, 0.53        0.3
                 CDbw          74.84, 2.91        1.61
    sph10D4gr    DB             0.01              0.0001
                 Dunn's         7.83              9.29
                 SI             0.93              0.96
                 CDbw         453.23             41.04
Attention should be paid to the attributes resulting from the doubling of dimensions. These features are related to the minimal and maximal values of the original attributes. As a consequence, an additional feature appears: the difference between the maximal and minimal values of particular variables. The average differences are presented in Table 6 in the column diff_Avg. The granule gr_1 is characterized by the widest range of all attributes, the granule gr_2 contains flowers with the smallest size of petals and sepals. Finally, the granule gr_3 is composed of irises with narrow and long petals and sepals.
Table 7 presents granules from the second level of the data relationship hierarchy. The granules are hyperboxes identified in the first phase of the granulation. For the table, the 3 largest hyperboxes (denoted as gr_ij) from every granule of the main level were selected. The second-level granules from the top-level granules gr_1 and gr_3 have larger sizes, and the ranges of their attribute values are greater, in contrast to the granules belonging to gr_2. This shows that the granules gr_1 and gr_3 are more compact and have greater regions of even information density. It can be noticed that the hyperboxes are homogeneous with regard to the class attribute.
4 CONCLUSIONS
The article presents a modified clustering method as an approach to data granulation. The algorithm is two-
Table 6: Main level of the irises data hierarchy composed of the clustering result of the hyperboxes set. The table contains the 3 main granules.

    id/size   class distr.   attr.   min val.   max val.   diff_Avg
    gr_1/13   100% I-S       SL      4.4-5.4    4.8-5.5      0.25
                             SW      3.0-3.7    3.1-3.9      0.15
                             PL      1.0-1.5    1.5-1.9      0.29
                             PW      0.1-0.4    0.1-0.5      0.12
    gr_2/36   31% I-Ve       SL      5.6-7.1    5.6-7.1      0.04
              69% I-Vi       SW      2.5-3.4    2.5-3.4      0.03
                             PL      4.3-6.0    4.4-6.0      0.07
                             PW      1.4-2.5    1.4-2.5      0.02
    gr_3/15   100% I-Ve      SL      5.2-6.1    5.2-6.2      0.11
                             SW      2.3-2.9    2.3-3.0      0.08
                             PL      3.5-4.7    3.6-4.7      0.17
                             PW      1.0-1.4    1.1-1.5      0.08
Table 7: Second level of the irises data hierarchy composed of hyperboxes (only selected objects are presented).

    main granule   gr. id   size   class distr.   attr.   min values   diff_Avg
    gr_1           gr_11     15    100% I-S       SL         5.0         0.5
                                                  SW         3.4         0.3
                                                  PL         1.3         0.4
                                                  PW         0.2         0.2
                   gr_12     15    100% I-S       SL         4.6         0.5
                                                  SW         3.3         0.3
                                                  PL         1.0         0.7
                                                  PW         0.2         0.3
                   gr_13      9    100% I-S       SL         4.8         0.2
                                                  SW         3.0         0.2
                                                  PL         1.2         0.4
                                                  PW         0.1         0.2
    gr_2           gr_21      5    100% I-Ve      SL         6.4         0.3
                                                  SW         2.9         0.2
                                                  PL         4.3         0.4
                                                  PW         1.3         0.2
                   gr_22      4    100% I-Vi      SL         6.4         0.1
                                                  SW         3.0         0.2
                                                  PL         5.1         0.4
                                                  PW         1.8         0.2
                   gr_23      4    100% I-Vi      SL         5.9         0.3
                                                  SW         2.8         0.2
                                                  PL         4.8         0.3
                                                  PW         1.8         0.0
    gr_3           gr_31     14    100% I-Ve      SL         5.6         0.5
                                                  SW         2.7         0.3
                                                  PL         3.9         0.8
                                                  PW         1.2         0.3
                   gr_32      8    100% I-Ve      SL         5.7         0.5
                                                  SW         2.6         0.3
                                                  PL         3.5         0.8
                                                  PW         1.0         0.3
                   gr_33      6    100% I-Vi      SL         5.4         0.3
                                                  SW         2.8         0.2
                                                  PL         4.1         0.4
                                                  PW         1.3         0.2
phased: the first phase prepares the input point-type data as multi-dimensional granules in the form of hyperboxes. The hyperboxes are based on maximizing the information density in the data. The next phase is clustering the granules by the SOSIG algorithm. The clustering process can be performed at different resolutions of the data; the clustering of hyperboxes was executed without changing the resolution. A three-level structure of the data was constructed by joining the original points (the bottom, third level) into hyperboxes (the second level), whereas the top level contains the division of the hyperboxes into clusters. The partitioning at the top level of the hyperbox granulation (clustering) is composed of the same number of groups as the partitioning of the point-type data. The quality of the created clusters is comparable as well, since the values of the quality indices are similar.
The process of hyperbox creation is a type of aggregation operation; therefore, the main benefit of the presented method is the shortening of the time of cluster creation in comparison to processing point-type data. It is especially effective when the data contain a large number of objects. Hyperboxes also determine an additional level of relationships existing in the data. Finally, the description of the granules is more comprehensible, since the hyperboxes contain the minimal and maximal values of the attributes.
ACKNOWLEDGEMENTS
This work was supported by Rector’s of Technical
University of Bialystok Grant No. S/WI/5/08.
The experiments were performed on the computer
cluster at Faculty of Computer Science, Bialystok
Technical University.
REFERENCES
Jain, A. K., Murty, M. N. and Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys 31(3), 264-323.
Bargiela, A. and Pedrycz, W. (2001). Classification and clustering of granular data. In IFSA World Congress and 20th NAFIPS International Conference, vol. 3, 1696-1701.
Bargiela, A. and Pedrycz, W. (2002). Granular Computing: an Introduction. Kluwer Academic Publishers, Boston.
Bargiela, A. and Pedrycz, W. (2006). Granular analysis of traffic data for turning movements estimation. Int. J. of Enterprise Information Systems, vol. 2-2, 13-27.
Halkidi, M. and Batistakis, Y. (2001). On clustering validation techniques. Journal of Intelligent Information Systems 17(2/3), 107-145.
Halkidi, M. and Vazirgiannis, M. (2002). Clustering validity assessment using multi representatives. In Proceedings of SETN Conference.
Kaufman, L. and Rousseeuw, P. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley.
Pawlak, Z. (1991). Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht.
Stepaniuk, J. and Kużelewska, U. (2008). Information granulation: A medical case study. In Transactions on Rough Sets, vol. 5390/2008, 96-113. Springer.
Wierzchoń, S. and Kużelewska, U. (2006). Evaluation of clusters quality in artificial immune clustering system - saris. In Biometrics, Computer Security Systems and Artificial Intelligence Applications, 323-331. Springer-Verlag.
Yao, Y. (2006). Granular computing for data mining. In Proceedings of SPIE Conference on Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security, 1-12.
Zadeh, L. A. (2001). A new direction in AI: Toward a computational theory of perceptions. AI Magazine 22(1), 73-84.