CORD: A HYBRID APPROACH FOR EFFICIENT
CLUSTERING OF ORDINAL DATA USING
FUZZY LOGIC AND SELF-ORGANIZING MAPS
Natascha Hoebel and Stanislav Kreuzer
Department of Computer Science and Mathematics, Database and Information Systems
Goethe University Frankfurt, Germany
Keywords:
User profile analysis, Clustering, Ordinal data, Optimization, Web mining.
Abstract:
This paper presents CORD, a hybrid clustering system that combines modifications of three modern clustering approaches into a solution able to efficiently process very large sets of ordinal data. The Self-organizing Maps algorithm for categorical data by Chen and Marques is used for a rough pre-clustering that determines the initial position and number of centroids. The main clustering task utilizes a k-modes algorithm and its fuzzy set extension for categorical data using fuzzy centroids, described by Kim et al. Finally, to deal with large amounts of data, the BIRCH algorithm described by Zhang et al. for efficient clustering of very large databases (VLDBs) is adapted to ordinal data. BIRCH can be used as a preliminary phase for both Fuzzy Centroids and NCSOM; both algorithms profit from this symbiosis, as their iterative computations can be done on data that is fully held in main memory. Combining these approaches, the resulting system is able to extract significant information efficiently even from very large datasets. The presented reference implementation of the hybrid system shows good results. The aim is to cluster and visually analyze large amounts of user profiles, which should help in understanding Web user behavior and in personalizing advertisement.
1 INTRODUCTION
Clustering algorithms have evolved greatly since the advent of data-mining technologies, in terms of significance as well as performance and quality. Especially in view of the constantly growing number of algorithms, research on their usability and adaptivity seems inevitable in a practical environment. It is obvious that an efficient solution to all clustering problems cannot be concentrated in one algorithm. Therefore it is necessary to expand the field of research to a set of partial solutions. Some of these solutions are adaptable to changing practical requirements by exploiting synergies among their members.
This paper introduces CORD (Clustering of Ordinal Data), a hybrid clustering algorithm system. The main contributions of the work presented herein are:
- Definition of a more accurate distance measure for ordinal data (section 4.1).
- Extension of the known algorithms to work with ordinal data by using the described measure.
- Optimization of the time complexity of FuzzyCentroids (Kim et al., 2004) and an implementation that allows the results to be saved and therefore refined later (section 4.2).
- Improvement of the result of the clustering task by supporting the decision of how many centroids should be used and where the k centroids should be placed. This is made possible by our hybrid approach with NCSOM.
- Improvement of the clustering task so that it works with very large datasets. This is made possible by our hybrid approach with BIRCH.
- Definition of the normalization of the sum of ω_{l,t} for FuzzyCentroids (assumption in 3.1).
1.1 Problem Formulation
Clustering of elements is a common problem and has
been broadly discussed. All algorithms have to deal
with the issue of distance computation for the ob-
served elements. Most of the known approaches fo-
cus on numerical data, which allows measuring the distance by arithmetic operations, for example with the Euclidean or chi-squared distance metric. Others focus on cate-
gorical data ((Gan et al., 2005), (Huang, 1997), and
(Parmar et al., 2007)) and measure the distance e.g.
by counting the number of differing attribute values. This is known as the Hamming distance.
The focus of the CORD approach is to efficiently
cluster large amounts of ordinal data. Ordinal data,
unlike pure categorical data, possesses an inherent or-
der. Though this order cannot be utilized to improve a distance metric in general, there are specific cases where this kind of information can be a valuable asset and enhance the quality of the cluster analysis.
1.2 Problem Application: User Profiles
The motivation for the design of CORD arose from the Gugubarra project (Zicari et al., 2006), which lacked an appropriate solution for clustering and visually analyzing large amounts of user profiles. The project focuses on calculating different kinds of user profiles of Web site visitors and on understanding the user behavior. The non-obvious user profiles (NOPs) represent the interests of users in different topics. A NOP is a vector per user with numerical values between 0 and 1 for each topic. While a numerical value on its own is not very meaningful, these values can be mapped to a so-called scale of interest. Sometimes these profiles are stored on this scale right from the start. The scale of interest
classifies the values in a human understandable form
and is ordinal. An example of a scale of interest is:
no, little, medium, high, absolute interest. Another
kind of profile is the user feedback profile. Several
Web sites allow or ask their users to give feedback on their interests, hobbies or opinions about a product or the site itself. Such feedback can be given on a scale of interest over a number of topics, and the reply can then be stored in a feedback profile. To understand users better, it can be of interest to cluster across this feedback (i.e. what users say about their interests) and the NOP (i.e. what they are supposed to be interested in, derived from their click streams).
This paper presents the design and implementation of a clustering solution for user profiles and, in general, for clustering VLDBs of ordinal data whose attributes share the same cardinality and scale.
The rest of the paper is structured as follows: Sec-
tion 2 introduces the necessary definitions. Theoret-
ical foundations are presented in section 3. Section
4 describes the design and implementation of CORD.
Section 5 presents the experimental results. Related
work is compared in section 6.
2 DEFINITIONS
This section defines some of the concepts that will be
used in the rest of the paper.
Let N be the entire set of categorical records X:

  N = {X_1, ..., X_n},  |N| = n    (1)

with a relational schema R = {A_1, ..., A_r}, |R| = r. A_i, i = 1,...,r, is a categorical attribute with its domain D_i = DOM(A_i), respectively:

  R = {A_1, ..., A_r},  D_i = DOM(A_i),  i = 1,...,r    (2)

A domain D_i for attribute A_i is a set of categorical values a_{i,j}:

  D_i = {a_{i,1}, ..., a_{i,n_i}},  j = 1,...,n_i    (3)

A record X ∈ N is a tuple of r categorical values x_i ∈ D_i of the attributes A_1,...,A_r that belong to the relational schema R of N. The attribute value of X for an attribute A_i ∈ R is defined as A_i(X):

  X = (x_1, ..., x_r),  x_i ∈ D_i,  A_i(X) = x_i,  i = 1,...,r    (4)

The universe U = D_1 × D_2 × ... × D_r of all valid assignments for records of N is defined as the Cartesian product of the domains D_i of all attributes A_i belonging to R of N. Based on this definition, a distance measure d(X,Y) refers to a mapping d: U × U → ℝ that fulfills for every two records X, Y of U the following constraints:

  d(X, X) = d_0,  ∀ X ∈ U, d_0 ∈ ℝ
  d(X, Y) > d_0,  ∀ X, Y ∈ U, X ≠ Y
  d(X, Y) = d(Y, X),  ∀ X, Y ∈ U

A cluster partition C of N is defined as C = {c_j ⊆ N}, j = 1,...,k, where k is the number of clusters and where for every two clusters c_i, c_j the following applies: ∀ c_i, c_j ∈ C, i ≠ j: c_i ∩ c_j = ∅. The centroid Z of a cluster c_j is a record that is the best representative of the average of all records in the cluster.
3 THEORETICAL
FOUNDATIONS
As CORD is a hybrid clustering algorithm based on
the following three algorithms, these algorithms are
first described in the next subsections in more detail.
3.1 Fuzzy Centroids
The work by Kim et al. describes a fuzzy clustering algorithm for categorical data using fuzzy centroids (Kim et al., 2004); it is referred to as FuzzyCentroids in the following. This algorithm is an extension of Fuzzy k-Modes (Huang, 1997) and needs the number of clusters k a priori. The fundamental idea of this approach is to use fuzzy sets as centroids to overcome the information loss between iterations in Fuzzy k-Modes. The fuzzy sets allow building on information from previous iterations when choosing the centroids in the current iteration. In this way the center of a cluster can be calculated more accurately.
A fuzzy (set) centroid Z̃ is defined as follows:

  Z̃ = (z̃_1, ..., z̃_r),  z̃_l = {(a_{l,1}, ω_{l,1}), ..., (a_{l,n_l}, ω_{l,n_l})}    (5)

with a_{l,t} ∈ A_l, |A_l| = n_l, 0 ≤ ω_{l,t} ≤ 1, 1 ≤ l ≤ r, 1 ≤ t ≤ n_l, and Σ_{t=1}^{n_l} ω_{l,t} = 1, 1 ≤ l ≤ r. The values ω_{l,t} build for each cluster the confidence degree matrix W_i = [ω_{l,t}], 1 ≤ i ≤ k. Each z̃_l ∈ Z̃ is determined by the category distribution of attribute A_l of the records belonging to the cluster. ω_{l,t} indicates the confidence degree (or resonance) with which the attribute value a_{l,t} contributes to z̃_l.
In the first step, initial centroids are chosen by selecting the values ω_{l,t}. In the second step, the membership matrix M, formula (6), of the values µ_{i,j} is calculated. The fuzziness modifier m and the distance measure d(Z̃, X) = Σ_{l=1}^{r} δ(z̃_l, x_l), which takes the ω_{l,t} into account, are both described in (Kim et al., 2004).

  µ_{i,j}(t+1) = [ Σ_{l=1}^{k} ( d(Z̃_i, X_j) / d(Z̃_l, X_j) )^{1/(m−1)} ]^{−1}    (6)
In step three, the ω_{l,t} of all centroids will be updated, using the following stop condition, i.e. as long as there is an improvement in minimizing the function:

  J_m(M, Z̃) = Σ_{i=1}^{k} Σ_{j=1}^{n} (µ_{i,j})^m d(Z̃_i, X_j)    (7)

subject to

  Σ_{i=1}^{k} µ_{i,j} = 1,  0 < Σ_{j=1}^{n} µ_{i,j} < n    (8)

In case the stop condition is not fulfilled, step two is repeated. Otherwise it stops. For further details, please refer to (Kim et al., 2004).
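As an illustration of formula (6), the following sketch (our own, not part of the reference implementation; the fuzziness modifier m and the toy distances are arbitrary) computes the membership degrees of a single record from its precomputed distances to the k fuzzy centroids.

def memberships(distances, m=1.5):
    # Membership per formula (6): mu_i = [ sum_l (d_i / d_l)^(1/(m-1)) ]^(-1).
    # Assumes all record-to-centroid distances are strictly positive.
    exponent = 1.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_l) ** exponent for d_l in distances)
            for d_i in distances]

if __name__ == "__main__":
    # Hypothetical distances of one record to k = 3 fuzzy centroids.
    mu = memberships([2.0, 4.0, 6.0])
    print(mu, sum(mu))  # memberships favor the closest centroid and sum to 1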
It should be noted that ω_{l,t}, as defined in (Kim et al., 2004), can only have values between 0 and 1, and the sum of ω_{l,t} for t = 1,...,n_l has to be equal to 1. However, because ω_{l,t} is accumulated as a sum over µ_{i,j}^m, the result can be higher than 1. This problem is not mentioned in (Kim et al., 2004) and therefore the following assumption is made: the ω_{l,t} are normalized after the summation. For each attribute l, the sum Σ_{t=1}^{n_l} ω_{l,t} is calculated; afterwards, each ω_{l,t} that is part of the summation is divided by this sum, so that Σ_{t=1}^{n_l} ω_{l,t} = 1, 1 ≤ l ≤ r, holds again.
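A minimal sketch of this normalization assumption (our own illustration, not the reference implementation): every attribute row of a cluster's confidence degree matrix is rescaled so that its ω_{l,t} sum to 1 again.

def normalize_confidence_matrix(omega):
    # omega is a list of rows, one per attribute l, holding the accumulated
    # resonances omega_{l,t}; each row is divided by its own sum.
    result = []
    for row in omega:
        total = sum(row)
        result.append([w / total for w in row] if total > 0 else list(row))
    return result

if __name__ == "__main__":
    # Hypothetical accumulated resonances for two attributes with three categories.
    print(normalize_confidence_matrix([[0.8, 0.6, 0.2], [1.5, 0.3, 0.0]]))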
3.2 BIRCH
BIRCH, introduced in (Zhang et al., 1996), is a well-known and broadly studied clustering method for VLDBs. The whole dataset is iterated at most once. A cluster feature tree is built as a representation of the data. Each leaf of the CF tree is a set of cluster features (CF). A CF saves information about the centroid and the number of records in the cluster. This summary is small enough to be held in the fast main memory allocated by BIRCH. Using this summary, the distance measurements for numerical data (Zhang et al., 1996) can be used between cluster features and records. In a further phase a clustering algorithm can process the leaves and is therefore faster than working on the original dataset. A cluster feature of a cluster c_i for numerical data is a triple CF_i:

  CF_i = (N_i, LS_i, SS_i)    (9)

with N_i the number of records in c_i, LS_i = Σ_{X ∈ c_i} X the linear sum (vector) of the attribute values, and SS_i = Σ_{X ∈ c_i} X² the corresponding square sum.
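To make the summary of formula (9) concrete, here is a small sketch (ours, for the numerical case of (Zhang et al., 1996), not for the ordinal adaptation described in section 4.3): a cluster feature keeps only the triple (N, LS, SS), two features are merged by componentwise addition, and the centroid follows directly from the triple.

class ClusterFeature:
    # Numerical cluster feature CF = (N, LS, SS) as in formula (9).
    def __init__(self, point):
        self.n = 1                           # N_i: number of records
        self.ls = list(point)                # LS_i: linear sum of the records
        self.ss = sum(x * x for x in point)  # SS_i: square sum of the records

    def absorb(self, other):
        # Merging two CFs is a componentwise addition of the triples.
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [s / self.n for s in self.ls]

if __name__ == "__main__":
    cf = ClusterFeature([1.0, 2.0])
    cf.absorb(ClusterFeature([3.0, 4.0]))
    print(cf.n, cf.centroid(), cf.ss)  # 2 [2.0, 3.0] 30.0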
BIRCH is more or less an abstract blueprint for building an algorithm for VLDBs. Zhang et al. describe an implementation of this approach for numerical data, using four distance measures and an agglomerative hierarchical clustering algorithm. For further details, please refer to (Zhang et al., 1996).
3.3 NCSOM
The NCSOM algorithm (Chen and Marques, 2005) is an extension of the Self-Organizing Map (SOM) algorithm to categorical data and builds upon the artificial neural networks described by Teuvo Kohonen. The data structure of the NCSOM algorithm is a two-dimensional network of K neurons. The network is square, i.e. all sides have the same number of neurons. Each neuron has a randomly initialized reference vector m_j = [m_{j,1}, ..., m_{j,r}], where r is
the number of attributes as defined in section 2. Dur-
ing the iteration, these reference vectors change dy-
namically, since they adapt to the records already pro-
cessed and likewise to their neighbor neurons. Thus a
map of reference vectors forms, whereby similar vec-
tors lie closer to each other. The algorithm demands
no knowledge of the data and is able to determine a
global optimum of the cluster partition. For further
details of NCSOM, see (Chen and Marques, 2005).
4 CORD HYBRID ALGORITHM
AND IMPLEMENTATION
(Cheu et al., 2004) describes several possibilities to use a hybrid algorithm. The word hybrid has its origin in the Latin word hybrida and means a crossing or mixture. A hybrid is the combination of two or more different things, aimed at achieving a particular objective. In this case, a hybrid concept is
used to cluster large amounts of ordinal data. The
first phase of the concept processes the whole dataset
once to create a summary of its most distinctive features that is small enough to be held in main memory.
A modification of BIRCH (Zhang et al., 1996), de-
scribed in subsection 4.3, is used in this phase. The
clustering task of the main phase uses the BIRCH
summary as input. Therefore the algorithm in (Kim
et al., 2004) was modified as described in subsec-
tion 4.2. A preliminary data analysis phase can
be placed before the main clustering task to review
the structure of the summary and to manually opti-
mize its initialization by choosing the initial cluster
centers. The modification of the NCSOM algorithm
(Chen and Marques, 2005) is used here for creating
a similarity map, in which good places for initial cluster centers can be seen and picked visually. The results of the main clustering phase, consisting of k weight matrices [ω_{l,t}], one for each of the k clusters, where l runs over the attributes and t over the categories of each attribute, can then be stored efficiently.
Moreover, the ability to save the weight matrices cre-
ates the possibility to pause and resume the main clus-
tering task. A refinement of an already achieved result
can then be done in the same way, allowing the use of
the old result as initialization, so that the clustering
does not have to be restarted completely.
4.1 Distance Measurement
We define in the following a distance measure for
ordinal data. The motivation is as follows: The at-
tributes of the user profiles as described in subsection
1.2 have the same cardinality and scale of attribute
values. Using the scale of interest as domain results in comparable attributes with the same context and symbolism, namely the user interest in different topics. This data is ordinal.
Measurements for text distances are not appropriate, because the letters of the labels themselves do not play any role. Measurements for numerical data are also not appropriate, as the ordinal data cannot be transformed into numbers. For example, if the feedback is given on an ordinal scale, then this data cannot easily be transformed into numerical values, because ordinal data only offers information about the ordering of elements; the distance between these elements is not defined. This is the central matter for this research field and is discussed in statistics. For example, Podani has discussed this issue for Braun-Blanquet dominance (Braun-Blanquet et al., 1932): “it is inappropriate to analyse Braun-Blanquet abundance/dominance data by methods assuming that Euclidean distance is meaningful” (Podani, 2005).
Measurements for categorical data could be applied, as done by the clustering algorithms CACTUS, ROCK or STIRR. However, this seems insufficient to us, as ordinal data can offer more information (i.e. the ordering) than simple categorical data. The attribute values of the scale of interest are ordered and quasi-equidistant. Therefore the number of categories between two attribute values is taken into account. We extended the distance metric to include this ordering in the computation, see formula (10). We accept a small error, as it is only a quasi-equidistance, but the distance measure is more precise with these modifications than without them. An example: using the Hamming distance for categorical data, the distance between small and high is equal to the distance between small and medium. With our proposed distance measure in formula (10), the first distance is higher than the second, which makes sense.
The distance measure d(X,Y) for two records X and Y of dataset N is given as follows:

  d(X,Y) = Σ_{l=1}^{r} δ(x_l, y_l),    (10)

with r the number of attributes as defined in section 2 and:

  δ(x_l, y_l) = 0, if x_l = y_l;  |Rank(x_l) − Rank(y_l)|, otherwise    (11)

Rank(x_l) is the position of x_l in the order of the scale. A distance d(X,Y) = 1 between two records means that only one categorical attribute value differs and that the two categories are ordered next to each other. This is the smallest distance two distinct ordinal records can have. The next larger distance is 2; the distance concept is thus discrete.
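The following sketch (our own illustration, not the reference implementation) applies formulas (10) and (11) to the scale of interest from subsection 1.2; attribute values are compared by their rank on the shared scale, so adjacent categories differ by 1 and distant categories by more.

SCALE = ["no", "little", "medium", "high", "absolute"]  # scale of interest (1.2)
RANK = {value: position for position, value in enumerate(SCALE)}

def delta(x, y):
    # Per-attribute distance, formula (11): 0 if equal, else the rank difference.
    return 0 if x == y else abs(RANK[x] - RANK[y])

def distance(record_x, record_y):
    # Record distance, formula (10): sum of the per-attribute rank differences.
    return sum(delta(x, y) for x, y in zip(record_x, record_y))

if __name__ == "__main__":
    x, y, z = ["little", "high"], ["medium", "high"], ["high", "high"]
    print(distance(x, y), distance(x, z))  # 1 2 -- Hamming distance would give 1 1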
1  getMatrixRow(W, X ∈ N, cost) {
2    µ = (µ_1, ..., µ_k);
3
4    for (W_j ∈ W, j = 1,...,k) {
5      d_j = getDistance(W_j, X); }
6    for (W_j ∈ W, j = 1,...,k) {
7      µ[j] = 1 / power(d_j / sum(d_j), 1 / (m − 1));
8      cost += µ_j · getDistance(W_j, X); }
9    return µ;
10 }
Listing 1: Membership calculation of one record.
1  new FuzzyCentroids(k) {
2    W1, W2 = {W_j}, j = 1,...,k;
3    M = [µ_{x,y}], x = 1,...,n, y = 1,...,k;
4    cost_1 = 0;
5    cost_0 = ∞;
6    i_N = 0;
7
8    for (W_j ∈ W1, j = 1,...,k) {
9      W_j = initializeFuzzyCentroid(j); }
10   while ((cost_0 < cost_1 || i_N < 1) && i_N < max)
11   {
12     cost_1 = cost_0;
13     for (X_i ∈ N, i = 1,...,n) {
14       µ_i = getMatrixRow(W1, X_i, cost_0);
15       W2 = updateCentroids(µ_i, W2, X_i); }
16     switchPlaces(W1, W2);
17     i_N++; }
18 }
Listing 2: Implementation of the modified FuzzyCentroids algorithm (main phase of CORD).
4.2 Optimizing FuzzyCentroids for
Implementation
The steps of FuzzyCentroids as defined in (Kim et al.,
2004) are not optimized. The time complexity is
given as O(kn(r + M)s), where M = Σ_{l=1}^{r} n_l is the total number of categories over all attributes. However,
for our problem of clustering very large databases, it
makes sense to have a closer look at the real runtime.
Implementing the algorithm as described by Kim et
al. would need 5 iterations through the whole dataset.
Our first aim is minimizing the iteration cycles.
For the minimization, we changed the steps of the algorithm. The calculation of ω_{l,t} and of µ_{i,j} each requires an iteration through the dataset. However, it is possible to perform both steps successively for a record. In addition, the calculation of the cost function can be done in parallel (listing 1, line 8). The modifications do not change the functionality of FuzzyCentroids, but decrease the computation time. As a result, the modified version of FuzzyCentroids iterates only once through the whole dataset in the main method. All steps are done in the for loop for each X_i, see listing 2, line 13. The confidence degree matrices of all k clusters are saved twice in the third-order tensors W1 and W2.
1  getDistance(X ∈ N, W_l) {
2    d = 0;
3
4    for (i = 1,...,r) {
5      for (j = 1,...,n_i) {
6        if (x_i ∈ X != a_{i,j} ∈ D_i) {
7          d += |x_i − a_{i,j}| · W_l[i][j]; } } }
8    return d;
9  }
Listing 3: New distance measure record-to-centroid.
1  updateCentroids(µ, W, X ∈ N) {
2    for (W_j ∈ W, j = 1,...,k) {
3      for (i = 1,...,r, x_i ∈ X) {
4        ω_{i,x_i} += Math.pow(µ_j, m); } }
5    return W;
6  }
Listing 4: Confidence degree (resonance) calculation.
This allows accessing, in each iteration, the result from the previous iteration while saving the current one. The method switchPlaces swaps the roles of the two tensors between iterations.
The main method, shown in listing 2, executes two further methods. The first, getMatrixRow (listing 1), computes the membership values µ_j for the current record X_i and the cost function. This differs from (Kim et al., 2004), where the membership matrix is computed for all X ∈ N in one step and the cost function in another step. The computations for membership and cost both rely on the distance measurement.
The getDistance method is shown in listing 3.
For our needs, the distance calculation of (Kim et al.,
2004) is modified according to the considerations in
subsection 4.1 and formula (10).
Listing 4 shows the update of the centroids. The new resonance ω_{l,t} is calculated incrementally by taking into account the membership degrees of one record after another. The new tensor W2 results at the end of an iteration. As described in section 3.1, we perform a normalization of ω_{l,t} at the end of each iteration.
4.3 Increasing Speed with BIRCH
(Chiu et al., 2001) introduces a procedure building on BIRCH that uses a hierarchical clustering algorithm. With the help of a distance measure based on maximum likelihood estimation, it works on mixed numerical and categorical data. As the main goal of our work is not to build a cluster hierarchy but a partition, and we deal with ordinal data only, further improvements in the selection of the distance measure and the algorithm can be realized. The distance measure described in 4.1 provides a suitable and useful distance concept that can be applied here, too.
As described in 3.2, BIRCH builds the so-called cluster features. Each CF is initialized with one
record as centroid. Other records are absorbed as long as they are similar within a certain distance interval. The parameter T controls the distance interval size (radius). Using the described distance measure for ordinal values, the parameter T defines how many attribute values may differ from the centroid of the CF for a record to still be part of the subset represented by the CF. Thereby the information stored for each CF is simplified, too: contrary to (Zhang et al., 1996), only the centroid has to be held in memory.
If the size of the CF tree approaches the provided memory during execution, the tree must be reduced. Here the parameter T is used: T is increased and the CFs of the old tree leaves are inserted into a new tree. Due to the larger radius, some of the CFs of the old tree absorb other CFs in the new tree. The tree becomes smaller and, in addition, more diffuse. This is a welcome side effect, since outliers then have only a marginal influence on the result, as they disappear in the crowd.
The optimal choice of the parameter T is a difficult task. In (Zhang et al., 1996) a heuristic is used to determine T_{i+1} from the previous value T_i, the number of already processed records N_i, and the size of the whole dataset N. For our application (see 1.1), we define and use the following alternative heuristic:

  T_{i+1} = 1 + T_i + log_2(N / N_i) · T_i    (12)

This heuristic should provide some advantages: at the beginning, when the CF tree is too big, the lower the ratio of processed records to dataset size, the more the radius increases. At the end, when most records have been processed, the increase of the radius flattens but still remains high enough to ensure that no excessive reconstruction of the CF tree takes place. At the start, T_0 = 0 and therefore T_1 = 1.
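A small sketch of the heuristic in formula (12) (our own illustration; the checkpoints at which the tree actually overflows depend on the data and are invented here): the radius grows strongly while only few records have been processed and flattens towards the end.

import math

def update_radius(t_prev, processed, total):
    # Formula (12): T_{i+1} = 1 + T_i + log2(N / N_i) * T_i
    return 1 + t_prev + math.log2(total / processed) * t_prev

if __name__ == "__main__":
    total = 20_000       # dataset size N
    t = 0                # T_0 = 0, hence T_1 = 1 regardless of the logarithm term
    for processed in (1_000, 4_000, 10_000, 16_000, 19_000):  # hypothetical N_i
        t = update_radius(t, processed, total)
        print(processed, round(t, 2))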
The first phase of CORD is based on BIRCH and shown in listing 5. CFTree is the data structure of the CF tree and CF_i is the currently processed cluster feature. Each record X ∈ N is first transformed into a cluster feature with createClusterFeature, listing 5, line 8. If the tree is not too big, the method addToNode adds the new CF into the tree by finding the correct position. If the CF tree reaches the provided memory limit, the processing of cluster features is paused and updateRadius is called to increase the radius using the heuristic of formula (12). After this step, the method rebuildCFTree inserts the CFs of the old tree's leaves into a new tree. Afterwards the processing continues with the next record. Finally the set S of records, i.e. the set of CF centroids of all leaves in the CF tree, is returned.
The method addToNode is shown in listing 6. A
1  S = {center(CF_1), ..., center(CF_{n_S})},
2  N = {X_1, ..., X_n}
3
4  birch(N) {
5    T = 0;
6
7    for (X ∈ N) {
8      CF_i = createClusterFeature(X);
9      if (size(CFTree) <= maxSize − size(CF_i)) {
10       CFTree =
11         addToNode(CFTree, CF_i, T); }
12     else {
13       T = updateRadius(T);
14       CFTree =
15         rebuildCFTree(CFTree, T);
16       addToNode(CFTree, CF_i, T); } }
17   return S = buildSummary(CFTree);
18 }
Listing 5: Implementation of the first phase of CORD, based on BIRCH.
1  addToNode(CFTree, CF, T) {
2    node = min_z(node | node ∈ root(CFTree), z = d(CF, node));
3
4    while (!isLeaf(node)) {
5      node = min_z(Child | Child ∈ node, z = d(CF, Child)); }
6
7    if (size(node) < L) {
8      for (CF_i ∈ node) {
9        if (d(CF_i, CF) < T)
10         return; }
11     node.add(CF);
12     return; }
13   else {
14     M = {CF_i ∈ node} = splitLeaf(node);
15     for (CF_i ∈ M ∪ {CF})
16       CFTree =
17         addToNode(CFTree, CF_i, T); }
18   return CFTree;
19 }
Listing 6: Addition of a cluster feature to a CF tree node.
CF tree node is denoted by node. A leaf node is composed of a set of at most L CFs, any other node of a set of at most B CFs. First, the appropriate leaf for adding the CF is searched recursively. The CF is then added taking the restrictions into account: the distance function d checks whether the CF to be inserted is absorbed by one of the CFs of its appropriate leaf. d uses the distance measurement according to subsection 4.1.
If the appropriate leaf is full, it has to be split into two pieces with splitLeaf, and the method addToNode is recursively executed for each of them. If there is no space in the leaf, the recursion continues splitting the CF tree nodes until there is enough space. If the procedure splits the tree root into two nodes, these nodes are attached to a newly built root; the tree depth thereby grows by one level.
This implementation of BIRCH is used for the
first phase of CORD. In the main phase, the result
from BIRCH is clustered by the modified FuzzyCen-
troids described in 4.2. The result from BIRCH can
also be processed by a phase in-between, as described
in the following section. This new phase is used to in-
crease quality by finding adequate initial centroids.
4.4 Increasing Quality with NCSOM
Before the dataset summary from BIRCH can be clus-
tered by FuzzyCentroids, the number k of centroids
and their values have to be selected. This phase is cru-
cial for the quality of the final result. For this issue the
NCSOM algorithm can help by discovering the clus-
tering structure, which is visualized as follows: The
NCSOM forms a map of reference vectors, whereby
similar vectors lie closer to each other, as described
in section 3.3. The distance between two neighboring vectors can be illustrated on a palette of gray tones: bright colors are used for small distances, darker colors with increasing distance. A similarity map of the data builds up as shown in figure 1. Groups of similar neurons can be identified as bright areas on the similarity map. If the user then selects a neuron from the center of such a bright area, its reference vector will be quite a good choice as centroid of the records represented by this area. This representation is thus not only an aid for finding the number of centroids that exist in the clustering structure; it also helps to specify the initial centroid allocation for the subsequent cluster analysis.
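As one possible way to derive such a gray-tone map (a sketch of our own, not the Gugubarra implementation), each neuron can be shaded by the average distance of its reference vector to those of its direct neighbors; small averages correspond to bright cells, large averages to dark cells.

def similarity_map(ref, distance):
    # ref is a p x p grid of reference vectors, distance a record-to-record measure.
    # Returns one shade value per neuron: the mean distance to its 4-neighborhood.
    p = len(ref)
    shades = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            neighbors = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            dists = [distance(ref[i][j], ref[a][b])
                     for a, b in neighbors if 0 <= a < p and 0 <= b < p]
            shades[i][j] = sum(dists) / len(dists)  # low value = bright area
    return shades

if __name__ == "__main__":
    # Toy 2 x 2 map whose reference vectors hold a single ordinal rank each.
    grid = [[[0], [1]], [[3], [4]]]
    rank_distance = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
    print(similarity_map(grid, rank_distance))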
We modified the distance measure of NCSOM to use it with ordinal data only, see formulas (13) and (14). The modifications are made according to 4.1.

  d(X_i, m_j) = Σ_{l=1}^{r} δ(x_{i,l}, m_{j,l}),    (13)

with

  δ(x_{i,l}, m_{j,l}) = 0, if x_{i,l} = m_{j,l};  |x_{i,l} − m_{j,l}|, otherwise    (14)
The implementation of NCSOM is shown in listing 7. The parameter δ(t) gives a monotonically decreasing radius and s_i the vector space coordinates of a neuron in the network. This leads to a fast move of the neurons towards their approximate place; in the further process they adapt only slightly until they arrive at their final place. A good initial choice for δ(t) is the diagonal length of the network.
Step 1 of NCSOM is implemented in lines 2 and 3, listing 7. initializeDelta initializes the parameter δ(t) and initializeArtificialNetwork initializes the reference vectors randomly. The data structure of the p × p neuron network is a matrix M_{p×p} = [m_{i,j}]. ref(m_{i,j}) is the reference vector of the respective neuron, pos(m_{i,j}) the vector of its position in the network. nsum(m_{i,j}) contains in each case a matrix V_{r×n_r} = [v_{s,t}] for s = 1,...,r attributes and t = 1,...,n_r
Figure 1: Similarity map (left) and a reference vector m_j of one neuron (right).
1  NCSOM(size, maxIterations) {
2    δ_t = initializeDelta(size, maxIterations);
3    initializeArtificialNetwork(M, size);
4
5    while (δ_t > 0) {
6      for (X_l ∈ N, l = 1,...,n) {
7        m_{i,j} = findBestMatchingUnit(M, X_l);
8        saveNeighbourSums(X_l, m_{i,j}, M, δ_t); }
9      for (i = 1,...,p, j = 1,...,p) {
10       batchUpdateUnit(m_{i,j}); }
11     δ_t−−; }
12 }
Listing 7: Implementation of the second phase of CORD, based on NCSOM.
categories. This matrix saves the summations that are
used in step 3 (see below).
The distance measurement is used during the selection of the best matching neuron (Step 2). The best matching neuron for a record is searched in each iteration with the method findBestMatchingUnit, line 7, listing 7. This is the neuron whose reference vector has the smallest distance to the record according to formulas (13) and (14). The implementation of the distance measurement is shown in listing 8. After the winner neuron is found, the information necessary for the update has to be collected. Listing 9 shows the calculation of the matrices V_{r×n_r} = [v_{s,t}]. It computes the occurrence frequency of the categorical values for the r attributes of the neighborhood records, using the Gaussian function defined in (Chen and Marques, 2005). The computational complexity of this step is O(n²), as for each record and neuron a matrix of sums has to be processed.
The update of the reference vectors (Step 3) is
1  getDistance(X ∈ N, m_{i,j}) {
2    d = 0;
3    for (i = 1,...,r) {
4      for (j = 1,...,n_i) {
5        if (x_i ∈ X != a_{i,j} ∈ D_i) {
6          d += |x_i − a_{i,j}|; } } }
7    return d;
8  }
Listing 8: New distance measure between record and neuron.
done in lines 9 to 10, listing 7, at the end of each iteration. The details of the implementation are shown in listing 10. The neurons are updated according to formulas 15 and 17 in (Chen and Marques, 2005). The summations are used in line 5. This update requires one iteration through the whole dataset. This is relevant in the case of VLDBs, as the computation time is linear in O(maxIterations · n) on the dataset size n.
An important factor is the network size. The calculation in listings 9 and 10, respectively, is quadratic (O(p²)) in the number of neurons p on one side. Summarizing, the best-case computation time is approximately at least maxIterations · n, as the algorithm always executes the maximum number of steps. This is necessary, since the number of steps is attached to the function δ_t. However, also in this case, the dataset must be read only once per iteration. This batch algorithm has a better accuracy than an online algorithm but requires memory for an additional data structure: the matrices V_{r×n_r} of the neurons.
5 EXPERIMENTAL RESULTS
To evaluate CORD, we performed tests with two dif-
ferent datasets. Subsection 5.1 describes the tests with
a dataset of 100,000 records and 5.2 the tests with 14
million records. As with work of this nature, i.e. a hybrid approach, a comparison is difficult, especially as the proposed solution is tailored to a specific use case. In our solution, the three phases of CORD as described in section 4 can be processed independently; for example, one can select an automatic solution without using NCSOM in between. Therefore we decided to evaluate each phase separately.
In general algorithms are dependent on the read-
1  saveNeighbourSums(X ∈ N, m_z, M, δ_t) {
2    for (k = 1,...,p, l = 1,...,p) {
3      for (i = 1,...,r) {
4        for (j = 1,...,n_i) {
5          nsum(m_{k,l})[i][x_i] +=
6            exp(−‖pos(m_{k,l}) − pos(m_z)‖² / 2δ_t²); } } }
7  }
Listing 9: Calculation of the categorical value frequency of the records in the neighborhood.
1  batchUpdateUnit(m) {
2    for (i = 1,...,r) {
3      double s = 0.0;
4      for (j = 1,...,n_i) {
5        s += j · nsum(m)[i][j] / sum(nsum(m)[i]); }
6      ref(m)[i] = round(s); }
7  }
Listing 10: Update of the neuron reference vector at the end of each iteration.
Figure 2: Reading time for the naive approach and paging. Results for small numbers of records (left) and large datasets (right).
ing time of the data. Therefore we performed an optimization of the reading process, which we named id range paging (results shown in figure 2). A database query loads intervals of records (as pages) from the database on the basis of the primary key.
The id range paging processes approximately 50 records per millisecond. It is clearly observable that id range paging outperforms the naive approach, for which an out-of-memory exception occurred after 2 minutes and one million records. Further, we tested the influence of different page sizes on the reading time; a page size of 500 provided the best results.
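A rough sketch of the id range paging described above (our own; table and column names are placeholders and sqlite3 merely stands in for the actual database): records are fetched in primary-key ranges of a fixed page size, so that only one page is held in memory at a time.

import sqlite3

def id_range_pages(connection, page_size=500):
    # Yield the table page by page, each page being an id range of the primary key.
    last_id = 0
    while True:
        rows = connection.execute(
            "SELECT id, topic, interest FROM profile_records "
            "WHERE id > ? ORDER BY id LIMIT ?", (last_id, page_size)).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]  # continue after the last id of this page

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE profile_records "
                 "(id INTEGER PRIMARY KEY, topic TEXT, interest TEXT)")
    conn.executemany("INSERT INTO profile_records (topic, interest) VALUES (?, ?)",
                     [("news", "little")] * 1200)
    print(sum(len(page) for page in id_range_pages(conn)))  # 1200 records in 3 pages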
5.1 Dataset User Profiles
Input for the test was a dataset of approx. 100,000
records (Gugubarra, 2009) according to the problem
application user profiles (see 1.2). The test was done
on a PC with Intel Core 2 Duo T7500 processor with
2.20GHz and 2GB DDR2 SDRAM main memory.
The module for the FuzzyCentroids was tested
with the result of BIRCH as input. Therefore the CF
tree is held in the main memory and the FuzzyCen-
troids can be executed quite fast. The computation
time depends linearly on the number of attributes.
The average of three test runs is shown in figure 3(b)
for 10, 15, 20, and 100 attributes and 1,000 records.
The number 100 is normally not exceeded for the
described user profiles (see subsection 1.2) and
supports the regression line built out of the results
for the other attribute numbers. For 100 attributes the
computation time was 150 milliseconds. The results for the number of records are similar: the computation time depends linearly on it, as shown in figure 3(a) for 1,000, 2,000, 3,000 and 105,691 records and 10 attributes. For 105,691 records the computation time was about 2 seconds, which supports the regression line that is shown.
For testing the computation time of the BIRCH module, we focused on the most interesting criterion: the case when the CF tree is too big for the main memory. At this point the CF tree has to be rebuilt, which
Figure 3: FuzzyCentroids evaluation for number of records (a) and attributes (b). BIRCH evaluation: Increase in the radius
for 10000 records (c) and 20000 records (d).
takes the most time. In order to ensure that no excessive reconstruction of the CF tree takes place, we proposed a heuristic in 4.3. Its functionality was verified empirically by repeated runs of the module for different numbers of records and CF tree parameters. The average of 3 test runs is shown in figure 3(c),(d) for 10,000 and 20,000 records.
The parameters were set as follows: the maximum number of nodes as 5, the maximum number of CFs in one leaf as 15, and the maximum size of the CF tree as 5,000,000 bytes. The results show that at the end, when most records are processed, the increase in the radius is smaller. Most importantly, the steps are big enough to ensure a fast inclusion of the whole dataset into the CF tree; for 20,000 records, only 5 enlargements were needed. The curve is not a perfect logarithmic function, because the dataset used is uniform.
Similar to the FuzzyCentroids, the computation
time of NCSOM depends linearly on the number of
records and attributes. A more interesting evaluation concerns the influence of the neural network size, which we report here instead. The time com-
plexity is quadratic. The average of three test runs
is shown in figure 4 for 1,000 records with different
network sizes. The quadratic dependency is clearly
observable. Further the running time depends on the
number of neighbor neurons used for the computation
of the similarity map. Since this computation is im-
plemented at the end of each iteration and not for each
record, the effect is small compared with other steps.
Figure 4: NCSOM evaluation for neural network size.
5.2 Dataset MSNBC.COM Web Data
The second performance test was done to prove the
scalability of the system. It was done on a Windows
Server with two Intel Xeon E5420 processors with
2.50GHz and 8GB main memory.
The dataset MSNBC.COM Data Set was taken
from the UCI ML Repository (D.J. Newman, 2007).
The data comes from the Server logs for msnbc.com
and news-related portions of msn.com. Each se-
quence in the dataset corresponds to page views of
a user during one day. Each event in the sequence
corresponds to a user’s page request. Requests are not
recorded at the page level but rather, they are recorded
at the level of page topics. The fourteen topics we
used are “frontpage”, “news”, “tech”, “local”, “opin-
ion”, “on-air”, “misc”, “weather”, “health”, “living”,
“business”, “sports”, “summary”, and “bbs”. The full
dataset consists of approx. one million users (i.e.
989,818).
This data is very useful, as it represents clickstreams associated with topics. From this data, it is easy to generate profiles that are similar to the NOPs described in subsection 1.2. We mapped the number of page visits onto an ordinal scale, representing which topics the user has more or less “touched”.
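A sketch of this kind of mapping (the bin boundaries are invented placeholders, not the values used in the experiments): per-topic visit counts are binned onto the scale of interest from subsection 1.2.

SCALE = ["no", "little", "medium", "high", "absolute"]

def to_ordinal(visits, thresholds=(0, 2, 5, 10)):
    # Count how many (hypothetical) thresholds the visit count exceeds
    # and use that as the rank on the scale of interest.
    return SCALE[sum(visits > t for t in thresholds)]

if __name__ == "__main__":
    counts = {"frontpage": 12, "news": 3, "sports": 0}
    print({topic: to_ordinal(c) for topic, c in counts.items()})
    # {'frontpage': 'absolute', 'news': 'medium', 'sports': 'no'}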
In the Gugubarra Project the number of topics (at-
tributes) changes dynamically. Therefore each at-
tribute value is saved in a separate record. For the
dataset MSNBC.COM we chose the same domain-specific database schema to have a realistic and comparable situation. As a result, CORD had to work on attribute × user records, i.e. 14 million records.
The average of 3 test runs for FuzzyCentroids us-
ing BIRCH was a computation time of approx. 3,300
seconds. This is a good result compared with (Kim
et al., 2004). The average of 3 test runs for NCSOM using BIRCH was a computation time of approx. 3,400 seconds. The tree size decreased from 9,200 KBytes
to 1,800 KBytes, when the radius increased to 3. The
results are similar because each of the tests had to read
the records from the database. This was the slowest
part as a lot of disk reads had to be done.
6 RELATED WORK
In the area of clustering Web users, it is a new ap-
proach to take into account the non-obvious profiles,
described in (Zicari et al., 2006). Clustering non-
obvious-profiles with CORD takes indirectly into ac-
count the time spent by the user on a page and the con-
tent topics of this page. CORD clusters users based
on their supposed interest in these topics. With the
fuzzy centroids, it gives an interpretation to the clus-
ters as a value of a predefined scale of interest. This is a notable advantage. Several algorithms have been proposed in the area of clustering large datasets, such as
BIRCH (Zhang et al., 1996) and CLARANS (Cluster-
ing Large Applications based on RANdom Search) by
Ng and Han (Ng and Han, 1994). There are special-
ized fields, e.g. multi-relational data clustering. Yin
and Han proposed here CrossClus (Yin et al., 2007),
that clusters data stored in multiple relational tables
based on user guidance and multi-relational features.
Like CORD, this algorithm requires the help of the user, i.e. the person who wants to cluster the elements. Clustering can be applied to various
domains and issues, e.g. in (Aggarwal et al., 2006)
the k-anonymity (a technique to preserve privacy in
data) is treated as a special clustering problem, called
r-cellular clustering. (Aggarwal et al., 2006) han-
dle categorical attributes by the representation as n
equidistant points in a metric space. Hybrid Systems
are used in various research fields; in the area of the Web, e.g., Burke (Burke, 2002) has defined Hybrid Recommender Systems, which combine information filtering
and collaborative filtering techniques. Helmer pro-
posed in (Helmer, 2007) a hybrid approach to mea-
sure the similarity of semistructured documents based
on entropy. (Kossmann et al., 2002) use a hybrid
approach to find the Skyline, i.e. a set of interesting
points from a potentially large set of data.
REFERENCES
Aggarwal, G., Feder, T., and Kenthapadi, K. (2006).
Achieving anonymity via clustering. In Proc. of the
25th ACM SIGMOD-SIGACT-SIGART symposium on
Principles of database systems, pages 153–162, NY,
USA.
Braun-Blanquet, J., Conard, H. S., and Fuller, G. D. (1932).
Plant sociology. McGraw-Hill book company.
http://www.biodiversitylibrary.org/bibliography/7161.
Burke, R. (2002). Hybrid recommender systems: Survey
and experiments. User Modeling and User-Adapted
Interaction, 12(4):331–370.
Chen, N. and Marques, N. C. (2005). An extension of self-
organizing maps to categorical data. In EPIA, Portu-
gal.
Cheu, E. Y., Kwoh, C. K., and Zhou, Z. (2004). On the
two-level hybrid clustering algorithm. Nanyang Tech-
nological University.
Chiu, T., Fang, D., Chen, J., Wang, Y., and Jeris, C. (2001).
A robust and scalable clustering algorithm for mixed
type attributes in large database environment. In Pro-
ceedings of the 7th ACM SIGKDD, pages 263–268,
NY, USA.
D.J. Newman, A. A. (2007). UCI machine learning reposi-
tory. http://archive.ics.uci.edu/ml/.
Gan, G., Yang, Z., and Wu, J. (2005). A genetic k-modes
algorithm for clustering categorical data. In ADMA,
pages 195–202.
Gugubarra (2009). Data set user profiles. www.dbis.cs.uni-
frankfurt.de/downloads/research/data.zip.
Helmer, S. (2007). Measuring the structural similarity of
semistructured documents using entropy. In Proc. of
the 33rd Int. Conf. on VLDBs, pages 1022–1032.
Huang, Z. (1997). A fast clustering algorithm to cluster
very large categorical data sets in data mining. In
Research Issues on Data Mining and Knowledge Dis-
covery, pages 1–8.
Kim, D.-W., Lee, K. H., and Lee, D. (2004). Fuzzy cluster-
ing of categorical data using fuzzy centroids. Pattern
Recogn. Lett., 25(11):1263–1271.
Kossmann, D., Ramsak, F., and Rost, S. (2002). Shoot-
ing stars in the sky: an online algorithm for skyline
queries. In Proc. of the 28
th
Int. Conf. on VLDBs,
pages 275–286.
Ng, R. T. and Han, J. (1994). Efficient and effective clus-
tering methods for spatial data mining. In Proc. of the
20th Int. Conf. on VLDBs, pages 144–155, San Fran-
cisco, CA, USA. Morgan Kaufmann Pub. Inc.
Parmar, D., Wu, T., and Blackhurst, J. (2007). Mmr: An
algorithm for clustering categorical data using rough
set theory.
Podani, J. (2005). Multivariate exploratory analysis of ordi-
nal data in ecology: Pitfalls, problems and solutions.
Journal of Vegetation Science, 16(5):497–510.
Yin, X., Han, J., and Yu, P. S. (2007). Crossclus: user-
guided multi-relational clustering. Data Min. Knowl.
Discov., 15(3):321–348.
Zhang, T., Ramakrishnan, R., and Livny, M. (1996). Birch:
An efficient data clustering method for vldbs. In
Proc. of the ACM SIGMOD, pages 103–114, Mon-
treal, Canada.
Zicari, R. V., Hoebel, N., Kaufmann, S., and Tolle, K.
(2006). The design of gugubarra 2.0: A tool for build-
ing and managing profiles of web users. In Proc. of the
IEEE/WIC/ACM Int. Conf. on Web Intelligence, pages
317–320, Washington, DC, USA.