IMPROVING CASE RETRIEVAL PERFORMANCE THROUGH THE
USE OF CLUSTERING TECHNIQUES
Paulo Tomé
Department of Computing, Polytechnic of Viseu, Viseu, Portugal
Ernesto Costa
Department of Computing, University of Coimbra, Coimbra, Portugal
Luís Amaral
Department of Information Systems, University of Minho, Guimarães, Portugal
Keywords:
Case-Based Reasoning, Clustering Techniques, Case-Base Maintenance.
Abstract:
The performance of Case-Based Reasoning (CBR) systems is highly dependent on the performance of the retrieval phase. Usually, if the case memory contains a large number of cases, the system becomes very slow. Several mechanisms have been proposed to avoid a full search of the case memory during the retrieval phase. In this work we propose applying a clustering technique to the case memory. The technique is applied to an intermediate level of information that defines paths to the cases. Algorithms for the retrieval and retention phases are also presented.
1 INTRODUCTION
CBR is a method (Watson, 1999) that allows problems to be solved based on previously solved ones (Kolodner, 1993; Mantaras et al., 2006). Every CBR system comprises a retrieval phase, a reuse phase and a retention phase (Aamodt and Plaza, 1994). These phases have different impacts on the performance of the CBR system. According to Smyth and McKenna (Smyth and McKenna, 1999), the performance of a CBR system can be measured according to three criteria:
Efficiency - the average problem solving time;
Competence - the range of target problems that can be successfully solved;
Quality - the average quality of a proposed solution.
As noted by Smyth and McKenna (Smyth and McKenna, 1999), the retrieval process has always attracted the greatest interest of the CBR research community because it has a high influence on CBR system performance.
The retrieval process involves the combination of two procedures: similarity evaluation and searching the case memory. The first procedure judges the similarity of the current problem to the previously solved ones. If the case memory has a large number of cases, the number of similarity evaluations is large, which represents a significant computational burden.
Some authors have tried to improve the performance of the retrieval phase by using case memory structures. The aim of these structures is to organize the case memory in a way that enables fast case access, avoiding the similarity evaluation of all cases in memory. There are several proposals of case memory structures; for example, Kolodner (Kolodner, 1993) identifies four ways of organizing the case memory. Following those proposals, Schaaf (Schaaf, 1996) organizes cases in a network by considering each case as a polyhedron with a face for each aspect, and Wolverton (Wolverton, 1994) organizes the case memory in a semantic network, within which small subgraphs of nodes and links that represent aggregate concepts are explicitly grouped together as conceptual graphs. More recently, Yang and Wu (Yang and Wu, 2000) split the case base into a set of clusters
distributed over different sites/machines.
Furthermore, case memory structures are directly related to case-base maintenance (CBM) methods (Wilson and Leake, 2001). The objective of CBM is to maintain consistency, preserve competence and control case-base growth.
The previously presented approaches do not address the problem of missing case features. This problem is currently being addressed in the CBR research field (Gu and Aamodt, 2005; Gu and Aamodt, 2006).
We propose the use of clustering techniques to improve the performance of CBR systems during the retrieval phase. However, we take a different approach from Yang and Wu (Yang and Wu, 2000): we do not split the case memory into distinct locations; instead, we use clusters of links to cases. Besides that, our proposal also deals with cases with missing features. In Section 2 we review some clustering concepts essential to the understanding of our proposal, which is presented in Section 3 together with the results of its application to a CBR system with 915 cases.
2 THE CLUSTERING
TECHNIQUE
Clustering techniques organize data into groups that are meaningful, useful or both (Tan et al., 2006). One group of data is called a cluster, while the entire collection of clusters is commonly referred to as a clustering.
Two types of clustering can be considered: partitional and hierarchical. A clustering is hierarchical if we permit clusters to have subclusters; in partitional clustering the data is divided into non-overlapping clusters. Tan et al. (Tan et al., 2006) identify five types of clusters: well-separated, prototype-based, graph-based, density-based and shared-property. In our work we consider the prototype-based type: a set of cases is grouped into a cluster with one representative element, so that in the retrieval phase the number of similarity evaluations is reduced considerably.
There are several techniques to partition the data; k-means and k-medoid are two of the most prominent prototype-based techniques (Tan et al., 2006). K-means defines a prototype in terms of a centroid, while k-medoid defines a prototype in terms of a medoid. The medoid is an element of the cluster, whereas the centroid is the mean of the cluster. In our work we use the k-medoid technique, so each cluster is represented by the most representative case among all cases in the group. There are also different proposals to measure the similarity between data items: the Euclidean and cosine distances are the most used measures. The similarity measure is used whenever a new case has to be added to the case memory; naturally, the updated cluster then needs its prototype to be recomputed.
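To make the prototype-based idea concrete, the following is a minimal Python sketch, under our own illustrative assumptions (plain numeric feature vectors and a Euclidean-based similarity), of the two operations this proposal relies on: assigning a new item to the cluster with the most similar medoid, and recomputing the medoid of the updated cluster. It sketches the general k-medoid idea, not the exact implementation used in this work.

    from math import sqrt

    def similarity(a, b):
        # Assumed similarity: inverse of the Euclidean distance (higher = more similar).
        return 1.0 / (1.0 + sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))

    def nearest_cluster(item, clusters):
        # Only the medoids are compared with the new item.
        return max(clusters, key=lambda c: similarity(item, c["medoid"]))

    def recompute_medoid(cluster):
        # The medoid is the member with the highest total similarity to the other members.
        members = cluster["members"]
        cluster["medoid"] = max(
            members, key=lambda m: sum(similarity(m, other) for other in members)
        )

    # Usage: insert a new item and refresh the prototype of the affected cluster.
    clusters = [
        {"medoid": (1.0, 1.0), "members": [(1.0, 1.0), (1.2, 0.9)]},
        {"medoid": (5.0, 5.0), "members": [(5.0, 5.0), (4.8, 5.1)]},
    ]
    new_item = (1.1, 1.1)
    target = nearest_cluster(new_item, clusters)
    target["members"].append(new_item)
    recompute_medoid(target)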
3 THE APPLICATION OF
CLUSTERING TECHNIQUES
TO THE CBR RETRIEVAL
PHASE
Our proposal, shown in Figure 1, has two levels of information. The first level is formed by a set of links to the case memory and the second level is the case memory database. The case links are paths to cases in the case memory, and the clustering technique is applied to this case link information. The first level of information requires a small amount of storage space while decreasing the waiting time of the retrieval process. We did not consider dividing the case memory database itself, because it is useful to be able to access a case in different ways. Figure 1 illustrates the storage scheme; the groups of clusters are the interface between the CBR processes and the database of cases.
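To make the two-level organization concrete, here is a minimal Python sketch of the data structures it suggests: a cluster holding a link to its medoid case and links to its member cases (as in Table 1), and a dictionary that maps each group's binary-array address to its clusters. The names (Cluster, case_memory, groups) and the dictionary layout are illustrative assumptions, not the paper's implementation.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Cluster:
        medoid_link: str                                      # link (key) to the medoid case
        case_links: List[str] = field(default_factory=list)   # links to the member cases

    # Second level: the case memory database, keyed by case identifier.
    case_memory: Dict[str, dict] = {
        "case-001": {"objective": "O1",
                     "characteristics": {"C1": 2, "C2": 5, "C3": 1},
                     "solution": None},
    }

    # First level: groups of clusters of case links, keyed by the binary-array
    # address that encodes which objective and characteristics are present.
    groups: Dict[str, List[Cluster]] = {
        "0000000001110001": [Cluster(medoid_link="case-001", case_links=["case-001"])],
    }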
Figure 1: CBR system structure (groups of clusters of case links connect the retrieval, reuse, revision and retention processes to the case memory).
Each group has clusters of links to cases, and each cluster, as shown in Table 1, has a reference to the medoid of the cluster and links to the set of cases that constitute the cluster.
Each group of clusters is identified by a binary array codification. The binary codification scheme follows the proposal of Kolodner (Table 2), who defines that a case is formed by a Problem and a Solution, and that the Problem consists of an Objective and a set of Characteristics.
Table 1: Cluster definition.
Clus = <LMed, SCl>
LMed = link to case
SCl = {link to case}
Where:
Clus - Cluster
LMed - Link to medoid
SCl - Set of cluster links
Table 2: Case structure.
Case = <P, S>
P = <O, Cs>
Cs = {C}
Where:
P - Problem
O - Objective
Cs - Set of characteristics
C - Characteristic
Table 3: Case example.
Cas_1 = <P, S>
P = <O_1, Cs>
Cs = <C_1, C_2, C_3>
Table 3 shows a case with three Characteristics. Each position of the binary array is associated with a particular feature of a case, where 1 (0) indicates the availability (non-availability) of that feature. The first positions on the right side of the array are used to represent objectives; the remaining positions are used to represent characteristics. For example, using sixteen bits with the division illustrated in Table 4, the case Cas_1 shown in Table 3 addresses the group with the following binary array: 0000000001110001.
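As a concrete illustration of this addressing scheme, the following Python sketch builds the sixteen-bit binary array from an objective index and the indices of the available characteristics, with the objective bits occupying the rightmost positions as in Table 4. The function name and the bit-width parameters are our own illustrative assumptions.

    def group_address(objective_idx, characteristic_idxs, n_objective_bits=4, n_bits=16):
        # The rightmost n_objective_bits positions encode the objective (O_1 is the
        # rightmost bit); the remaining positions encode the characteristics (C_1 is
        # the first bit to the left of the objective block).
        bits = ["0"] * n_bits
        bits[n_bits - objective_idx] = "1"                     # objective O_k
        for i in characteristic_idxs:                          # characteristic C_i
            bits[n_bits - (n_objective_bits + i)] = "1"
        return "".join(bits)

    # Cas_1 has objective O_1 and characteristics C_1, C_2 and C_3.
    assert group_address(1, [1, 2, 3]) == "0000000001110001"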
Table 4: Addressing group cluster.
C_12 C_11 C_10 C_9 C_8 C_7 C_6 C_5 C_4 C_3 C_2 C_1 O_4 O_3 O_2 O_1
However, to deal with missing characteristics, each case belongs to more than one group: every combination of the available characteristics with the objective defines a different group. Table 5 presents the combinations of characteristics and objective for the case Cas_1. The case Cas_1 is therefore associated with seven groups (Figure 2); in each group, clustering may result in a distinct number of clusters.
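The groups to which a case belongs can be enumerated by combining its objective with every non-empty subset of its available characteristics; for Cas_1 (objective O_1, characteristics C_1, C_2 and C_3) this yields the seven addresses of Table 5. A short Python sketch of this enumeration, reusing the hypothetical group_address helper shown earlier:

    from itertools import combinations

    def group_addresses(objective_idx, characteristic_idxs):
        # One group per non-empty subset of the available characteristics,
        # always combined with the case's objective.
        addresses = []
        for size in range(1, len(characteristic_idxs) + 1):
            for subset in combinations(characteristic_idxs, size):
                addresses.append(group_address(objective_idx, subset))
        return addresses

    # Cas_1: objective O_1 with characteristics C_1, C_2, C_3 -> seven addresses.
    print(group_addresses(1, [1, 2, 3]))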
Table 5: Combinations example for the case Cas_1.
C_12 C_11 C_10 C_9 C_8 C_7 C_6 C_5 C_4 C_3 C_2 C_1 O_4 O_3 O_2 O_1 | Id
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 | Co_1
0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 | Co_2
0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 | Co_3
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 | Co_4
0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 | Co_5
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 | Co_6
0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 | Co_7
Figure 2: Example of a database of groups of case links (one group of clusters for each combination Co_1 to Co_7).
The retrieval process was modified to follow the principles described above. Table 6 shows the steps of the algorithm when a new problem is presented: 1) identification of the clustering group; 2) identification of the cluster within that group, using a similarity measure; 3) comparison of the new case with all cases in the selected cluster. In the second step, the Default Difference measure strategy (Bogaerts and Leake, 2004) is applied only to the medoids of the clusters. In the third step, different similarity evaluations are used.
The retention process algorithm, shown in Table 7, was also redefined. This process was parallelized, i.e. the retention in each group of clusters is carried out by a different program process. The process begins with the determination of all possible combinations of the available characteristics with the objective of the case. Then, for each combination, the case is inserted into the respective group of clusters; these insertions run in parallel. Each insertion into a group involves: 1) evaluation of the similarity with the cluster medoids; 2) identification of the cluster in which to insert the case; 3) update of the medoid of the cluster to which the case was assigned. The similarity measure strategy used in step two is also Default Difference. Within a group, a new cluster is created whenever a binary similarity evaluation results in zero.
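The paper describes one program process per group of clusters but gives no further implementation detail, so the following Python sketch only illustrates the parallel fan-out of a retention request over the affected groups; the per-group insertion itself is left as a placeholder, and the thread pool, the groups dictionary and the helper names are our own assumptions.

    from concurrent.futures import ThreadPoolExecutor

    def insert_into_group(groups, address, case_link):
        # Placeholder for the per-group insertion: the real step compares the case
        # with each cluster medoid (Default Difference), inserts it into the most
        # similar cluster, updates that cluster's medoid, and opens a new cluster
        # when every similarity evaluation results in zero.
        groups.setdefault(address, []).append(case_link)

    def retain(groups, case_link, addresses):
        # One task per group of clusters; the addresses of a single case are all
        # distinct, so the tasks touch disjoint entries of the groups dictionary.
        with ThreadPoolExecutor() as pool:
            for address in addresses:
                pool.submit(insert_into_group, groups, address, case_link)
        # Leaving the with-block waits for every insertion to complete.

    groups = {}
    retain(groups, "case-001", ["0000000000010001", "0000000001110001"])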
The medoid is computed using the expression

pos_med = Σ_{i=1..nfeatures} pos(value_of_feature(i)) × weight(feature(i))    (1)

where pos(value_of_feature(i)) is the position of the feature value in an ordered set of values and weight(feature(i)) is the weight of feature(i).
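A short Python sketch of expression (1) as read here: the weighted sum of the positions of each feature's value in its ordered value set. The data layout (dictionaries of ordered value lists and of weights) is an assumption made for illustration only.

    def pos_med(case_features, ordered_values, weights):
        # Weighted sum of value positions, as in expression (1).
        total = 0.0
        for name, value in case_features.items():
            position = ordered_values[name].index(value) + 1   # 1-based position
            total += position * weights[name]
        return total

    # Example with three hypothetical characteristics.
    ordered_values = {"C1": ["low", "medium", "high"], "C2": [1, 2, 3], "C3": ["a", "b"]}
    weights = {"C1": 0.5, "C2": 0.3, "C3": 0.2}
    print(pos_med({"C1": "high", "C2": 2, "C3": "a"}, ordered_values, weights))  # 2.3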
Table 6: Retrieval Algorithm.
/* ----------------------------------------------------------
   Cas is the case for which a solution is being searched
   Prop_cas is the proposed case
   ---------------------------------------------------------- */
procedure retrieval(Cas in Case, Prop_cas out Case)
  Clusgroup : ClusterGroup;
  Clus      : Cluster;
  Sim       : Similarity;
  Sim_a     : Similarity;
begin
  Clusgroup := Determine_cluster_group(Cas.Obj, Cas.Cars);
  Clus := null;
  Sim  := 0;
  for each Cluster(i) in Clusgroup do
  begin
    Sim_a := Similarity(Cas, Cluster(i).medoid);
    if Sim_a > Sim then
    begin
      Sim  := Sim_a;
      Clus := Cluster(i);
    end;
  end;
  Sim := 0;
  for each Case(i) in Clus do
  begin
    Sim_a := Similarity(Cas, Case(i));
    if Sim_a > Sim then
    begin
      Sim      := Sim_a;
      Prop_cas := Case(i);
    end;
  end;
end;
Figure 3: Waiting time before the application of the clustering techniques (recall time versus number of cases).
We compared the described strategy with a flat case memory for a CBR system with 915 cases. As can be seen in Figure 3, the system showed poor performance after the insertion of three hundred cases into the case memory.
Table 7: Retention Algorithm.
/* ----------------------------------------------------------
   Cas is the case that will be retained
   ---------------------------------------------------------- */
procedure retention(Cas in Case)
  Combs       : Combinations;
  Combination : TCombination;
  Clusgroup   : ClusterGroup;
  Clus        : Cluster;
  Sim         : Similarity;
  Sim_a       : Similarity;
begin
  Combs := Generate_all_combinations(Cas.Obj, Cas.Cars);
  for each Combination(i) in Combs do
  begin
    Clusgroup := Determine_clus_group(Combination(i));
    Clus := null;
    Sim  := 0;
    for each Cluster(j) in Clusgroup do
    begin
      Sim_a := Similarity(Cas, Cluster(j).medoid);
      if Sim_a > Sim then
      begin
        Sim  := Sim_a;
        Clus := Cluster(j);
      end;
    end;
    insert_case_cluster(Clus, Cas);
    recalculate_medoid(Clus);
  end;
end;
Figure 4: Waiting time after the application of the clustering techniques (retention waiting time versus number of cases).
After the application of the clustering approach, the waiting time for each of the 915 cases was less than 1 second. However, the retention process takes more time than with a flat case memory; Figure 4 shows the retention waiting time for all 915 cases. Even so, the total waiting time is smaller after the application of the clustering techniques, as shown in Figure 4.
4 CONCLUSIONS
This paper presents an approach that can be used to improve the performance of the retrieval phase of CBR systems. The approach uses clustering techniques to
structure the case memory. The use of clustering techniques defines a particular case memory structure, and consequently algorithms for the retrieval and retention phases are proposed.
The approach was tested in a CBR system with 915 cases. The results show that the overall system performance is improved. It is important to note that the retrieval waiting time was considerably reduced and that the total waiting time (retrieval plus retention time) is also substantially smaller than with a flat case memory organization.
REFERENCES
Aamodt, A. and Plaza, E. (1994). Case-based reasoning:
Foundational issues, methodological variations and
systems approaches. AI-Communications, 7(1):39–
52.
Bogaerts, S. and Leake, D. (2004). Facilitating CBR for incompletely-described cases: Distance metrics for partial problem descriptions. In ECCBR 2004, pages 62–74.
Gu, M. and Aamodt, A. (2005). A knowledge-intensive method for conversational CBR. In 6th International Conference on Case-Based Reasoning, pages 296–311. Springer-Verlag.
Gu, M. and Aamodt, A. (2006). Dialog learning in conversational CBR. In 19th International FLAIRS Conference, pages 358–363, Florida, USA. AAAI Press.
Kolodner, J. (1993). Case-Based Reasoning. Morgan Kauf-
mann Publishers.
Mantaras, R. L., Mcsherry, D., Bridge, D., Leake, D.,
Smyth, B., Craw, S., Faltings, B., Maher, M. L., Cox,
M. T., Forbus, K., Keane, M., Aamodt, A., and Wat-
son, I. (2006). Retrieval, reuse, revision and retention
in case-based reasoning. The Knowledge Engineering
Review, 20(3):215–240.
Schaaf, J. W. (1996). Fish and Shrink: A next step towards efficient case retrieval in large scale case bases. In Smith, I. and Faltings, B., editors, Advances in Case-Based Reasoning, pages 362–376. Springer-Verlag.
Smyth, B. and McKenna, E. (1999). Footprint-based re-
trieval. In Third International Conference on Case-
Based Reasoning, pages 343–357, Munich, Germany.
Springer Verlag.
Tan, P. N., Steinbach, M., and Kumar, V. (2006). Introduc-
tion to Data Mining. Pearson Education.
Watson, I. (1999). CBR is a methodology not a technology.
Knowledge Based Systems Journal, 12(5-6).
Wilson, D. C. and Leake, D. (2001). Maintaining case-
based reasoners: Dimensions and directions. Com-
putational Intelligence, 17(2):196–213.
Wolverton, M. (1994). Retrieving Semantically Distant
Analogies. PhD thesis, Stanford University.
Yang, Q. and Wu, J. (2000). Keep it simple: A case-base
maintenance policy based on clustering and informa-
tion theory. In Canadian AI 2000, pages 102–114.
Springer-Verlag.