COMBINING TWO LAZY LEARNING METHODS

FOR CLASSIFICATION AND KNOWLEDGE DISCOVERY

A Case Study for Malignant Melanoma Diagnosis

Eva Armengol

and Susana Puig

Artiﬁcial Intelligence Research Institute (IIIA-CSIC) Bellaterra, Catalonia, Spain

Melanoma Unit, Dermatology Department, IDIBAPS, UT26 CIBERER, ISCIII

Hospital Clínic i Provincial de Barcelona, Barcelona, Spain

Keywords:

Machine learning, Lazy learning methods, Knowledge discovery, Classiﬁcation, Medical diagnosis.

Abstract:

The goal of this paper is to construct a classiﬁer for diagnosing malignant melanoma. We experimented with

two lazy learning methods, k-NN and LID, and compared their results with the ones produced by decision trees.

We performed this comparison because we are also interested in building a domain model that can serve as a basis for dermatologists to propose a good characterization of early melanomas. We show that lazy learning methods perform better than decision trees in terms of sensitivity and specificity. We have seen

that both lazy learning methods produce complementary results (k-NN has high speciﬁcity and LID has high

sensitivity) suggesting that a combination of both could be a good classiﬁer. We report experiments conﬁrming

this point. Concerning the construction of a domain model, we propose to use the explanations provided by

the lazy learning methods, and we see that the resulting theory is as predictive and useful as the one obtained

from decision trees.

1 INTRODUCTION

Malignant melanoma (MM) is the second most frequent kind of cancer among people between 15 and 34 years old. In the last thirty years the incidence of MM has increased more rapidly than that of other kinds of cancer. Many studies show that early detection of MM increases the survival rate: when tumors are thin, the lesion can be excised and survival is around 95% after 5 years. However, when the tumor has spread to the nodes, the risk of metastases increases and thus the survival rate decreases. The early diagnosis of melanoma is a difficult task that dermatologists face every day. When a lesion is suspected of being a melanoma, it is removed and the final diagnosis is based on histopathology criteria.

The clinical diagnosis of MM is based on the ABCD rule, which takes into account the asymmetry, border irregularity, color and diameter of the lesion. Although the ABCD rule has proved effective for early diagnosis, more accurate methods are needed to correctly diagnose lesions that do not present clear malignant characteristics. It is important that a dermatologist can detect suspicious skin lesions during a clinical session; therefore it would be very useful to have a clear and easy characterization of MM in its early stages. Dermoscopy is a non-invasive technique introduced by dermatologists two decades ago. This technique provides a more accurate evaluation of skin lesions and can therefore avoid the excision of lesions that are benign. Consequently, dermatologists need to achieve a good dermatoscopic classification of lesions prior to extraction (Puig et al., 2007). In (Hofmann-Wellenhof et al., 2002) a classification of benign melanocytic lesions is suggested. In (Argenziano et al., 2007) the authors hypothesize that dermoscopic classifications may be better than the classical clinicopathological classifications of benign melanocytic lesions (nevi). Dermoscopy improves the accuracy of melanoma diagnosis by nearly 25%. However, some benign lesions may mimic melanoma and some melanomas may resemble benign lesions; consequently, many unnecessary excisions are performed. It is estimated that, of 30 lesions excised by non-expert dermatologists, only one is MM. When dermatologists have high expertise, the ratio drops to one MM for every 4 excisions. Reflectance confocal microscopy is a new non-invasive diagnostic technique that allows the visualization of


Armengol E. and Puig S..

COMBINING TWO LAZY LEARNING METHODS FOR CLASSIFICATION AND KNOWLEDGE DISCOVERY - A Case Study for Malignant Melanoma

Diagnosis.

DOI: 10.5220/0003652201920199

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2011), pages 192-199

ISBN: 978-989-8425-79-9

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

skin cells in vivo. This technique also increases the

accuracy of the experts’ diagnosis but even in the

hands of experts and in combination with dermoscopy

information, accuracy never reaches 100%.

Thus, we are especially interested in characterizing skin lesions on the frontier between malignant and benign lesions. In our experiments we used descriptions of skin lesions that have already been excised, i.e., lesions that dermatologists considered potentially malignant melanoma. However, after histopathological analysis some of them turned out to be benign. They therefore provide a good set of suspicious lesions from which to generate a domain model able to discriminate between malignant and benign lesions with similar characteristics. We propose to take descriptions of known skin lesions and to use a lazy learning method to obtain a domain theory. Skin lesions are described using two sets of features, dermatoscopic and confocal, and our goal is to find a subset of features characterizing malignant lesions.

Several works address the automatic diagnosis of malignant melanoma. MELAFIND (medgadget.com/archives/2005/08/melafind system.html) is a device designed to determine whether skin moles and lesions are malignant. It uses a database of around 6000 already biopsied lesions to find similarities with a new potentially malignant skin lesion. In (Vestergaard and Menzies, 2008) there is an interesting comparison of the performance of several automatic instruments with that of human experts. The main conclusion of this comparison is that no automatic method clearly outperforms human performance. All these automatic instruments have a different goal from our approach, since they aim to take the role of dermatologists and to analyze and interpret an image of a skin lesion in order to diagnose it. In our work, the goal is not to diagnose from an image but from the interpretation of an image given by a dermatologist. In fact, we do not want to take the dermatologist's role but to support dermatologists in diagnosing a skin lesion.

In the present paper we introduce a classification system that, using lazy learning methods, is able to distinguish MM from similar benign skin lesions. The main goal is to minimize the number of MM diagnosed as benign and to maximize the number of MM correctly diagnosed, although we have to accept a reasonable number of false positives. In other words, we want primarily to achieve high sensitivity and, secondarily, to obtain a specificity as high as possible.

In domains such as the current one, it can be especially useful, in addition to classifying a new problem, to generate some kind of explanation of the domain model. Domain models are usually built automatically using inductive learning methods (Mitchell, 1997) that generalize the input data to generate a model (or domain theory) that can be useful in the future to classify unseen data. Inductive learning methods can overgeneralize when solution classes are not clearly separated. This means that, although the model fits the known data, it fails in the classification of unseen objects. An example is the domain of predictive toxicology where, from the description of carcinogenic and non-carcinogenic chemical compounds, the goal is to find a model for carcinogenesis (Helma and Kramer, 2003). The difficulty in that domain is that there are chemical compounds with a very similar chemical structure but different carcinogenic activity. A similar situation occurs in the characterization of skin lesions, since early malignant melanoma can share many characteristics with benign lesions and, therefore, a dermatologist can easily confuse them.

A different approach for classifying unseen examples is to use some lazy learning method (instance-based, case-based reasoning, etc.). In this approach, a new problem is classified as belonging to a class by assessing its similarity with a set of known examples. Lazy learning methods are good classifiers but they do not produce explicit generalizations and therefore no domain knowledge can be built from them. Currently there is a growing research line that focuses on explaining the result of lazy learning methods (see for instance (Roth-Berghofer, 2004; Plaza et al., 2005) and the proceedings of the workshops on Explanation-aware Computing held since 2004). In (Armengol, 2008) we pointed out that if we could generate some explicit generalization of the classification process from a lazy learning method, we could generate a domain theory. These generalizations can be seen as local approximations and, by storing them, we obtain a model of the domain. Notice that this domain theory is not complete, since it only describes some areas of the problem space (those around the problems already solved). Consequently, explanations of a lazy learning method can be used for knowledge discovery. In some sense, this is the same idea as that of explanation-based learning methods (Mitchell et al., 1986), which generate domain rules from one example.

We experimented with two lazy learning methods: the well-known k-NN method and the LID method (Armengol and Plaza, 2001). We compare their predictivity results with those produced using a decision tree, and we show that the lazy methods perform better than the decision tree in terms of sensitivity and specificity. From the experiments we also constructed a domain theory that has been very useful to dermatologists in confirming a subset of relevant features characterizing early MM. Unexpectedly, we have seen that the two lazy methods, k-NN and LID, exhibit almost complementary performance: k-NN has high specificity (i.e., very few false positives) but a low number of true positives (low sensitivity) and, in addition, it classifies many lesions as suspicious; LID has a good level of sensitivity but its specificity is clearly lower than that of k-NN. These results suggest that a combination of both methods could perform better, i.e., produce both high specificity and high sensitivity. In the current paper, we present the results of experimenting with a combination of both methods.

The paper is organized as follows. Sections 2 and 3 describe the lazy learning methods k-NN and LID. Section 4 reports the experiments carried out on a database of early melanomas and analyzes the predictivity of the methods. Section 5 discusses how to perform knowledge discovery with lazy learning methods. Finally, there is a section with conclusions and future work.

2 THE K-NN ALGORITHM

A common method for classiﬁcation is the k nearest

neighbor (k-NN in short). This method is based on

the idea that similar objects have similar classiﬁca-

tion. Given a training set S and a problem p to clas-

sify, the k-NN algorithm is composed of the following

steps:

• 1) To assess the distance between p and each one of the elements of S;

• 2) To take the subset $S_k$ of S composed of the k elements of S having the closest distance to p;

• 3) To classify p as belonging to the class of the majority of elements in $S_k$.

The key point of this algorithm is the distance measure used to compare the objects. Several distances are commonly used (for instance, Euclidean, Minkowski, Mahalanobis, etc.) and the user has to evaluate the most appropriate one for the domain at hand. Let us suppose that domain objects are described as tuples of n attributes $a_1, \ldots, a_n$. When the values of these attributes are symbolic, the usual way to assess the distance between two domain objects A and B is

$$D(A, B) = \sum_{i=1}^{n} d(A.a_i, B.a_i)$$

where $A.a_i$ and $B.a_i$ are the values that the attribute $a_i$ takes in A and B respectively, and $d(A.a_i, B.a_i)$ is defined as follows:

$$d(A.a_i, B.a_i) = \begin{cases} 1 & \text{if } A.a_i \neq B.a_i \\ 0 & \text{otherwise} \end{cases}$$
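The three k-NN steps with this symbolic distance can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes that d is 1 when two attribute values differ and 0 when they match (the usual convention for a distance), it ignores unknown values, and the attribute values and the names `overlap_distance` and `knn_classify` are hypothetical.

```python
from collections import Counter


def overlap_distance(a, b):
    """D(A, B): number of attributes on which two symbolic tuples differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(p, training_set, k=3):
    """Classify p by majority vote among its k nearest cases.

    training_set is a list of (attribute_tuple, class_label) pairs.
    """
    # Step 1: distance from p to every element of S (used as the sort key).
    ranked = sorted(training_set, key=lambda case: overlap_distance(p, case[0]))
    # Step 2: keep the k closest elements.
    nearest = ranked[:k]
    # Step 3: majority class among them.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy cases: (pagetoid infiltration, cell form, papilla shape).
cases = [
    (("yes", "dendritic", "regular"), "benign"),
    (("yes", "roundish", "irregular"), "MM"),
    (("no", "none", "regular"), "benign"),
    (("yes", "roundish", "regular"), "MM"),
]
print(knn_classify(("yes", "roundish", "irregular"), cases, k=3))
```

Sorting the whole training set is the simplest choice for a case base of a few hundred lesions; a larger case base would call for a partial selection of the k smallest distances.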

Function LID (p, S_{D_i}, D_i, C)
    if stopping-condition(S_{D_i})
        then return class(S_{D_i})
        else f_d := Select-attribute (p, S_{D_i}, C)
             D_{i+1} := Add-attribute(f_d, D_i)
             S_{D_{i+1}} := Discriminatory-set (D_{i+1}, S_{D_i})
             LID (p, S_{D_{i+1}}, D_{i+1}, C)
    end-if
end-function

Figure 1: The LID algorithm: p is the problem to be solved, $D_i$ is the similitude term, $S_{D_i}$ is the discriminatory set associated with $D_i$, C is the set of solution classes, class($S_{D_i}$) is the class $C_i \in C$ to which all elements in $S_{D_i}$ belong.

3 THE LAZY INDUCTION OF DESCRIPTIONS METHOD

Lazy Induction of Descriptions (LID) is a lazy learning

method for classiﬁcation tasks. LID determines which

are the most relevant attributes of a new problem and

searches in a case base for cases sharing these relevant

attributes. The problem is classiﬁed when LID ﬁnds a

set of relevant attributes whose values are shared by a

subset of cases all of them belonging to a same class.

The description formed by these relevant features is

called similitude term and the set of cases satisfying

the similitude term is called discriminatory set.

Given a problem p to solve, the LID algorithm (Fig. 1) initializes $D_0$ as a description with no attributes, the discriminatory set $S_{D_0}$ as the set of cases satisfying $D_0$, i.e., all the available cases, and C as the set of solution classes into which the known cases are classified. Let $D_i$ be the current similitude term and $S_{D_i}$ be the set of all the cases satisfying $D_i$. When the stopping condition of LID is not satisfied, the next step is to select an attribute for specializing $D_i$.

The specialization of $D_i$ is achieved by adding attributes to it. Given a set F of candidate attributes to specialize $D_i$, LID selects the most discriminatory attribute in F using a distance measure. This distance is used to compare each partition $P_f$ induced on $S_{D_i}$ by an attribute f with the correct partition $P_c$. The correct partition has as many sets as solution classes. Each attribute f ∈ F induces in $S_{D_i}$ a partition $P_f$ with as many sets as the number of different values that f takes in the cases contained in $S_{D_i}$. Given a distance measure Δ and two attributes f and g inducing respectively partitions $P_f$ and $P_g$, we say that f is more discriminatory than g iff $\Delta(P_f, P_c) < \Delta(P_g, P_c)$. This means that the partition $P_f$ is closer to the correct partition than the partition $P_g$.

Let $f_d$ be the most discriminatory attribute in F. The specialization of $D_i$ defines a new similitude term $D_{i+1}$ by adding to $D_i$ the attribute $f_d$. The new similitude term $D_{i+1} = D_i \cup \{f_d\}$ is satisfied by a subset of cases in $S_{D_i}$, namely $S_{D_{i+1}}$. Next, LID is recursively called with $S_{D_{i+1}}$ and $D_{i+1}$. The recursive call of LID has $S_{D_{i+1}}$ instead of $S_{D_i}$ because the cases that do not satisfy $D_{i+1}$ will not satisfy any further specialization. Notice that the specialization reduces the discriminatory set at each step, i.e., we get a sequence $S_{D_n} \subseteq S_{D_{n-1}} \subseteq \ldots \subseteq S_{D_0}$.

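The selection of the most discriminatory attribute is heuristically done using the LM distance (López de Mántaras, 1991) over the candidate attributes. Let us recall its definition: Let X be a finite set of objects; $\mathcal{P} = \{P_1, \ldots, P_n\}$ be a partition of X in n sets; and $\mathcal{Q} = \{Q_1, \ldots, Q_m\}$ be a partition of X in m sets. The LM distance between them is computed as follows:

$$LM(\mathcal{P}, \mathcal{Q}) = 2 - \frac{I(\mathcal{P}) + I(\mathcal{Q})}{I(\mathcal{P} \cap \mathcal{Q})}$$

where

$$I(\mathcal{P}) = -\sum_{i=1}^{n} p_i \log_2 p_i; \quad p_i = \frac{|P_i|}{|X|}$$

$$I(\mathcal{Q}) = -\sum_{j=1}^{m} p_j \log_2 p_j; \quad p_j = \frac{|Q_j|}{|X|}$$

$$I(\mathcal{P} \cap \mathcal{Q}) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij} \log_2 p_{ij}; \quad p_{ij} = \frac{|P_i \cap Q_j|}{|X|}$$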

Given a partition P on a set X, I(P ) is the average

information of P and it measures the randomness of

the distribution of elements of X over the n classes of

the partition. The quantity represented by I(P ∩ Q ) is

the mutual average information of the intersection of

the partitions P and Q .
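A small worked computation may make the LM distance more concrete. The sketch below represents each partition as a list of block labels, one per object of X; this encoding, the function names, and the convention of returning 0 when the joint information is 0 (both partitions consist of a single block) are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def information(block_sizes, total):
    """Average information of a partition, given the sizes of its blocks."""
    return -sum((s / total) * log2(s / total) for s in block_sizes if s > 0)


def lm_distance(labels_p, labels_q):
    """LM distance between two partitions of the same set of objects.

    labels_p[i] and labels_q[i] name the block that object i belongs to
    in the first and in the second partition respectively.
    """
    total = len(labels_p)
    i_p = information(Counter(labels_p).values(), total)
    i_q = information(Counter(labels_q).values(), total)
    # Blocks of the intersection are the non-empty pairs (P_i, Q_j).
    i_pq = information(Counter(zip(labels_p, labels_q)).values(), total)
    if i_pq == 0:
        return 0.0  # assumed convention: two single-block partitions coincide
    return 2 - (i_p + i_q) / i_pq


# Identical partitions are at distance 0.
print(lm_distance(["a", "a", "b", "b"], ["x", "x", "y", "y"]))
# Partitions that carry no information about each other are at distance 1.
print(lm_distance(["a", "a", "b", "b"], ["x", "y", "x", "y"]))
```

In the first call every term equals 1 bit, so the distance is 2 − (1 + 1)/1 = 0; in the second the joint information is 2 bits, giving 2 − (1 + 1)/2 = 1.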

LID has two stopping situations: 1) all the cases in the discriminatory set $S_{D_i}$ belong to the same solution class $C_i$, or 2) there is no attribute allowing the specialization of the similitude term. When stopping condition 1) is satisfied, p is classified as belonging to $C_i$. When stopping condition 2) is satisfied, $S_{D_i}$ contains cases from several classes; in this situation the majority criterion is applied, and p is classified in the class of the majority of cases in $S_{D_i}$. The outcome of LID is a class and the last similitude term, which justifies the proposed classification of the new problem.
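The whole procedure can be sketched in a few lines. This is an illustrative reading of the description above, not the authors' implementation: it uses a loop instead of recursion, represents cases as dictionaries, only considers attributes whose value in p is shared by at least one case of the current discriminatory set (one possible interpretation of "no attribute allowing the specialization"), and assumes a non-empty case base. All names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def information(block_sizes, total):
    return -sum((s / total) * log2(s / total) for s in block_sizes if s > 0)


def lm_distance(labels_p, labels_q):
    total = len(labels_p)
    i_pq = information(Counter(zip(labels_p, labels_q)).values(), total)
    if i_pq == 0:
        return 0.0
    i_p = information(Counter(labels_p).values(), total)
    i_q = information(Counter(labels_q).values(), total)
    return 2 - (i_p + i_q) / i_pq


def lid(p, cases, attributes):
    """Return (class, similitude term) for problem p.

    cases is a list of (description dict, class label) pairs.
    """
    term = {}                  # D_0: description with no attributes
    disc = list(cases)         # S_{D_0}: all available cases
    remaining = [f for f in attributes if p.get(f) is not None]
    # Stopping condition 1: all cases of the discriminatory set share a class.
    while len({label for _, label in disc}) > 1:
        usable = [f for f in remaining
                  if any(d.get(f) == p[f] for d, _ in disc)]
        if not usable:         # stopping condition 2
            break
        correct = [label for _, label in disc]
        # Most discriminatory attribute: partition closest to the correct one.
        best = min(usable, key=lambda f: lm_distance(
            [d.get(f) for d, _ in disc], correct))
        remaining.remove(best)
        term[best] = p[best]   # D_{i+1} = D_i plus the selected attribute
        disc = [(d, label) for d, label in disc if d.get(best) == p[best]]
    majority = Counter(label for _, label in disc).most_common(1)[0][0]
    return majority, term


cases = [
    ({"pagetoid": "yes", "pg_form": "roundish", "dp_irregular": "1"}, "MM"),
    ({"pagetoid": "yes", "pg_form": "dendritic", "dp_irregular": "0"}, "benign"),
    ({"pagetoid": "no", "pg_form": "none", "dp_irregular": "0"}, "benign"),
    ({"pagetoid": "yes", "pg_form": "roundish", "dp_irregular": "0"}, "MM"),
    ({"pagetoid": "no", "pg_form": "none", "dp_irregular": "1"}, "benign"),
]
problem = {"pagetoid": "yes", "pg_form": "roundish", "dp_irregular": "0"}
print(lid(problem, cases, ["pagetoid", "pg_form", "dp_irregular"]))
```

On this toy case base the partition induced by `pg_form` is the closest to the correct partition, and the two cases sharing the value `roundish` are both MM, so the similitude term contains that single attribute.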

4 EXPERIMENTS

We carried out experiments with a database containing descriptions of 192 skin lesions, of which 50 are malignant melanoma (MM) and 142 are benign. These lesions are described by 11 dermatoscopic attributes and 22 confocal attributes. All attributes take symbolic values, although some attributes may have an unknown value. We conducted 100 experiments, each consisting of 1) the random generation of a training set and a test set, and 2) the use of both k-NN and LID for classifying the objects in the test set. We prefer this kind of evaluation to 10-fold cross-validation because we also want to analyze the performance of the methods on training sets of different sizes. The training set size ranges from 73 cases (the smallest randomly generated) to 164 cases (the largest randomly generated), with a mean size over the 100 experiments of 77 cases. We also compare the results with those produced by decision trees generated using the J48 algorithm of the Weka platform (Witten et al., 1999).

Concerning the k-NN method, we experimented with several values of k, namely 3, 5 and 7. Preliminary experiments showed that the best results are those produced with k = 3. In fact, the number of misclassifications is approximately the same for all these values of k (around 2.5%); the main difference is that higher values of k produce a higher number of multiple classifications. For this reason, the results shown in the current paper are those obtained using k = 3.

For each object in the test set the outcome may be one class (benign or MM) or a multiple solution (i.e., both classes), meaning that the object is suspicious since it satisfies descriptions of both classes. Multiple solutions are produced by LID when the final similitude term is satisfied by cases of both classes. Despite k being an odd number, the k-NN method can also give multiple solutions. This happens when there are several cases at the same distance from the problem p, since in this situation all these cases are retrieved. For instance, let $c_1$, $c_2$, $c_3$ and $c_4$ be cases such that $d(p, c_1) = d_1$, $d(p, c_2) = d_2$ and $d(p, c_3) = d(p, c_4) = d_3$. Although k = 3, the four cases are retrieved as the most similar to p, and therefore a tie between the two solution classes is possible.
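The retrieval with ties described above can be illustrated with a short sketch. It assumes the distances to p have already been computed and that every case at the same distance as the k-th nearest one is retrieved; the function names and the numeric distances are hypothetical.

```python
from collections import Counter


def retrieve_with_ties(scored_cases, k=3):
    """Return the k nearest cases plus any case tied with the k-th one.

    scored_cases is a non-empty list of (distance_to_p, class_label) pairs.
    """
    ranked = sorted(scored_cases, key=lambda case: case[0])
    if len(ranked) <= k:
        return ranked
    cutoff = ranked[k - 1][0]
    return [case for case in ranked if case[0] <= cutoff]


def classify_with_ties(scored_cases, k=3):
    """Return the set of most voted classes; two classes mean 'suspicious'."""
    votes = Counter(label for _, label in retrieve_with_ties(scored_cases, k))
    top = max(votes.values())
    return {label for label, count in votes.items() if count == top}


# c1 and c2 are closer than c3 and c4, which are at the same distance.
scored = [(1, "MM"), (2, "benign"), (3, "MM"), (3, "benign"), (5, "benign")]
print(retrieve_with_ties(scored, k=3))
print(classify_with_ties(scored, k=3))
```

With k = 3 the four nearest cases are retrieved, two of each class, so the outcome is a multiple solution.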

We compare the predictivity of the methods using ROC curves (Fawcett, 2006). A ROC curve is a technique for visualizing, organizing and selecting classifiers based on their performance. This kind of representation is especially useful when error costs are not the same for all classes. For instance, in our domain it is clearly worse to consider a malignant lesion as benign than the inverse. ROC curves take into account the true positive rate (TPR) and the false positive rate (FPR) for the comparison of methods.

Figure 2: ROC curves corresponding to the 100 experiments conducted with decision trees, k-NN and LID. [Plot: axes FPR, TPR.]

These measures are calculated from the sensitivity (SE) and the specificity (SP), defined as

$$SP = \frac{TN}{TN + FP} \qquad SE = \frac{TP}{TP + FN}$$

where TP stands for true positive, TN for true negative, FP for false positive, and FN for false negative. The TPR is the sensitivity and the FPR is 1 − SP.
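As a quick illustration of these definitions, the following sketch derives the four measures from confusion counts. The function name and the counts are hypothetical and are not taken from the experiments.

```python
def rates(tp, fn, tn, fp):
    """Sensitivity, specificity and the ROC coordinates from confusion counts."""
    se = tp / (tp + fn)   # sensitivity: fraction of MM that are detected
    sp = tn / (tn + fp)   # specificity: fraction of benign lesions kept benign
    return {"SE": se, "SP": sp, "TPR": se, "FPR": 1 - sp}


# Hypothetical test set: 10 MM (6 detected) and 30 benign (3 flagged as MM).
print(rates(tp=6, fn=4, tn=27, fp=3))
```

Here the sensitivity is 6/10 and the specificity is 27/30, so the corresponding ROC point lies at a TPR of 0.6 and an FPR of about 0.1.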

One point (FPR, TPR) in the ROC space is better than another if it lies to the northwest of it in the graph, i.e., if its TPR is higher, its FPR is lower, or both. A classifier represented by a point on the diagonal has random behavior. Points above the diagonal mean that the classifier exploits some information from the data. Points below the diagonal mean that the classifier performs worse than a random classifier. Thus, a point is better than another if TP is higher and FP is lower. Moreover, given two points $(FP_1, TP_1)$ and $(FP_2, TP_2)$ such that $FP_1 < FP_2$ and $TP_1 < TP_2$, the performance of the two methods is incomparable and the cost of false positives has to be taken into account in order to choose between them.

A common way to do this comparison is to calculate the area under the ROC curve (AUC). The range of the AUC is the interval [0, 1]; however, because the area corresponding to a random classifier is 0.5, i.e., the area under the diagonal, the AUC of a useful classifier should be greater than 0.5. The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. See (Fawcett, 2006) for an excellent tutorial on ROC curves.

For each one of the experiments (100 for each method) we calculated the TPR and the FPR. Because a lesion can be classified as suspicious (i.e., multiple classification), we do not take multiple answers of the methods into account in the evaluation of the results. In other words, the sensitivity and specificity have been calculated taking into account only the classifications into a single (correct or incorrect) class. The predictivity of decision trees is 76.47% and they never classify a lesion as suspicious (multiple classification). k-NN classifies around 3% of the lesions as suspicious, and LID around 1.5%. The predictivity of the lazy methods is around 78% for k-NN and around 74% for LID.

The ROC curves present a different picture of the performance of these methods (Fig. 2): although the accuracy of the lazy methods is close to that of decision trees (slightly lower for LID and slightly higher for k-NN, see Table 1), they perform better when TP and FP are analyzed. Thus we see that decision trees have the lowest rate of true positives, meaning that their accuracy comes mainly from the correct prediction of benign lesions. ROC curves also show that k-NN has a low FPR, meaning that when it classifies a lesion as MM it rarely fails, i.e., it rarely classifies a benign lesion as MM. LID is the method that captures the highest number of MM, but it is also the method producing the highest number of false positives. Table 1 shows the average of the 100 experiments in terms of TPR and FPR. When using AUC to compare the performance of the methods, we see that the best one is LID, although both lazy methods clearly outperform decision trees.

In fact, we would like to have a classiﬁer with the

TPR of LID and the speciﬁcity of k-NN. For this rea-

son we combined both methods in a single classiﬁer

using the following rule (say R1):

1. if k-NN classiﬁes as MM then return MM

2. if LID classiﬁes as MM then return MM

3. otherwise return benign
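Rule R1 translates directly into code. In this sketch each method's answer is assumed to be one of the strings "MM", "benign" or "suspicious" (a multiple solution); that encoding and the function name are illustrative choices.

```python
def r1(knn_answer, lid_answer):
    """Combine the answers of k-NN and LID following rule R1."""
    if knn_answer == "MM":     # 1. k-NN classifies as MM
        return "MM"
    if lid_answer == "MM":     # 2. LID classifies as MM
        return "MM"
    return "benign"            # 3. otherwise


print(r1("benign", "MM"))
print(r1("suspicious", "benign"))
```

Since both branches return MM, R1 behaves as a logical OR over the MM predictions of the two methods, which explains why it can only raise the true positive rate, and also the false positive rate, with respect to each method taken alone.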

The last row of Table 1 shows the average of 100 experiments using the rule R1, and Fig. 3 shows the ROC curves comparing k-NN, LID and R1. We see that the rule R1 produces a TP rate that improves on those produced by both k-NN and LID. Our explanation of this fact, especially of the improvement with respect to LID, is that the two methods characterize different kinds of MM. Currently, the experts are inspecting the results in more detail. However, the performance of R1 concerning FPR is nearer to that of LID than to that of k-NN. We need to conduct more experiments trying different combinations of the methods. Comparing the AUC of the methods, we see that the classifier produced using R1 is the best one.

Table 1: Summary of the results produced by the methods after averaging 100 random experiments.

| Method | Accuracy (%) | TPR | FPR | AUC |
|---|---|---|---|---|
| DT | 76.4665 | 0.3076 | 0.0756 | 0.6160 |
| LID | 74.4008 | 0.5326 | 0.1574 | 0.6876 |
| k-NN | 78.6500 | 0.3750 | 0.0325 | 0.6712 |
| R1 | 76.6991 | 0.5973 | 0.1638 | 0.7167 |

Figure 3: ROC curves comparing the results of k-NN, LID, and the combination of both methods given by the rule R1. [Plot: axes FPR, TPR.]
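The paper does not state how the AUC values were obtained. A hedged observation is that the values in Table 1 are consistent with the area under the two-segment ROC curve that joins (0, 0), the point (FPR, TPR) of the method, and (1, 1), which simplifies to (1 + TPR − FPR)/2. The sketch below checks this reading against the table; the function name is hypothetical.

```python
def single_point_auc(tpr, fpr):
    """Area under the polyline (0, 0) -> (fpr, tpr) -> (1, 1)."""
    left_triangle = fpr * tpr / 2
    right_trapezoid = (1 - fpr) * (tpr + 1) / 2
    return left_triangle + right_trapezoid


# (TPR, FPR, AUC) as reported in Table 1.
table1 = {
    "DT": (0.3076, 0.0756, 0.6160),
    "LID": (0.5326, 0.1574, 0.6876),
    "k-NN": (0.3750, 0.0325, 0.6712),
    "R1": (0.5973, 0.1638, 0.7167),
}
for method, (tpr, fpr, reported) in table1.items():
    print(method, single_point_auc(tpr, fpr), reported)
```

Under this reading, a gain in TPR is offset one-to-one by an equal increase in FPR, so R1 ranks first because its TPR gain over LID (about 0.065) exceeds its FPR increase (about 0.006).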

Concerning the size of the training set, we do not detect any correlation with the accuracy of the lazy learning methods. Some experiments with a large training set, say over 100 objects, have lower accuracy than some experiments with smaller training sets. The same occurs with the decision trees. We think that this result is due to the particularities of the domain, since very similar skin lesions may have very different diagnoses.

5 EXPERIMENTING WITH LID FOR KNOWLEDGE DISCOVERY

The similitude term generated by LID can be interpreted as a partial discriminant description of $C_i$, since all the cases satisfying the similitude term belong to $C_i$ (according to one of the stopping conditions of LID). Therefore, the similitude term can be used as a generalization of knowledge in the sense of either PROTOS (Bareiss et al., 1988), EBL or inductive learning methods. In (Armengol, 2008) the domain theory constructed using similitude terms was analyzed. In the present study we use the same idea to generate a lazy domain theory characterizing skin lesions. In this way, we can benefit from the classification power of lazy learning methods and, in addition, give the experts a basis to formulate a domain model to support the early diagnosis of MM.

We conducted 70 experiments to generate a domain model from the training sets, and then this model was analyzed and compared to the one produced by decision trees. Given a training set T, we use LID with leave-one-out to classify each one of the objects of T. When an object has been correctly classified into a class $C_i$, the similitude term justifying the classification is stored as a correct description for $C_i$. At the end of this process we have a lazy domain theory to predict the classification of the objects belonging to the test set.

Grade = 3, PG_form = dendritic: Benign
Grade = 3, PG_form = dendritic, roundish: MM

Figure 4: Two rules generated by LID: one for the benign class and another for the MM class. We represent them in tree form; however, there is no order or preference among the attributes.

The results of these experiments show that the domain theories generated by LID contain more rules than the theories generated by decision trees. In fact, the LID theory overfits, because many class descriptions are satisfied by only one object of the training set. In particular, LID produces a high number of discriminant descriptions for the class benign and only a few for the class MM. Independently of the size of the training sets, there are subsets of class descriptions that, with small differences, appear in all the experiments. This means that the LID theory is able to recognize both malignant and benign skin lesions that follow the expected regular patterns. For instance, as is well known by dermatologists, most MM have pagetoid infiltration whereas most benign lesions do not. Nevertheless, some benign lesions can present pagetoid infiltration (they are suspicious lesions that are commonly excised and, after biopsy, are classified as benign) and some early MM do not yet present pagetoid infiltration (they are false negatives). For this reason, we focused especially on the descriptions of the lazy theories for the benign class that include some of the attributes related to pagetoid infiltration. Figure 4 shows two of the rules generated by LID, both related to pagetoid infiltration of grade 3. One of the rules states that a lesion is benign because, in addition to having infiltration of grade 3, the form of the pagetoid cells is dendritic. The other rule states that a lesion is an MM because it has infiltration of grade 3 and the pagetoid cells are of two types: dendritic and roundish. In fact, experts know that the presence of roundish pagetoid cells is indicative of MM.

pagetoid-infiltration = yes
|   DP_irregular = 0: Benign (…)
|   DP_irregular = 1: MM (…)
pagetoid-infiltration = no: Benign (…)

junsheetlike-cells = 0: Benign (…)
junsheetlike-cells = 1: MM (…)

Figure 5: Decision trees generated by the J48 algorithm.

Figure 5 shows two decision trees. One of them involves the attribute pagetoid-infiltration and agrees with the dermatologists' knowledge, since most lesions with an irregular dermal papilla (DP_irregular = 1) having pagetoid infiltration are MM. The other decision tree involves only the attribute junsheetlike-cells, which represents the presence or absence of atypical cells arranged in sheet-like structures visualized in the superficial papillary dermis. This kind of cell is indicative of MM. Due to the pruning process of the J48 algorithm, class descriptions are not discriminatory; nevertheless, the predictivity of the theory is around 75% and it never classifies a lesion as suspicious. In contrast, using the lazy domain theory generated by LID, a lesion may be classified as suspicious (around 12% of the time) or may not be classified at all (around 20% of the time). The predictivity of the lazy theory is around 55%.
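The three possible outcomes of the lazy domain theory (one class, suspicious, or not classified) can be sketched as follows. The theory is assumed to be a list of stored similitude terms, each a dictionary of attribute values paired with a class; the two rules mirror those of Figure 4, while the encoding of the multi-valued attribute as a frozenset and the function name are hypothetical.

```python
def classify_with_theory(lesion, theory):
    """Classify a lesion with a list of (similitude term, class) rules."""
    matched = {label for term, label in theory
               if all(lesion.get(attr) == value for attr, value in term.items())}
    if not matched:
        return "unclassified"   # no stored description covers the lesion
    if len(matched) > 1:
        return "suspicious"     # descriptions of both classes are satisfied
    return next(iter(matched))


theory = [
    ({"Grade": 3, "PG_form": frozenset({"dendritic"})}, "benign"),
    ({"Grade": 3, "PG_form": frozenset({"dendritic", "roundish"})}, "MM"),
]
lesion = {"Grade": 3, "PG_form": frozenset({"dendritic", "roundish"})}
print(classify_with_theory(lesion, theory))
print(classify_with_theory({"Grade": 1, "PG_form": frozenset()}, theory))
```

A lesion whose description matches no stored similitude term falls in an area of the problem space that the theory does not characterize, which corresponds to the unclassified outcome.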

Figure 6 shows the ROC curves generated using

the average of TPR and FPR of the 70 random ex-

periments we performed with both LID and decision

trees. From this graphic we see that the theory gen-

erated with LID has a higher rate of TP than the one

generated with the decision tree. However, LID also

produces a higher rate of FP than decision trees. The

AUC of the methods is 0.6160 and 0.677 for deci-

sion trees and LID respectively. Thus, the AUC shows

that, although the mean predictivity of the LID theory

is lower than the one produced by the decision tree, it

has better performance when analyzing TP and FP. In

particular, the LID theory produces the highest num-

ber of TP.

One explanation of the better performance of the LID theory may be its overfitting. The decision tree tries to reduce overfitting and consequently offers compact class descriptions satisfying the known objects. However, that theory can fail to recognize some unseen suspicious lesions that are far from the already known lesions. In contrast, due to overfitting, lazy domain theories have class descriptions that are more specific than those produced by the decision tree. This means that there will be unseen objects that will not be classified because they belong to areas of the space for which there is no characterization. Nevertheless, there are other areas that are better characterized due to this specificity of the descriptions.

[Plot: ROC curves. Horizontal axis: FPR; vertical axis: TPR; legend entry recovered: LID.]

Figure 6: ROC curves comparing the performance of the theories generated using both decision trees and LID.

6 CONCLUSIONS AND FUTURE WORK

In this paper we introduced an approach to the problem of diagnosing early malignant melanomas. Our main goal was to capture as many MMs as possible, even if this also produced a high number of false positives. Another desirable goal was to generate some kind of domain knowledge that could serve as a basis for dermatologists to elaborate an accurate domain theory. Our approach is based on lazy learning methods, even though they do not generate generalizations that could be used as domain knowledge. In particular, we compared the performance of two lazy learning methods, the well-known k-NN and LID, with that of decision trees. Although decision trees produce the highest accuracy, a detailed analysis of the methods in terms of sensitivity and specificity shows that lazy learning methods fit our requirements better. The experiments also show that k-NN and LID seem to be complementary in recognizing MM; we therefore implemented a rule R1 that combines the predictions of both methods. The rule R1, which can be seen as a meta-classifier, improves the true positive ratio, although its false positive ratio is slightly higher than that of LID.
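To illustrate why combining two complementary classifiers can raise the true positive ratio at the cost of some false positives, the sketch below uses a simple OR-style combination: a lesion is labeled MM if either method labels it MM. This is an assumption made for illustration and is not necessarily the definition of R1. The labels, function names, and toy predictions are likewise hypothetical.

```python
MM, BENIGN = "MM", "benign"


def combine_or(pred_knn, pred_lid):
    """Hypothetical combination (not necessarily R1): MM if either method says MM."""
    return MM if MM in (pred_knn, pred_lid) else BENIGN


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity), treating MM as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == MM and p == MM for t, p in pairs)
    fn = sum(t == MM and p != MM for t, p in pairs)
    tn = sum(t != MM and p != MM for t, p in pairs)
    fp = sum(t != MM and p == MM for t, p in pairs)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy data (illustration only): k-NN is conservative, LID flags more lesions.
y_true = [MM, MM, MM, BENIGN, BENIGN, BENIGN]
knn = [MM, BENIGN, BENIGN, BENIGN, BENIGN, BENIGN]
lid = [BENIGN, MM, BENIGN, MM, BENIGN, BENIGN]
combined = [combine_or(a, b) for a, b in zip(knn, lid)]

for name, preds in (("k-NN", knn), ("LID", lid), ("combined", combined)):
    sens, spec = sensitivity_specificity(y_true, preds)
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With an OR-style rule, the combined sensitivity can never be lower than that of either method and the combined specificity can never be higher, which is consistent with the trade-off described above.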

Concerning the generation of a domain model, we proposed to use the explanations of the classifications produced by LID as a domain theory, and we compared it with a domain theory induced by means of a decision tree. The results show that both theories satisfy the expert's expectations equally well; however, the ROC analysis showed that the lazy domain theory captures more MMs. We plan to analyze the domain theory generated by using the rule R1.

It would be interesting to explore the possibility of using some meta-learning method (Prodromidis et al., 2000) to achieve a good combination of k-NN and LID (and perhaps some other method). We conducted some preliminary experiments in this direction; however, we did not obtain the expected results: although the accuracy improves with respect to each method alone, the number of TP does not increase. A possible way to focus on increasing TP could be to use domain knowledge, since some attributes are known to be relevant for characterizing MM (at least those taken into account in the ABCD rule).

KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval

ACKNOWLEDGEMENTS

The authors acknowledge the support of the Spanish projects ARINF (TIN 2009-14704-C03-03) and Next-CBR (TIN 2009-13692-C03-01), and of the grant 2009-SGR-1434 from the Generalitat de Catalunya. The authors also thank Pilar Dellunde and Àngel García-Cerdaña for their helpful comments on this paper.

REFERENCES

Argenziano, G., Zalaudek, I., Ferrara, G., Hofmann-Wellenhof, R., and Soyer, H. (2007). Proposal of a new classification system for melanocytic naevi. Br J Dermatol, 157(2):217–227.

Armengol, E. (2008). Building partial domain theories from explanations. Knowledge Intelligence, 2(8):19–24.

Armengol, E. and Plaza, E. (2001). Lazy induction of descriptions for relational case-based learning. In De Raedt, L. and Flach, P., editors, ECML-2001, number 2167 in Lecture Notes in Artificial Intelligence, pages 13–24. Springer.

Bareiss, E. R., Porter, B. W., and Wier, C. C. (1988). PROTOS: an exemplar-based learning apprentice. Int. J. Man-Mach. Stud., 29(5):549–561.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recogn. Lett., 27:861–874.

Helma, C. and Kramer, S. (2003). A survey of the predictive toxicology challenge 2000-2001. Bioinformatics, pages 1179–1200.

Hofmann-Wellenhof, R., Blum, A., Wolf, I., Zalaudek, I., Piccolo, D., Kerl, H., Garbe, C., and Soyer, H. (2002). Dermoscopic classification of Clark’s nevi (atypical melanocytic nevi). Clin Dermatol, 20(3):255–258.

López de Mántaras, R. (1991). A distance-based attribute selection measure for decision tree induction. Machine Learning, 6:81–92.

Mitchell, T. (1997). Machine Learning. McGraw-Hill International Editions. Computer Science Series.

Mitchell, T., Keller, R., and Kedar-Cabelli, S. (1986). Explanation-based learning: A unifying view. Machine Learning, 1(1):47–80.

Plaza, E., Armengol, E., and Ontañón, S. (2005). The explanatory power of symbolic similarity in case-based reasoning. Artificial Intelligence Review. Special Issue on Explanation in Case-based Reasoning, 24:145–161.

Prodromidis, A., Chan, P., and Stolfo, S. (2000). Meta-learning in distributed data mining systems: Issues and approaches. In Kargupta, H. and Chan, P., editors, Book on Advances of Distributed Data Mining. AAAI Press.

Puig, S., Argenziano, G., Zalaudek, I., Ferrara, G., Palou, J., Massi, D., Hofmann-Wellenhof, R., Soyer, H., and Malvehy, J. (2007). Melanomas that failed dermoscopic detection: a combined clinicodermoscopic approach for not missing melanoma. Dermatol Surg, 33(10):1262–1273.

Roth-Berghofer, T. R. (2004). Explanations and case-based reasoning: Foundational issues. In Funk, P. and Calero, P. A. G., editors, Advances in Case-Based Reasoning, pages 389–403. Springer-Verlag.

Vestergaard, M. and Menzies, S. (2008). Automated diagnostic instruments for cutaneous melanoma. Semin Cutan Med Surg, 27(1):32–6.

Witten, I., Frank, E., Trigg, L., Hall, M., Holmes, G., and Cunningham, S. (1999). Weka: Practical machine learning tools and techniques with Java implementations.
