FUZZY DECISION TREE LEARNING FOR PREOPERATIVE
CLASSIFICATION OF ADNEXAL MASSES
Emad Ahmadi
Digestive Diseases Research Center, Shariati Hospital, Tehran University of Medical Sciences, 14117 Tehran, Iran
Hoda Javadi
Medical Education and Development Center, Iran University of Medical Sciences, 14496 Tehran, Iran
Amin Khansefid
Department of Electrical and Computer Engineering, University of Tehran, 14396 Tehran, Iran
Atousa Asadi, Mohammad Mehdi Ebadzadeh
Department of Computer Engineering, Amirkabir University of Technology, 15875 Tehran, Iran
Dirk Timmerman
Department of Obstetrics and Gynecology, University Hospitals Leuven, Herestraat 49, B-3000 Leuven, Belgium
Keywords: Decision tree learning, Fuzzy logic, Adnexal diseases/diagnosis.
Abstract: The study problem was to learn a fuzzy decision tree that classifies patients with an adnexal mass into either the benign or the malignant class prior to surgery, using the patients' medical history, physical exam, laboratory tests, and ultrasonography. A learning algorithm was developed to learn a fuzzy decision tree in three steps. In the growing step, a binary decision tree was learned from a dataset of patients, with fuzzy discretization used in decision nodes testing continuous attributes. The best degree of fuzziness was found automatically by an algorithm based on optimization procedures. In the pruning step, overfitted nodes were removed by an algorithm based on the critical value post-pruning method. In the refitting step, the labels of the leaf nodes were optimized. The final tree had 10 decision nodes and 11 leaf nodes. Performance testing of the tree gave an area under the ROC curve of 0.91 and a mean squared error of 0.1. The tree was translated into a set of 11 fuzzy if-then rules, and the clinical plausibility of the rules was assessed by domain experts. All rules were verified to be in agreement with medical knowledge in the domain. Despite the small learning set and the lack of some important input variables, this method gave accurate and, more importantly, clinically interpretable results.
1 INTRODUCTION
In the pool of more than 20 diseases causing adnexal
mass, malignant lesions should be differentiated
from benign lesions: benign lesions should not undergo surgery unless they are symptomatic or cause subfertility, whereas malignant lesions should be removed surgically (Hoffman, 2009). Ovarian
cancers, comprising the majority of malignant
adnexal masses, can spread quickly in the abdominal
cavity and involve organs like diaphragm and bowel
(Schaffer, 2008). Performing surgery on such organs
is beyond the scope of general gynecology; therefore
patients with malignant adnexal mass should be
operated on by gynecologic oncologists who have
sufficient expertise in such operations (Mann et al.,
2009). Thus, malignant and benign adnexal masses
should be differentiated prior to surgery in order to
refer patients with malignant lesions to gynecologic
oncologists as well as withholding surgery for
innocent benign lesions.
No single imaging or laboratory study has been able
to accurately differentiate malignant from benign
adnexal masses (Myers et al., 2006). At the same time, a few experts can accurately differentiate malignant from benign masses prior to surgery using the patient's history, physical exam, laboratory tests, and ultrasonography results. This
observation led to the hypothesis that a combination of patient data can accurately differentiate adnexal masses. A series of studies has tried to simulate the experts' reasoning process for classification of adnexal masses, but none has been implemented in routine clinical practice (Hoffman, 2009). The reason is that simple models such as logistic regression are easy for clinicians to interpret but not accurate enough, while complex models such as advanced kernel-based methods are accurate but not
interpretable by clinicians.
Ethical and legal issues do not allow clinicians to
make their therapeutic decisions based on outputs
coming out of black-box models without knowing
how the outputs are derived. In short, a model that is both accurate and interpretable by clinicians is still lacking. A review of medical textbooks and journals shows that combinations of if-then rules and decision trees are the most widely used medical decision-making tools. Given these facts, we set out to build a decision tree for preoperative classification of adnexal masses, which could then be translated into a set of if-then rules. Because the input data is nondeterministic with respect to malignancy, fuzzy inference was used to manage the uncertainty associated with the data. The resulting
fuzzy decision tree was then translated into a set of
fuzzy if-then rules, which are interpretable by
clinicians and can be criticized and amended based
on medical knowledge in the domain.
The paper is organized as follows: Section 2
introduces the basic ID3 decision tree learning
algorithm, its extension to use continuous attributes
as input variables, the concept of nondeterministic
data and overfitting in decision tree learning, and
fuzzy decision trees. Section 3 defines the learning
problem and the steps used to learn a fuzzy decision
tree. Section 4 explains the exact methods by which
the fuzzy decision tree was learned from the dataset.
Section 5 explains the post-pruning method used to
eliminate overfitting. Section 6 describes a refitting
method used to further improve fuzzy decision tree
classification generalizability. Section 7 reviews the
dataset used in this study. Section 8 presents the
final parameters chosen for implementing the
learning task and the results of the final tree testing.
Sections 9 and 10 discuss the strengths of this study, the inductive bias associated with decision tree learning, and the conclusions of the study.
2 INTRODUCTION TO FUZZY
DECISION TREE LEARNING
An algorithm which learns a decision tree from a
dataset of patients is said to learn a decision tree
from patients’ data in the training dataset. The target
function of this algorithm is the best decision tree that classifies cases into either the benign or the malignant class. The simplest situation for making decision trees is when all attributes are binary, meaning each attribute can take only the value 0 or 1. For making decision trees using such
attributes, the decision tree learning algorithm
follows these steps (Mitchell, 1997):
1- Create a tree by making a root node with one
left child and one right child
2- Using the first attribute, send training examples
with the value of 0 to the left child and
examples with the value of 1 to the right child
3- Assess the pooled purity of left and right
children for output classes
4- Redo 2 and 3 for all attributes, saving the
pooled purity caused by each attribute
5- Assess which attribute has resulted in the
maximum pooled purity; Assign that attribute to
the root
6- If the left child is pure for one class, make it a
leaf node and assign its label that class
7- If the left child is not pure for one class, go to 1
and start making a subtree rooted in the left
child
8- If the right child is pure for one class, make it a
leaf node and assign its label that class
9- If the right child is not pure for one class, go to
1 and start making a subtree rooted in the right
child
The above steps are followed by all decision tree learning algorithms to grow the tree by making children for nodes recursively, until criterion 6 or 8 is met, at which point the node is turned into a leaf and no more children are made for it. A binary attribute selected for a node is removed from the list of attributes that can be used by the descendants of that node.
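The sketch below is an illustrative Python rendering of the recursive growing procedure just described, for binary attributes and two classes; the toy data, the misclassification-count purity measure, and all names are assumptions made for illustration (the study's own code was written in MATLAB).

```python
# Illustrative sketch of the recursive growing steps above (binary attributes,
# two classes); purity is measured here by the pooled misclassification count.
import numpy as np

def grow(X, y, attrs):
    """X: (n_cases, n_attrs) 0/1 matrix; y: 0/1 labels; attrs: usable attribute indices."""
    # Steps 6 and 8: a pure node (or a node with no attribute left) becomes a leaf.
    if len(y) == 0 or len(set(y)) == 1 or not attrs:
        label = int(round(y.mean())) if len(y) else 0
        return {"leaf": True, "label": label}
    # Steps 2-5: choose the attribute whose split gives the purest pooled children.
    def pooled_impurity(a):
        err = lambda part: 0 if len(part) == 0 else min(part.sum(), len(part) - part.sum())
        return err(y[X[:, a] == 0]) + err(y[X[:, a] == 1])
    best = min(attrs, key=pooled_impurity)
    remaining = [a for a in attrs if a != best]            # a used attribute is not reused below
    mask = X[:, best] == 0
    return {"leaf": False, "attr": best,
            "left": grow(X[mask], y[mask], remaining),     # step 7
            "right": grow(X[~mask], y[~mask], remaining)}  # step 9

# toy usage with two binary attributes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
print(grow(X, y, attrs=[0, 1]))
```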
2.1 Nondeterministic Data and
Overfitting
Other than the variables recorded in the dataset,
there might be other variables affecting the output
which are not recorded in the dataset. This is the
usual case in modeling medical problems, where the model has to predict the output using attributes none of which has a direct effect on the output (and none of which is directly caused by the output). In
this study, the only attribute which can directly
determine the malignancy is the pathology results,
but it cannot be used as an input in our model
(because measuring this attribute needs surgery and
is invasive). As a result, the model has to predict the
malignancy by using attributes which are neither
directly caused by the malignancy nor have direct
effect on the malignancy, but are noticed to have
interactions with the malignancy (e.g. malignant
tumors often, but not always, have bigger sizes).
When data is not deterministic, the output cannot be
absolutely predicted by any combination of the
attributes. Thus, even the best models will have a
degree of inaccuracy, called residual error.
When residual error is present, even the best
attributes in the final test nodes cannot make
absolutely pure children. If residual error is not
recognized, the learning algorithm tries to make
absolutely pure leaves while it is not possible by
using any attributes. The learning algorithm
continues to make children for nodes recursively, leading to small numbers of cases in the bottom nodes of an excessively grown tree. At this stage, because each test node contains only a few cases, there is a high probability that some attribute happens to take different values for cases of different classes. Selecting such an attribute correctly separates the training cases that have reached that node (they are separated by chance, not by the attribute), but it leads to incorrect classification of cases reaching that node when the tree is later tested on a separate dataset, because the same chance is unlikely to recur. The learning algorithm will therefore select irrelevant attributes for multiple bottom test nodes, resulting in a tree overfitted to the training dataset.
To prevent overfitting, the learner has to recognize
residual error and turn the node into a leaf if a
sufficient amount of purity, consistent with residual
error, is met. Another approach is to let the tree become overfitted and then post-prune it into an optimal
decision tree. This approach is used in this study and
is introduced in section 5.
2.2 Crisp Discretization of Continuous
Attributes
If attributes are continuous rather than binary, the
second step of the learning algorithm becomes more
elaborate. The learner should test multiple thresholds
for the first attribute, sending cases with the attribute
value of less than threshold to the left child, and
cases with the attribute value of more than threshold
to the right child. The pooled purity of children will
be assessed for each threshold, and the best
threshold is selected for that attribute.
Then the same process will be repeated for all
attributes, assessing the pooled purity of children for
each threshold of each attribute. The best attribute
with its best threshold is finally assigned to the node.
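As an illustration of this crisp threshold search, the following Python fragment (an assumption-laden sketch, not the study's MATLAB implementation) tries the midpoints between consecutive sorted attribute values and keeps the threshold whose pooled children are purest.

```python
# Sketch of the crisp threshold search described above for one continuous
# attribute: candidate cuts are midpoints between consecutive sorted values.
import numpy as np

def best_threshold(values, y):
    """values: 1-D continuous attribute; y: 0/1 labels. Returns (threshold, pooled_error)."""
    order = np.argsort(values)
    v, labels = values[order], y[order]
    best = (None, np.inf)
    for i in range(len(v) - 1):
        if v[i] == v[i + 1]:
            continue
        thr = (v[i] + v[i + 1]) / 2.0                     # candidate cut between two cases
        left, right = labels[v <= thr], labels[v > thr]
        err = lambda part: min(part.sum(), len(part) - part.sum())
        pooled = err(left) + err(right)
        if pooled < best[1]:
            best = (thr, pooled)
    return best

# toy usage with made-up attribute values
print(best_threshold(np.array([35.0, 120.0, 600.0, 40.0]), np.array([0, 1, 1, 0])))
```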
2.3 Fuzzy Decision Trees
Assume a patient being classified by a conventional
decision tree. In each test node, a single attribute is
tested using a single threshold having two possible
answers: less than threshold and more than
threshold. The attribute space of the node (the local
attribute space) is thus split into two non-
overlapping subspaces, as shown in figure 1.
Patients with value of the tested attribute less than
threshold go to the left child, while cases with
attribute value more than threshold go to the right
child. To be classified, a new patient starts at the root node and is tested sequentially in multiple test nodes until reaching a leaf. All patients reaching a leaf
will be assigned to the same class corresponding to
that leaf. In summary, each patient follows a single
path, reaches a single leaf, and is assigned the class
stored in that leaf.
Instead of defining crisp sets, we can define two
fuzzy sets for members of the left and right children
using a smooth and overlapping fuzzy discriminator
function for continuous attributes tested in the test
nodes (Olaru and Louis, 2003). Each fuzzy test node tests a single attribute using two parameters that characterize the fuzzy discriminator function: the threshold, which is the cutpoint, and the width, which defines the overlapping region of the left and right children. The local attribute
space is thus split into two overlapping subspaces. In
a fuzzy decision tree, a case can be classified by
being propagated through multiple paths in the tree
and reaching multiple leaves, if the case is situated
in the overlapping region of some test nodes. At the
Figure 1: Conventional decision tree testing two attributes and its corresponding fuzzy decision tree. In conventional
decision trees, the attribute space is partitioned into non-overlapping subspaces in which each case is assigned to a single
class. In fuzzy decision trees, the attribute space is partitioned into overlapping subspaces by fuzzy boundaries. If a case is
situated in the overlapping area, it may belong to both classes with different degrees of membership.
end, the case might have reached one or multiple
leaves with different membership values. The class
estimations given by all these leaves are then
aggregated through some defuzzification process to
determine the final estimated membership value of
the case in each output class.
3 THE LEARNING PROBLEM
In this study, we have a concept learning problem. Let us use c to denote the malignancy concept. Then c(x) denotes whether or not patient x is a member of the malignancy class (section 7, equation 6), and ĉ(x) denotes the degree of membership of patient x in the malignancy class estimated by the tree.
In this study, the membership-value-weighted average of the leaf labels was used to calculate the patient's estimated membership value in the malignancy class:

\hat{c}(x) = \frac{\sum_{l \in \mathrm{leaves}} \mu_l(x) \cdot L_l}{\sum_{l \in \mathrm{leaves}} \mu_l(x)}    (1)

where μ_l(x) denotes the patient's degree of membership in the l-th leaf and L_l denotes the label of the l-th leaf. The label of a node is the class estimation of that node for cases reaching that node. While the label of a leaf in a non-fuzzy decision tree is the name of one class, the label of a leaf in a fuzzy decision tree can be the fuzzy degree of membership in one class. When there are only two output classes, as in this study, we can define the label of each leaf as the fuzzy degree of membership in one class, here the malignancy class. The denominator equals the patient's membership value in the root and is equal to one in this study.
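As a concrete illustration of equation 1, the Python sketch below (illustrative only; the study used MATLAB) computes the defuzzified malignancy estimate for one patient from the leaf memberships and leaf labels.

```python
# Sketch of equation 1: defuzzification by the membership-weighted average of
# the leaf labels; the denominator equals the root membership (1 in this study).
import numpy as np

def estimate_malignancy(leaf_memberships, leaf_labels):
    """leaf_memberships: mu_l(x) for every leaf; leaf_labels: L_l in [0, 1]."""
    mu = np.asarray(leaf_memberships, dtype=float)
    L = np.asarray(leaf_labels, dtype=float)
    return float(np.sum(mu * L) / np.sum(mu))

# a patient reaching two leaves with memberships 0.7 and 0.3 (made-up numbers)
print(estimate_malignancy([0.7, 0.3], [0.9, 0.2]))   # 0.69
```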
A fuzzy decision tree is an approximation structure
for computing the degree of membership of patients
to a particular class, as a function of patients’
attribute values. The term attribute is used to denote the input parameters used in the decision tree test nodes for classifying patients, the term instances denotes the set of all possible patients with any possible attribute values, which we denote by X, and the term example denotes the patient's attributes-output pairs provided in the training dataset. The process of fuzzy decision tree making was automatically done by supervised learning from
examples in the training dataset. The learning
problem could then be defined as:
Learning Task. Classifying patients into either the benign or the malignant class using the patients' attributes.
Target Function. The function that maps the instance space to whether the patient is a member of the malignancy class:

c: X → {0, 1}

Target Function Representation. The function, equivalent to the fuzzy decision tree, that maps the instance space to the patient's degree of membership in the malignancy class:

ĉ: X → [0, 1]

Training Experience. Using the examples in the training dataset to make a fuzzy decision tree that calculates ĉ(x).
Performance Measure. The minimized squared error between the vector of outputs estimated by ĉ and the vector of real outputs:

E_{perf} = \left\| [\hat{c}] - [c] \right\|^2 = \left( [\hat{c}] - [c] \right)^T \cdot \left( [\hat{c}] - [c] \right)    (2)

where E_perf denotes the performance error, [ĉ] is the vector of the estimated classes of the patients computed by ĉ, containing values in the interval [0, 1], and [c] is the vector of the real classes of the patients provided in the training dataset, containing values from the set {0, 1}.
For making an optimal fuzzy decision tree, a process
of three steps was used (Olaru and Louis, 2003).
First, a sufficiently large fuzzy decision tree was
made in the growing step, using a subset of the
dataset called the growing set (GS). In this step, test
nodes are consecutively added in a top-down fashion, until one of the stopping criteria is met. At
the end of this step, a large (and presumably
overfitted) fuzzy decision tree is made.
Then, in the pruning step, the overfitted nodes of the
grown tree were pruned in a bottom-up fashion. A
cross-validation method was used for this step, using
a separate subset of the dataset called the pruning set
(PS).
Finally, in the refitting step, the labels of the leaves
of the pruned tree were tuned to optimize the
decision tree performance. This step used the whole
learning set (LS), including all cases of both
growing and pruning sets. At the end, the tree was
tested to assess its performance on a dataset separate
from the learning set.
All algorithms were coded and run in MATLAB (version 7.8.0.347 (R2009a), The MathWorks Inc.).
4 GROWING METHOD
We used a modified version of the method
introduced by Olaru et al (Olaru and Louis, 2003).
Figure 2 shows the split of a tree node N, corresponding to a fuzzy set, into two fuzzy subsets: N_L as the left child and N_R as the right child, based on the attribute a_N chosen in node N. Each test node is associated with a discriminator function f_N which determines the degree of membership of each patient in the left child from the patient's attribute value a_N(x). A widely used fuzzy discriminator function is the simple linear piecewise function, as shown in figure 3.
Figure 2: Fuzzy split of test node N into left and right children N_L and N_R. L_N: test node label; E_N: test node error; a_N: test node selected attribute; α_N: test node attribute threshold; β_N: test node attribute width; [μ_N(x)]: patients' membership values in the test node; f_N(x): discriminator function value for patient x. Corresponding left and right child features are shown by subscripts L and R, respectively. The right child is a leaf and thus does not have attribute-related features. Note that f_N(x) is not a feature stored in the test node, but can be calculated from the patient's attribute value and the test node discriminator function parameters.
Left and right child membership values can then be calculated from the patient's membership value in the test node, μ_N(x), and the patient's discriminator function value, f_N(x), which itself depends on the test node attribute threshold α_N, the width β_N, and the patient's attribute value a_N(x).
Figure 3: Linear piecewise discriminator function for test node N. f_N: discriminator function; a_N(x): patient's attribute value; f_N(x): left child discriminator function value; α_N: test node attribute threshold; β_N: test node attribute width. The inclining line characterizes the right child discriminator function value, which can be calculated by subtracting the left child discriminator function value from one.
The child membership values and the linear piecewise discriminator function are:

\mu_L(x) = \mu_N(x) \cdot f_N(x)

\mu_R(x) = \mu_N(x) - \mu_L(x) = \mu_N(x) \cdot \left( 1 - f_N(x) \right)

f_N(x) = \begin{cases} 1, & a_N(x) \le \alpha_N - \beta_N \\ 0, & a_N(x) > \alpha_N + \beta_N \\ \dfrac{\alpha_N + \beta_N - a_N(x)}{2 \beta_N}, & \alpha_N - \beta_N < a_N(x) \le \alpha_N + \beta_N \end{cases}
The algorithm for making a fuzzy decision tree
follows the basic steps described in section 2.
However, some general concepts used there should
be exactly defined to accommodate fuzzy concepts,
including:
a. A method for selecting the best attribute for
each test node, including selecting the optimum
threshold and width of the fuzzy discriminator
function for each continuous attribute
b. A method for assigning a label to each node
c. Exact definitions for stopping criteria
4.1 Selecting the Best Attribute for the
Test Node, and Assigning Optimum
Labels to Left and Right Children
Objective: given [μ_N(x)], the vector of the membership values of all patients in node N, find the attribute a_N, the threshold α_N and the width β_N (the parameters defining the discriminator function), together with the left and right child labels L_L and L_R, so that the division error E_div of equation 3 is minimized.
Because the parameters of f_N are not yet fixed and the search algorithm has to search for the optimum values of α_N and β_N as well, the value of f_N for each patient depends on both the patient's attribute value a_N(x) and the discriminator function parameters α_N and β_N. The idea behind minimizing E_div in equation 3 is to select the attribute and discriminator function parameters so that, if the patients are divided into left and right children by the discriminator function, the classes estimated for all patients pooled over both children differ as little as possible from their real classes. If a decision tree is composed of a root whose left and right children are leaves, then E_div equals E_perf. For larger trees, E_perf cannot be directly minimized in each test node, and thus E_div is used as the closest possible approximation to E_perf that can be minimized in each test node.
The algorithm selects the first attribute, searches over multiple values of both α_N and β_N, reshapes the discriminator function accordingly, and calculates [μ_L(x)], the vector of membership values of all patients in the left child, and [μ_R(x)], the vector of membership values of all patients in the right child. Then the algorithm calculates the optimum values of L_L and L_R for the selected a_N, α_N and β_N.
Assuming that , and are selected and are
temporarily fixed, the optimum values of
and
to minimize

are achieved by getting the partial
derivative of

with respect to
and
and
making them equal to zero:



=0



=0
Because of the quadratic shape of

as a function
of
and
, solving the above equations will surely
give the unique global minimum of

. Solving the
above equations will give us equations 4 and 5. By
solving this linear system in
and
, we will have
the formulas for calculating the optimum values of
and
at each fixed and :
=
. −.
−.
=
. −.
−.
where , , , and are all sums computed from
() , () , and 
(
)
:
=
(
)
.
(
)
∈
=
(
)
.
(
)
.1−
(
)

∈
=
(
)
.1
(
)

∈
=
(
)
.
(
)
.
(
)
∈
=−
(
)
.
(
)
.1
(
)

∈

=
(
)
.
(
)
(

(
)
,,
)
.
∈
+
1−
(

(
)
,,
)
.

(3)
-2 \sum_{x \in GS} \mu_N(x) \cdot f_N(x) \cdot \left[ c(x) - \left( f_N(x) \cdot L_L + \left(1 - f_N(x)\right) \cdot L_R \right) \right] = 0    (4)

-2 \sum_{x \in GS} \mu_N(x) \cdot \left( 1 - f_N(x) \right) \cdot \left[ c(x) - \left( f_N(x) \cdot L_L + \left(1 - f_N(x)\right) \cdot L_R \right) \right] = 0    (5)
Then E_div is calculated using the temporary values of α_N, β_N, L_L, and L_R. The parameters that minimize E_div are selected for the discriminator function of this attribute.
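The closed-form solution above can be sketched in a few lines of Python (the names p, q, r, s, t mirror the sums as reconstructed here; this is an illustration, not the study's MATLAB code):

```python
# Sketch of the closed-form child labels: with a_N, alpha_N, beta_N temporarily
# fixed, the labels L_L and L_R minimizing the quadratic division error solve a
# 2x2 linear system built from the sums p, q, r, s, t defined above.
import numpy as np

def optimal_child_labels(mu, f, c):
    """mu: memberships in the node; f: discriminator values; c: true classes (0/1)."""
    p = np.sum(mu * f * f)
    q = np.sum(mu * f * c)
    r = np.sum(mu * f * (1.0 - f))
    s = np.sum(mu * (1.0 - f) ** 2)
    t = np.sum(mu * (1.0 - f) * c)
    det = p * s - r ** 2
    L_L = (q * s - r * t) / det
    L_R = (p * t - r * q) / det
    e_div = np.sum(mu * (c - (f * L_L + (1.0 - f) * L_R)) ** 2)   # equation 3
    return L_L, L_R, e_div

# toy usage: four patients fully in the node, perfectly separated by f
print(optimal_child_labels(np.ones(4),
                           np.array([1.0, 1.0, 0.0, 0.0]),
                           np.array([0.0, 0.0, 1.0, 1.0])))       # (0.0, 1.0, 0.0)
```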
4.2 Searching for α and β
We developed an algorithm to find the optimum values of α and β for each attribute. The algorithm sorts the cases based on their attribute values (eliminating duplicate values), assigns α the mean of the attribute values of two patients in the dataset, and assigns β the difference between the attribute values of the same two patients. The algorithm repeats this process for all pairs of attribute values and calculates p, q, r, s, t, and E_div for each of them. When the two selected attribute values come from a single patient, the split is crisp (with β equal to 0); thus the algorithm has no built-in tendency toward fuzzy splitting.
This algorithm performs better than simply scanning the interval from the minimum to the maximum value of the attribute (the attribute range) while changing α and β in small increments. In fact, we first tried to search for α and β by changing α from the minimum to the maximum in small increments δ, and changing β from 0 to min(α − min(a), max(a) − α) in small increments for each α value. Because various attributes have different units of measurement, the value of δ could not be defined as a single number; it had to be defined as a fraction of the attribute range. Most attributes had some very large values for a few patients. For example, while the CA125 attribute value is less than 500 for most patients, it exceeds 5000 for a few patients. If we defined δ = range(CA125)/100, then δ would be more than 50, making the search algorithm inefficient. We additionally tried to eliminate outlier attribute values by discarding values outside the interval

mean(a) ± 2·SD(a)

where SD(a) denotes the standard deviation of the attribute values. This approach did not yield small enough values of δ either, because the distributions of most attributes were heavily skewed with long right tails, making this way of eliminating outliers inefficient.
On the other hand, the algorithm we developed searches densely for α in high-density areas of the attribute distribution, while only minimal searching is done over low-density areas of the distribution.
The final parameters selected for the discriminator function of the attribute were the ones minimizing E_div.
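A compact Python sketch of this candidate generation follows (illustrative only; the attribute values in the usage line are invented):

```python
# Sketch of the (alpha, beta) candidate search: candidates come from pairs of
# distinct observed attribute values, so the search is dense where the data are
# dense; picking the same value twice gives a crisp split (beta = 0).
import numpy as np

def alpha_beta_candidates(values):
    v = np.unique(values)                     # sort and drop duplicate values
    candidates = []
    for i in range(len(v)):
        for j in range(i, len(v)):            # j == i yields beta = 0 (crisp split)
            alpha = (v[i] + v[j]) / 2.0       # mean of the two attribute values
            beta = v[j] - v[i]                # difference between the two values
            candidates.append((alpha, beta))
    return candidates

print(alpha_beta_candidates(np.array([35.0, 40.0, 120.0])))
```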
4.3 Defining the Stopping Criteria
The stopping criteria used in the non-fuzzy, binary ID3 algorithm (Mitchell, 1997) should be generalized to include the fuzzy membership of patients in each node. The stopping criteria in the ID3 algorithm are: a limited number of members in the node, sufficient purity of the node members, and consumption of all attributes in the ancestor nodes so that no attribute remains for further splitting.
The cardinality of a fuzzy set is defined as the sum of the membership values of all its members:

|N| = \sum_{x \in N} \mu_N(x)

As a measure of the purity of the node members, the node error is defined as:

E_N = \sum_{x \in N} \mu_N(x) \cdot \left[ c(x) - L_N \right]^2

where E_N denotes the node error, μ_N(x) denotes the patient's membership value in the node, and L_N denotes the node label.
The third criterion of the ID3 algorithm can be conceptualized as the inability of the best selected attribute to further purify the successor children. When all attributes have already been used in the ancestor nodes, they can be used again but will not further purify the successor children. This concept is close to the concept of E_div: if the E_div of the best selected attribute for the node is high, it means that even the best attribute cannot further purify the successor children.
Finally, we can define the generalized stopping criteria to include fuzzy concepts:
a. |N| ≤ |N|_thr
b. E_N ≤ E_{N,thr}
c. E_div ≥ E_{div,thr}
where |N| denotes the cardinality of the set of node members, |N|_thr denotes the cardinality threshold, E_N denotes the node error, E_{N,thr} denotes the node error threshold, E_div denotes the division error, and E_{div,thr} denotes the division error threshold.
By decreasing the values of |N|_thr and E_{N,thr} and increasing the value of E_{div,thr}, the resulting tree will be bigger and, presumably, more overfitted. Because the grown tree is pruned later, the stopping criteria should be tuned so that the tree grows large enough without concern about overfitting. However, the stopping criteria should not be set so loosely that the growing process needs a plentiful amount of computational time to make an unnecessarily overgrown tree.
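The generalized stopping test can be sketched as follows; the default thresholds mirror the values reported in section 8, and everything else is an illustrative assumption.

```python
# Sketch of the generalized stopping test: a node becomes a leaf when its fuzzy
# cardinality or its node error is small enough, or when even the best split
# leaves a large division error.  Default thresholds follow Section 8 (4, 4, 30).
import numpy as np

def should_stop(mu, c, label, best_e_div, card_thr=4.0, node_err_thr=4.0, div_err_thr=30.0):
    cardinality = np.sum(mu)                          # |N|: sum of membership values
    node_error = np.sum(mu * (c - label) ** 2)        # E_N
    return (cardinality <= card_thr) or (node_error <= node_err_thr) or (best_e_div >= div_err_thr)
```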
5 PRUNING METHOD
Objective: given a grown fuzzy decision tree (FDT) and a pruning set, find, among all subtrees which can be generated from the FDT, the subtree that has the minimum mean absolute error (MAE) on the pruning set:

MAE = \frac{1}{|PS|} \sum_{x \in PS} \left| c(x) - \hat{c}(x) \right|
A subtree of the  is made by contracting one or
several test nodes of the FDT. Contracting a test
nodes means replacing the test node with a leaf
(Mingers, 1989). The estimated output class of the
node is assigned the label of the node. Nodes labels
are already calculated in the growing step. The
number of subtrees which can be made from a tree
by contracting its test nodes increases exponentially
with the number of test nodes of the tree; thus
contracting all test nodes of a given tree one by one,
and testing the resulted trees on the pruning set one
by one, takes a considerable amount of
computational time. Therefore, we used a modified
version of the critical value pruning method in the
following three steps:
1- Test node sorting, by increasing order of importance. The importance of each test node is determined by the node error (E_N) calculated in the growing step and saved in each node. The higher the E_N of a test node, the less pure the test node, and thus the more important the test node for differentiation of the output classes in the successor nodes. If a node in this list is placed in a more important position than any of its ancestors, the node is removed from the list, because it would be pruned together with the pruning of that ancestor. In the list, the most important test node is invariably the root.
2- Subtree sequence generation: the previous list gives the order of the critical values for pruning. In critical value pruning, a critical value for the importance of test nodes is determined, and test nodes which are less important than the critical value are pruned, unless one of their successor test nodes reaches the critical value. The larger the critical value selected, the greater the degree of pruning and the smaller the resulting tree. In practice, a sequence of pruned trees is generated using increasing critical values. The previous list gives the order of critical values by which the test nodes are contracted.
In the first step, the first node in the list is contracted, and the resulting tree is saved in a tree sequence. The process continues by contracting the next nodes in the list one by one and saving the resulting trees in the tree sequence. At the end, we have a sequence of trees in decreasing order of complexity.
Before contracting the first test node in the list, the complete tree is tested on the pruning set to calculate its MAE. To do so, the pruning set patients are propagated through the tree, and the membership value of each patient in each node is calculated and saved. Then the output class of each patient estimated by the complete tree is calculated and saved. Afterwards, the MAE of the complete tree is calculated.
Then, every time a test node from the sorted list of important nodes is contracted, the output class of each patient of the pruning set estimated by the new subtree is updated recursively by removing from ĉ(x) the pooled output estimated by the successor leaves of the contracted node, and then adding to ĉ(x) the output estimated by the contracted node, which has become a new leaf:

a. \forall \text{ leaf } l \text{ under node } n, \; \forall x: \quad \hat{c}(x) = \hat{c}(x) - \mu_l(x) \cdot L_l

b. \forall x: \quad \hat{c}(x) = \hat{c}(x) + \mu_n(x) \cdot L_n

where n is the test node being contracted. This process for updating ĉ(x) of all patients eliminates the need to propagate all patients through the pruned subtree at each step of subtree generation. Each step of subtree sequence generation is completed by calculating the MAE of the resulting pruned subtree on the pruning set. Subtree sequence generation finishes when no candidate nodes remain in the sorted list for contracting.
3- Best subtree selection: finally, the best subtree in the sequence is selected. Having the MAE of all subtrees in the sequence, the smallest subtree with the least MAE is selected.
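The three steps can be sketched as a single loop in Python. The data structures (sorted node list, per-node membership vectors, labels, and leaf lists) are assumptions made for illustration, and the bookkeeping needed when an already-contracted descendant later becomes a leaf of a contracted ancestor is omitted for brevity.

```python
# Sketch of critical-value pruning with the incremental update of c_hat:
# nodes are contracted in increasing order of importance (node error), the
# pruning-set estimates are updated in place, and the smallest subtree with
# the lowest mean absolute error is kept.
import numpy as np

def prune(order, mu, label, leaves_under, c_true, c_hat):
    """order: test nodes sorted by increasing importance; mu[n]: pruning-set
    memberships in node n; label[n]: its label; leaves_under[n]: leaves of its
    subtree; c_hat: outputs of the full grown tree on the pruning set."""
    best_mae, best_cut = np.mean(np.abs(c_true - c_hat)), 0
    for k, n in enumerate(order, start=1):
        for leaf in leaves_under[n]:            # step a: remove the subtree's leaf outputs
            c_hat = c_hat - mu[leaf] * label[leaf]
        c_hat = c_hat + mu[n] * label[n]        # step b: node n is now a leaf
        mae = np.mean(np.abs(c_true - c_hat))
        if mae <= best_mae:                     # ties go to the smaller (later) subtree
            best_mae, best_cut = mae, k
    return best_cut, best_mae
```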
6 REFITTING METHOD
In the growing step, the structure of the tree is built and labels are assigned to each node, both based on local optimization strategies. In the pruning step, the structure of the tree is amended by contracting the overfitted test nodes. In the refitting step, the leaf labels are amended based on a global optimization strategy (Olaru and Louis, 2003).
Let us consider the following definitions:

[c]_i = c(x_i), \quad \forall x_i \in LS

[\hat{c}]_i = \hat{c}(x_i), \quad \forall x_i \in LS

[L]_l = L_l, \quad \forall \text{ leaf } l

[\mu]_{i,l} = \mu_l(x_i), \quad \forall \text{ leaf } l, \; \forall x_i \in LS

where l is the index of the leaves in the tree and i is the index of the patients in the learning set. [c], [ĉ], and [L] are defined as column vectors, and [μ] is defined as a matrix of dimension n × m, where n denotes the number of patients in the learning set and m denotes the number of leaves (presumably n > m). Then:
Objective: given the pruned fuzzy decision tree and the learning set as the set of examples for refitting, find amended fuzzy decision tree leaf labels [L]* so that E_perf is minimized:

E_{perf} = \sum_{x \in LS} \left[ c(x) - \hat{c}(x) \right]^2 = \left\| [c] - [\mu] \cdot [L] \right\|^2

[L]^{*} = \arg\min_{[L]} \left\| [c] - [\mu] \cdot [L] \right\|^2

By solving this problem, we obtain the amended leaf labels [L]*. The solution of the above least-squares problem is:

[L]^{*} = \left( [\mu]^T \cdot [\mu] \right)^{-1} \cdot [\mu]^T \cdot [c]
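In code, this refitting step is an ordinary least-squares fit; a minimal Python sketch (illustrative names, not the study's MATLAB code) is:

```python
# Sketch of the refitting step: with the tree structure fixed, the leaf labels
# minimizing the squared error over the learning set are the least-squares
# solution of [mu] . [L] ~= [c], i.e. the normal-equations formula above.
import numpy as np

def refit_leaf_labels(mu_matrix, c_true):
    """mu_matrix: (n_patients, n_leaves) leaf memberships; c_true: 0/1 outputs."""
    labels, *_ = np.linalg.lstsq(mu_matrix, c_true, rcond=None)
    return labels

# toy usage: three patients, two leaves (made-up memberships and outcomes)
mu = np.array([[1.0, 0.0], [0.4, 0.6], [0.0, 1.0]])
c = np.array([0.0, 1.0, 1.0])
print(refit_leaf_labels(mu, c))
```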
7 THE DATASET
A dataset of 305 patients collected for the
International Ovarian Tumor Analysis (IOTA) study
was used in this project. The IOTA study is a
multicenter collaborative project for preoperative
differentiation of ovarian tumors based on predictive
models.
Patients assessed with transvaginal
ultrasonography and found to have an apparent
persistent extrauterine pelvic mass were included in
the IOTA study. Before surgery, clinical, laboratory,
and ultrasonographic data was recorded to be used
as input attributes. Patients then underwent surgical
resection of the mass. All surgically removed tissues
were extensively sampled for histologic
examination. The histologic classification of the
removed tissue (benign or malignant) was recorded
to be used as output, as follows:
c(x) = \begin{cases} 1, & \text{if the histology of the mass of patient } x \text{ is malignant} \\ 0, & \text{if the histology of the mass of patient } x \text{ is benign} \end{cases}    (6)
The dataset contained the following variables:
patient’s age (years), menopausal status
(premenopausal versus postmenopausal), serum
CA125 level (units/mL), 8 sonographic morphologic
variables, 5 color Doppler variables, and pathology
results classification (benign or malignant).
Ultrasonographic examination was done and
reported based on the standard methods already
published by IOTA group (Timmerman et al., 2000).
8 RESULTS
The size of the growing, pruning, and testing sets was 200, 55, and 50, respectively. The values of the stopping criteria were set as |N|_thr equal to 4, E_{N,thr} equal to 4, and E_{div,thr} equal to 30. Smaller values resulted in reaching the maximum recursion limit of MATLAB without returning any tree. Using the above algorithm parameters, a decision tree with 27 nodes was made at the end of the growing step. Pruning resulted in the elimination of 6 overfitted nodes. Finally, the pruned tree was refitted and tested.
For testing the fuzzy decision tree, the following
steps were done:
1- Patients of the testing set were propagated
through the tree, and membership values of all
patients in all leaves were calculated and saved.
Then the output class of all patients of the
testing set estimated by the complete tree was
calculated according to equation 1.
2- The performance error, E_perf, was calculated:

E_{perf} = \sum_{x \in TS} \left[ c(x) - \hat{c}(x) \right]^2
The performance error grows with the size of the testing set, and thus should be normalized by it:

MSE = \frac{E_{perf}}{|TS|}

where |TS| denotes the cardinality of the testing set, equal to the number of patients in the testing set, and MSE denotes the mean squared error reported in table 1.
3- A receiver operating characteristic (ROC) curve was constructed to analyze the performance of the tree in terms of the area under the curve (AUC) of the ROC, as well as to find the optimal cutoff value for the estimated output class.
The optimal point of the ROC was found using a built-in ROC function in MATLAB. The function plots the true positive rate (TPR) versus the false positive rate (FPR), with the resulting ROC parameterized as a function of the cutoff value T:

[x, y] = \left[ \mathrm{FPR}(T), \; \mathrm{TPR}(T) \right]

where x and y denote the coordinates of each point of the plot. The optimal point of the ROC was found by moving a straight line with a slope of one from the upper left corner of the ROC plane (FPR = 0, TPR = 1) down and to the right until it intersected the ROC. Using the coordinates of this point (x = FPR, y = TPR), the optimum cutoff for the estimated output was then computed.
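Geometrically, sliding a slope-one line down from (FPR = 0, TPR = 1) until it touches the ROC selects the point maximizing TPR − FPR; a hand-rolled Python sketch of that cutoff search follows (illustrative only, with made-up scores; the study used a built-in MATLAB routine).

```python
# Sketch of the optimal-cutoff search: the ROC point touched first by a
# descending slope-one line is the one maximizing TPR - FPR, so the cutoff is
# chosen to maximize that difference over the observed scores.
import numpy as np

def roc_optimal_cutoff(scores, truth):
    """scores: estimated malignancy memberships in [0, 1]; truth: 0/1 outcomes."""
    best_cutoff, best_gap = None, -np.inf
    for t in np.unique(scores):
        pred = scores >= t
        tpr = np.mean(pred[truth == 1])      # sensitivity at this cutoff
        fpr = np.mean(pred[truth == 0])      # 1 - specificity at this cutoff
        if tpr - fpr > best_gap:
            best_cutoff, best_gap = t, tpr - fpr
    return best_cutoff

scores = np.array([0.1, 0.3, 0.4, 0.8, 0.9])
truth = np.array([0, 0, 1, 1, 1])
print(roc_optimal_cutoff(scores, truth))     # 0.4 for this toy data
```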
4- Using the coordinates of the optimum point of the ROC, sensitivity and specificity were calculated:

sensitivity = TPR = y, \qquad specificity = 1 - FPR = 1 - x

Then, using sensitivity and specificity, the likelihood ratios for positive (malignant) and negative (benign) results of the tree were calculated:

LR^{+} = \frac{sensitivity}{1 - specificity}, \qquad LR^{-} = \frac{1 - sensitivity}{specificity}
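As a plug-in check with the values reported in table 1 (sensitivity = specificity = 0.90), these formulas give LR+ = 0.90 / (1 − 0.90) = 9.0 and LR− = (1 − 0.90) / 0.90 ≈ 0.11, matching the tabulated likelihood ratios.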
The results of the final fuzzy decision tree testing on the testing set are summarized in table 1. To ensure that the small size of the testing set had not biased the testing results, all growing, pruning, and refitting steps were repeated 10 times, reallocating all patients to GS, PS, and TS each time, and the resulting trees were compared. Except for 2 runs in which the resulting trees differed in some bottom nodes, the resulting trees were almost the same in structure and leaf labels. Testing the resulting trees on their corresponding testing sets gave AUCs ranging from 0.89 to 0.96. The ROC plot is shown in figure 4.
Figure 4: ROC for classification of cases in the testing set by the fuzzy decision tree.
9 DISCUSSIONS
The performance measures of the final tree were acceptable for clinical use. The positive likelihood ratio of nearly 10, together with the negative likelihood ratio of nearly 0.1, indicates the high accuracy of the resulting fuzzy decision tree. More importantly, the tree could be translated into an equivalent set of fuzzy if-then rules; each rule is formed by the conjunction of all conditions on the path from the root to a leaf. Because the decision tree had 11 leaves, it was translated into 11 fuzzy if-then rules. The rules were then interpreted and criticized by domain experts. This process of interpretation and amendment of the rules by clinicians is the main advantage of this method.
Table 1: Decision tree testing results.
Performance measure Value
Mean squared error 0.1195
AUC of ROC curve 0.9092
Output class cutoff 0.3596
Positive likelihood ratio 9.0000
Negative likelihood ratio 0.1111
Sensitivity 90.00%
Specificity 90.00%
A sample of the rules extracted from the tree is:
“If patient’s CA125 level is low, and the lesion
internal wall is not smooth, and color score is low,
and the cyst content is hemorrhagic, then the lesion
is benign.”
where italic words are linguistic fuzzy variables
defined over continuous attributes. The comment of
a domain expert on the above rule was:
“This rule is right, because a hemorrhagic cyst
having low blood flow (low color score) and low
CA125 level would be a hemorrhagic functional
ovarian cyst which is a benign lesion.”
Likewise, all rules were interpreted by clinicians,
and some rules were amended by clinicians based on
clinical knowledge.
9.1 Inductive Bias
An approximation to the inductive bias of decision
tree learning is (Mitchell, 1997):
“Smaller trees are preferred over larger trees. Trees
that place highly purifying attributes closer to the
root are preferred over those that do not.”
While selecting an attribute for a test node, decision tree learning algorithms consider only the immediate consequences of the selection, and cannot anticipate further consequences in the successor nodes.
Additionally, when the learning algorithm faces the
consequences of the bad selections in the ancestor
nodes (such as facing nodes which cannot be further
purified using any attribute), it never backtracks to
reconsider its previous choices. Therefore, these
learning algorithms are susceptible to the usual
drawback of simple-to-complex searching for
hypotheses without backtracking: selecting locally
optimal solutions which are not globally optimal.
All backfitting algorithms designed for decision tree
learning can just tune parameters of decision trees,
but cannot amend their structure. Pruning algorithms
are just able to contract overfitted nodes, but are not
able to reconsider various generations of nodes to
find the globally optimal decision tree. In fact,
amending the structure of decision trees by
simultaneously considering various generations of
nodes for finding a globally optimal decision tree is
too complex to be done by simple algorithms.
While the inductive bias of decision tree learning cannot easily be overcome by simple algorithms, it can be mitigated with the aid of human experts who amend the tree or its equivalent rules. Currently, the complex reasoning process of a clinician cannot be simulated by any algorithm. The main advantage of decision trees is their explicit and easy-to-understand nature, as well as their ability to be translated into equivalent if-then rules. The whole point was to build the decision tree with an artificial learner, which excels at analyzing high-dimensional data, and then to have the built tree (or its equivalent rules) amended by human experts, who excel at interpreting and criticizing rules.
10 CONCLUSIONS
Decision trees are easy-to-interpret for clinicians,
and fuzzy reasoning is a more general approach for
managing uncertainty than probability theory. We
proposed that a combination of decision trees and
fuzzy reasoning would result in a robust and
accurate classification method.
The performance results of the tree are acceptable,
with positive likelihood ratio of near 10 and negative
likelihood ratio of near 0.1 for diagnosing
malignancy. This model has minimal restriction bias, the problem of overfitting is addressed in the pruning step, and the problem of model preference bias was minimized by enlisting human experts to amend the extracted rules.
Eleven fuzzy if-then rules were extracted from the tree and were interpreted and amended by clinicians. These rules are ready to be used in clinical practice guidelines, as well as to be implemented in an expert system for the management of patients with an adnexal mass.
REFERENCES
Hoffman, M. S., 2009. Overview of the evaluation and management of adnexal masses. In: Mann, W. J. & Goff, B. (eds.) UpToDate. 17.3 ed. Waltham: UpToDate Inc.
Mann, W. J., Chalas, E. & Valea, F. A., 2009. Epithelial ovarian cancer: initial surgical management. In: Goff, B. (ed.) UpToDate. 17.3 ed. Waltham: UpToDate Inc.
Mingers, J., 1989. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4, 227-243.
Mitchell, T. M., 1997. Decision tree learning. In: Mitchell, T. M. (ed.) Machine Learning. 1st ed. Columbus: McGraw-Hill.
Myers, E. R., Bastian, L. A., Havrilesky, L. J., Kulasingam, S. L., Terplan, M. S., Cline, K. E., Gray, R. N. & McCrory, D. C., 2006. Management of Adnexal Mass. Evidence Report/Technology Assessment No. 130 (prepared by the Duke Evidence-based Practice Center under contract no. 290-02-0025). AHRQ Publication No. 06-E004. Rockville, MD: Agency for Healthcare Research and Quality.
Olaru, C. & Louis, W., 2003. A complete fuzzy decision tree technique. Fuzzy Sets and Systems, 138, 221-254.
Schaffer, J. I., 2008. Epithelial ovarian cancer. In: Schorge, J. O., Schaffer, J. I., Halvorson, L. M., Hoffman, B. L., Bradshaw, K. D. & Cunningham, F. G. (eds.) Williams Gynecology. 1st ed. Dallas: McGraw-Hill.
Timmerman, D., Valentin, L., Bourne, T. H., Collins, W. P., Verrelst, H. & Vergote, I., 2000. Terms, definitions and measurements to describe the sonographic features of adnexal tumors: a consensus opinion from the International Ovarian Tumor Analysis (IOTA) Group. Ultrasound in Obstetrics & Gynecology, 16, 500-505.