2.1 Nondeterministic Data and Overfitting
Variables other than those recorded in the dataset may also affect the output. This is the usual situation when modeling medical problems, where the model has to predict the output from attributes none of which directly determine the output (nor are directly caused by it). In this study, the only attribute that can directly determine malignancy is the pathology result, but it cannot be used as a model input, because measuring it requires surgery and is invasive. As a result, the model has to predict malignancy from attributes that are neither directly caused by the malignancy nor have a direct effect on it, but are observed to interact with it (e.g. malignant tumors often, but not always, are larger).
When the data are nondeterministic in this way, the output cannot be predicted exactly from any combination of the attributes. Even the best possible model will therefore retain some degree of inaccuracy, called the residual error.
When residual error is present, even the best attributes in the final test nodes cannot produce completely pure children. If the learning algorithm does not recognize the residual error, it keeps trying to create completely pure leaves, which is impossible with any attribute. The algorithm therefore keeps splitting nodes recursively, leaving only a small number of cases in the bottom nodes of an excessively grown tree. At this stage, because so few cases reach each test node, there is a high probability that at least one of the attributes happens to take different values for cases of different classes. Selecting such an attribute correctly separates the training cases that reach that node, but the separation is achieved by chance, not by any genuine relation between the attribute and the class. When the tree is later tested on a separate dataset, the same coincidence is unlikely to recur, so the cases reaching that node are misclassified. A learning algorithm that behaves this way selects irrelevant attributes for many bottom test nodes, producing a tree that is overfitted to the training dataset.
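This chance-separation effect can be made concrete with a small simulation (not taken from the paper; the node sizes, the number of candidate attributes and the use of binary attributes are illustrative assumptions): with only a few cases left in a node and many attributes to choose from, the probability that at least one class-independent attribute separates the node's cases perfectly is high.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_chance_separation(n_cases=6, n_attributes=20, n_trials=10_000):
    """Estimate how often at least one class-independent binary attribute
    happens to separate a small node's cases perfectly by class."""
    labels = np.array([0] * (n_cases // 2) + [1] * (n_cases - n_cases // 2))
    hits = 0
    for _ in range(n_trials):
        # Attribute values are drawn independently of the class labels,
        # i.e. the attributes are irrelevant to the output.
        attrs = rng.integers(0, 2, size=(n_attributes, n_cases))
        # An attribute separates the node "by chance" if it matches the
        # labels (or their complement) for every case in the node.
        separates = (np.all(attrs == labels, axis=1) |
                     np.all(attrs == 1 - labels, axis=1))
        hits += separates.any()
    return hits / n_trials

for n in (4, 6, 8, 12):
    print(n, "cases:", prob_chance_separation(n_cases=n))
```

As the number of cases in a node shrinks, this probability rises quickly, which is exactly the regime an excessively grown tree ends up in.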
To prevent overfitting, the learner has to recognize the residual error and turn a node into a leaf once a level of purity consistent with that error is reached. An alternative approach is to let the tree become overfitted and then post-prune it into an optimal decision tree. The latter approach is used in this study and is introduced in section 5.
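The study's own pruning procedure is described in section 5. Purely as a generic illustration of the overfit-then-prune idea, the sketch below uses scikit-learn's cost-complexity pruning on deliberately noisy synthetic data; the dataset, the pruning method and the ccp_alpha value are illustrative choices, not the paper's method.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y injects label noise, i.e. residual error that no combination
# of attributes can explain away.
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fully grown tree: keeps splitting until its leaves are pure and
# therefore memorises the noise.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Post-pruned tree: grown fully, then the bottom subtrees that only
# model noise are collapsed (here via cost-complexity pruning).
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

for name, tree in (("fully grown", full), ("post-pruned", pruned)):
    print(name, "train:", tree.score(X_tr, y_tr), "test:", tree.score(X_te, y_te))
```

The fully grown tree typically reaches perfect training accuracy but loses accuracy on the held-out data, while the pruned tree trades a little training accuracy for better generalization.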
2.2 Crisp Discretization of Continuous Attributes
If the attributes are continuous rather than binary, the second step of the learning algorithm becomes more elaborate. The learner tests multiple thresholds for the first attribute, sending cases whose attribute value is below the threshold to the left child and cases whose value is above the threshold to the right child. The pooled purity of the children is assessed for each threshold, and the best threshold is selected for that attribute. The same process is then repeated for every attribute, assessing the pooled purity of the children for each threshold of each attribute. The attribute with the best pooled purity, together with its best threshold, is finally assigned to the node.
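A minimal sketch of this exhaustive search is given below. The purity measure (Gini impurity) and the candidate thresholds (midpoints between consecutive observed values) are assumptions made for illustration; the paper does not commit to them at this point.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels (0 = completely pure)."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_crisp_split(X, y):
    """Return the (attribute, threshold) pair whose children have the
    best pooled (case-weighted) purity."""
    n, n_attributes = X.shape
    best_attr, best_thr, best_impurity = None, None, np.inf
    for j in range(n_attributes):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive values.
        for thr in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] < thr], y[X[:, j] >= thr]
            pooled = (len(left) * gini(left) + len(right) * gini(right)) / n
            if pooled < best_impurity:
                best_attr, best_thr, best_impurity = j, thr, pooled
    return best_attr, best_thr

X = np.array([[12.0, 3.1], [18.0, 2.9], [31.0, 3.0], [45.0, 2.8]])
y = np.array([0, 0, 1, 1])
print(best_crisp_split(X, y))   # attribute 0 with threshold 24.5
```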
2.3 Fuzzy Decision Trees
Consider a patient being classified by a conventional decision tree. Each test node tests a single attribute against a single threshold, with two possible outcomes: below the threshold or above it. The attribute space of the node (the local attribute space) is thus split into two non-overlapping subspaces, as shown in figure 1. Patients whose value of the tested attribute is below the threshold go to the left child, while those whose value is above the threshold go to the right child. To be classified, a new patient starts at the root node and is tested sequentially in successive test nodes until a leaf is reached. All patients reaching a given leaf are assigned the class corresponding to that leaf. In summary, each patient follows a single path, reaches a single leaf, and is assigned the class stored in that leaf.
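For concreteness, this single-path crisp classification can be sketched as the loop below; the Node structure is hypothetical and not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A test node (attribute/threshold/children) or a leaf (label only)."""
    attribute: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None

def classify_crisp(case, node):
    """Follow exactly one root-to-leaf path and return that leaf's class."""
    while node.label is None:                      # still a test node
        if case[node.attribute] < node.threshold:  # below threshold -> left
            node = node.left
        else:                                      # above threshold -> right
            node = node.right
    return node.label

tree = Node(attribute=0, threshold=30.0,
            left=Node(label=0), right=Node(label=1))
print(classify_crisp([25.0], tree))   # 0: the case follows the left path only
```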
Instead of defining crisp sets, we can define two fuzzy sets for membership in the left and right children, using a smooth, overlapping fuzzy discriminator function for the continuous attribute tested in each test node (Olaru and Louis, 2003). Each fuzzy test node tests a single attribute using a pair of parameters that characterize the fuzzy discriminator function: the threshold, which is the cutpoint, and the width, which defines the overlapping region of the left and right children. The local attribute space is thus split into two overlapping subspaces.
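One possible discriminator of this kind is sketched below as a piecewise-linear function of the tested attribute; the exact functional form used by the authors is not specified at this point, and the threshold and width values in the example are invented.

```python
import numpy as np

def fuzzy_discriminator(x, threshold, width):
    """Membership degrees of a case in the left and right children.

    Outside the overlap region [threshold - width/2, threshold + width/2]
    the split behaves crisply; inside it, membership changes linearly and
    the case belongs partly to both children (the two degrees sum to 1).
    """
    mu_right = np.clip((x - (threshold - width / 2.0)) / width, 0.0, 1.0)
    return 1.0 - mu_right, mu_right

# Example: a tumour-size test with threshold 30 mm and overlap width 10 mm.
for size in (20.0, 27.5, 30.0, 32.5, 40.0):
    mu_l, mu_r = fuzzy_discriminator(size, threshold=30.0, width=10.0)
    print(f"{size:5.1f} mm -> left {mu_l:.2f}, right {mu_r:.2f}")
```

Cases outside the overlap region are routed to exactly one child, as in a crisp tree; cases inside it are propagated to both children with the corresponding membership degrees.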
In a fuzzy decision tree, a case can therefore be classified by being propagated along multiple paths in the tree and reaching multiple leaves, if the case is situated in the overlapping region of some test nodes. At the