1. Extract a set of subsamples (Number_Samples) from the original training set using the desired resampling technique (Resampling_Mode): size of the subsamples (100%, 75%, 50%, etc., relative to the original set's size), with or without replacement, stratified or not, etc.
2. The final tree is built node by node in preorder. Each consolidated node is built in the following way (a minimal code sketch of the voting steps follows this list):
a. For each subsample, induce the variable that
would be used to make the split at that level of
the partial tree; in the example (B, F,…,B).
b. Analyse the number of partial trees that propose to make a split and decide, based on the established criterion Crit_Split (e.g. simple majority, absolute majority, etc.), whether to split or not. If the decision is not to split (a leaf node is created), jump to the next node to consolidate and go to step 2a.
c. Analyse the number of votes obtained by the most voted variable (votes of variable B in Figure 1). If, based on Crit_Variable, the variable does not have enough votes, consolidate the node as a leaf node and go to step 2a. When the number of votes is sufficient, this variable will be the one used to split the consolidated node.
d. Decide the branches the node to split will have, based on the Crit_Branches criterion. If the variable to split is continuous, determine the cutting point (e.g. using the mean or the median of the values proposed for that variable). If the variable to split is discrete, decide the set of categories for each branch (e.g. a branch for each category, using heuristics such as C4.5's subset option, etc.).
e. Force the agreed split (variable and stratification) in every tree associated to each subsample (every partial tree in the example is forced to make the split with the consolidated variable B). Jump to the next node and go to step 2a.
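As an illustration, the following is a minimal Python sketch of steps 1 and 2a-2c under simplifying assumptions: each partial tree's proposal at the current node is reduced to the name of the variable it would split on (or None when it would create a leaf), and both criteria are modelled as vote fractions. The function names (extract_subsamples, consolidate_node) are placeholders of ours, not part of the original implementation, and stratified resampling is not modelled.

    import random
    from collections import Counter

    def extract_subsamples(training_set, number_samples, size_pct=100, replacement=True):
        """Step 1: draw Number_Samples subsamples from the training set.
        Only two Resampling_Mode options (size and replacement) are modelled."""
        k = max(1, len(training_set) * size_pct // 100)
        draw = random.choices if replacement else random.sample
        return [draw(training_set, k=k) for _ in range(number_samples)]

    def consolidate_node(proposals, crit_split=0.5, crit_variable=0.5):
        """Steps 2a-2c: decide one consolidated node from the split variables
        proposed by the partial trees; None means a tree proposes a leaf.
        Returns the agreed split variable, or None to consolidate a leaf."""
        n = len(proposals)
        # Step 2b (Crit_Split): do enough partial trees propose any split?
        if sum(p is not None for p in proposals) / n < crit_split:
            return None
        # Step 2c (Crit_Variable): has the most voted variable enough support?
        variable, votes = Counter(p for p in proposals if p is not None).most_common(1)[0]
        if votes / n < crit_variable:
            return None
        return variable  # steps 2d-2e then fix the branches and force this split

    # Proposals as in the example of the text: (B, F, ..., B)
    print(consolidate_node(['B', 'F', 'B', 'B', None]))  # -> 'B'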
The different decisions can be made by voting, weighted voting, etc. Once the consolidated tree has been built, its behaviour is similar to that of the base classifier used. Section 5 will show that the trees built using this methodology have similar discriminating capability (the differences are not statistically significant), but they are structurally more stable and less complex. In order to analyse this second aspect, we have defined the structural diversity measure presented in the next section.
3 STRUCTURAL DIVERSITY MEASURE
This section defines the diversity measure, or structural distance, that will allow us to analyse the stability of the consolidated trees and compare them to the standard ones. The aim is to measure the heterogeneity existing in sets of trees built using each of the methodologies to be compared. The degree of structural diversity in a group is estimated by analysing the structural differences between each possible pair of trees in the group and calculating the average of the differences obtained (see the sketch below).
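For instance, assuming a function structural_distance(t_i, t_j) that implements the SD measure defined next (the function name is a placeholder of ours), the group-level estimate amounts to averaging over all pairs:

    from itertools import combinations

    def average_structural_diversity(trees, structural_distance):
        """Average structural distance over every possible pair of trees in
        the group; structural_distance maps a pair of trees to a number."""
        pairs = list(combinations(trees, 2))
        return sum(structural_distance(t_i, t_j) for t_i, t_j in pairs) / len(pairs)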
The defined metric or distance (Structural_Distance, SD) is based on a vector (M0, M1, M2) with three values used to compare two trees (Ti, Tj). Both trees are traversed in preorder, node by node, and the corresponding split variables are compared to determine whether they match. The three components are calculated in the following way:
• M0: number of common nodes in Ti and Tj. We understand as common nodes those that, being in the same position in both trees, make the split based on the same variable.
• M1: measures the number of times that, traversing a common branch, one tree makes a split in a node and the other does not. Each increment is weighted according to the complexity of the subtree beginning at this node.
• M2: measures the number of times that, traversing a common branch and arriving at a common node, the variables chosen to make the split in the two trees are different. Each increment is weighted according to the complexity of both subtrees.
The pseudo-code of the algorithm used for calculating each of the components of the proposed measure is given in the Appendix.
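As an informal complement to that pseudo-code, the following sketch traverses two trees in preorder and accumulates the three components. The tree representation (nested dicts with var=None for leaves) and the use of subtree size as the complexity weighting are simplifying assumptions of ours; the exact weighting is the one defined in the Appendix.

    def subtree_complexity(node):
        """Number of nodes in the subtree: a plausible complexity weighting."""
        if node['var'] is None:
            return 1
        return 1 + sum(subtree_complexity(c) for c in node['children'])

    def compare_trees(node_i, node_j, counts):
        """Preorder traversal of Ti and Tj accumulating M0, M1 and M2."""
        leaf_i, leaf_j = node_i['var'] is None, node_j['var'] is None
        if leaf_i and leaf_j:
            return                                     # both leaves: nothing to count
        if leaf_i or leaf_j:
            inner = node_i if leaf_j else node_j
            counts['M1'] += subtree_complexity(inner)  # one splits, the other does not
        elif node_i['var'] == node_j['var']:
            counts['M0'] += 1                          # common node: same variable, same position
            for c_i, c_j in zip(node_i['children'], node_j['children']):
                compare_trees(c_i, c_j, counts)        # descend the common branches
        else:
            # different split variables at a common position: weight by both subtrees
            counts['M2'] += subtree_complexity(node_i) + subtree_complexity(node_j)

    leaf = {'var': None, 'children': []}
    t_i = {'var': 'B', 'children': [leaf, {'var': 'F', 'children': [leaf, leaf]}]}
    t_j = {'var': 'B', 'children': [leaf, leaf]}
    counts = {'M0': 0, 'M1': 0, 'M2': 0}
    compare_trees(t_i, t_j, counts)
    print(counts)  # {'M0': 1, 'M1': 3, 'M2': 0}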
The definitions show that, to increase the value of M0, the same variable has to appear in the same node, so the variable used to make the split must have been chosen at the same level in both trees. Once a different split appears, the remaining subtrees of the compared trees affect the value of M1 or M2. This is important because in the (top-down) tree construction process the variables used to split a node are selected depending on their statistical importance (entropy, p-value, ...). As a consequence, this measure of similarity/diversity also takes into account when making the