belong to the learning sample.
The presence of noise in the data changes the above formulation of the generalization problem both at the stage of building decision rules and at the stage of object classification. First of all, the original learning sample K is replaced by a sample K' in which distorted or missing feature values occur with a certain probability. We consider the solution of the concept generalization problem using decision tree methods (Quinlan, 1986, 1996), including binary decision trees (Breiman et al., 1984).
3 GENERALIZATION ALGORITHMS BASED ON DECISION TREES
The purpose of this paper is to study the influence of noise on the operation of generalization algorithms that build decision trees.
A decision tree T is a tree in which each non-final node checks some condition, and each final node (leaf) outputs a decision for the element under consideration. To classify a given example, we start at the root node and move along the decision tree from the root towards the leaves until a final node (a leaf) is reached. In each non-final node, one of the conditions is verified; depending on the result of the verification, the corresponding branch is chosen for further movement along the tree. The decision is obtained when a final node is reached. A decision tree may also be transformed into a set of production rules.
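The traversal described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the dict-based tree structure and the attribute names are hypothetical.

```python
def classify(tree, example):
    """Walk from the root to a leaf, at each non-final node following
    the branch that matches the checked attribute's value."""
    while isinstance(tree, dict):          # non-final node: a condition check
        attribute = tree["attribute"]
        tree = tree["branches"][example[attribute]]
    return tree                            # final node (leaf): the decision

# Hypothetical tree over illustrative weather attributes.
tree = {"attribute": "outlook",
        "branches": {"sunny": {"attribute": "humidity",
                               "branches": {"high": "no", "normal": "yes"}},
                     "overcast": "yes",
                     "rain": "no"}}

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```

Each root-to-leaf path in such a tree corresponds to one production rule, e.g. "if outlook = sunny and humidity = normal then yes".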
The effect of noise on the operation of generalization algorithms has been studied on the basis of a comparative analysis of two well-known algorithms, C4.5 and CART.
The algorithm C4.5, like its predecessor ID3 proposed by J.R. Quinlan (Quinlan, 1986, 1996), belongs to the class of algorithms that build classifying rules in the form of decision trees. However, C4.5 works better than ID3 and has a number of advantages:
- numerical (continuous) attributes are supported;
- nominal (discrete) values of a single attribute may be grouped to perform more effective checks;
- after inductive tree building, the tree is pruned using a test set to increase classification accuracy.
The algorithm C4.5 is based on the following recursive procedure:
1. An attribute is selected for the root node of the tree T, and branches are formed for each possible value of this attribute.
2. The tree is used to classify the examples of the learning set. If all examples in some leaf belong to the same class, the leaf is marked with the name of this class.
3. If all leaves are marked with class names, the algorithm terminates. Otherwise, a node is marked with the name of the next attribute, branches are created for each possible value of this attribute, and the procedure returns to step 2.
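The procedure above can be sketched as a recursive build, stopping when a node's examples all share one class. This is an illustrative skeleton, not Quinlan's actual implementation; the `choose` parameter stands for the attribute-selection criterion and the data structures are assumptions.

```python
from collections import Counter

def build_tree(examples, attributes, choose):
    """Recursively build a decision tree: pick an attribute via `choose`,
    branch on each of its observed values, and stop when all examples
    in the node belong to the same class.
    `examples` is a list of (feature_dict, class_label) pairs."""
    classes = [cls for _, cls in examples]
    if len(set(classes)) == 1:                 # leaf: a single class remains
        return classes[0]
    if not attributes:                         # no attribute left: majority class
        return Counter(classes).most_common(1)[0][0]
    best = choose(examples, attributes)
    branches = {}
    for value in {ex[best] for ex, _ in examples}:
        subset = [(ex, cls) for ex, cls in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        branches[value] = build_tree(subset, rest, choose)
    return {"attribute": best, "branches": branches}
```

In C4.5 the role of `choose` is played by the gain-ratio criterion discussed next.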
The criterion for choosing the next attribute is the gain ratio, based on the concept of entropy (Quinlan, 1996).
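As a sketch of this criterion, the gain ratio can be computed as the information gain of a split divided by its split information, both expressed through Shannon entropy. The data layout below is a hypothetical assumption for illustration, not the paper's code.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a multiset of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(examples, attribute):
    """Gain ratio = information gain / split information.
    `examples` is a list of (feature_dict, class_label) pairs."""
    labels = [cls for _, cls in examples]
    n = len(examples)
    groups = {}                                  # attribute value -> labels
    for features, cls in examples:
        groups.setdefault(features[attribute], []).append(cls)
    gain = entropy(labels) - sum(len(g) / n * entropy(g)
                                 for g in groups.values())
    split_info = entropy([features[attribute] for features, _ in examples])
    return gain / split_info if split_info else 0.0
```

Dividing by the split information penalizes attributes with many values, which plain information gain (as in ID3) tends to favor.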
The CART algorithm (Breiman et al., 1984) builds a binary decision tree, in which each node has two descendants. At each step of tree construction, a rule that splits the set of examples from the learning sample into two subsets is assigned to the current node: the first subset contains the examples for which the rule holds, and the second the examples for which it does not. Accordingly, two descendant nodes are formed for the current node, and the procedure is repeated recursively until a tree is obtained in which the examples assigned to each final node (tree leaf) belong to a single class.
The most difficult problem in the CART algorithm is the selection of the best checking rules in the tree nodes. To choose the optimal rule, the function assessing the quality of a partition of the learning set introduced in (Breiman et al., 1984) is used. An important distinction of CART from other decision tree building algorithms is its use of a tree pruning mechanism. The pruning procedure is necessary to obtain a tree of optimal size with a small probability of erroneous classification.
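A common instance of such a partition-quality assessment is the Gini impurity; the sketch below assumes it as the measure and tries simple rules of the form "attribute equals value", which is a simplification of the rule family considered in (Breiman et al., 1984).

```python
from collections import Counter

def gini(labels):
    """Gini impurity: the probability that two examples drawn at random
    from the node carry different class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(examples, attributes):
    """Pick the binary rule `attr == value` minimising the weighted
    Gini impurity of the two descendant subsets.
    `examples` is a list of (feature_dict, class_label) pairs."""
    n = len(examples)
    best = None
    for attr in attributes:
        for value in {ex[attr] for ex, _ in examples}:
            left = [cls for ex, cls in examples if ex[attr] == value]
            right = [cls for ex, cls in examples if ex[attr] != value]
            if not left or not right:          # rule does not split the set
                continue
            score = (len(left) / n * gini(left)
                     + len(right) / n * gini(right))
            if best is None or score < best[0]:
                best = (score, attr, value)
    return best  # (weighted impurity, attribute, value), or None
```

A weighted impurity of zero means the rule separates the classes perfectly, so both descendant nodes become leaves.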
4 NOISE MODELS
Assume that the examples in a learning sample contain noise, i.e., attribute values may be missing or distorted. Noise arises from the following causes: incorrect measurement of input parameters; wrong description of parameter values by an expert; the use of damaged measurement devices; and data loss during transmission and storage of the information (Mookerjee et al., 1995). Our purpose is to study the effect of noise on the functioning of the C4.5 and CART algorithms.
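A noise model of the kind described can be sketched as follows: each attribute value is corrupted with probability p, becoming either a randomly chosen wrong value or a missing-value marker. The even split between distortion and loss, and all names, are illustrative assumptions rather than the paper's exact model.

```python
import random

def add_noise(examples, p, values_by_attr, missing=None, seed=0):
    """Return a noisy copy of `examples`: with probability p each
    attribute value is replaced, half the time by a random wrong value
    (distortion) and half the time by `missing` (a lost value).
    `values_by_attr` maps each attribute to its domain of values."""
    rng = random.Random(seed)
    noisy = []
    for features, cls in examples:
        corrupted = dict(features)
        for attr, value in features.items():
            if rng.random() < p:
                if rng.random() < 0.5:         # distorted value
                    others = [v for v in values_by_attr[attr] if v != value]
                    corrupted[attr] = rng.choice(others) if others else value
                else:                          # missing value
                    corrupted[attr] = missing
        noisy.append((corrupted, cls))
    return noisy
```

Applying this transformation to the learning sample K yields the noisy sample K' mentioned above, with p controlling the noise level.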
One of the basic parameters of the research is the noise