5 DISCUSSION
As we have seen in the previous section, automatic
and interactive algorithms give nearly the same
results in terms of accuracy and tree size, but
what about the comprehensibility of these results?
Let us take two examples. With the diabetes data set,
the best tree size is 2 leaves (OC1). The result of this
algorithm is a tree with only one split: a
14-dimensional hyper-plane (with accuracy equal to
85.9%). OC1 performs real oblique cuts in the data
space. To get the same accuracy, CIAD needs to
perform eight more splits. These splits are also
"oblique" cuts, but they are only 2D-oblique cuts.
The hyper-plane obtained with OC1 is a
14-dimensional one: the result is an equation such as:
$a_1 x_1 + a_2 x_2 + \dots + a_{14} x_{14} + a_{15} = 0$. How can we
interpret this result? A decision tree with only
splits of the form y=ax+b or x=a is obviously more
understandable (especially if the user is not a data
mining or data analysis expert but the data expert).
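To make the contrast concrete, here is a minimal Python sketch of the three kinds of split test discussed above. It is an illustration only: the function names and the coefficient values in the usage note are assumptions, not code or results from OC1 or CIAD.

```python
from typing import Sequence


def oblique_split(x: Sequence[float], a: Sequence[float]) -> bool:
    """Hyper-plane test: a_1*x_1 + ... + a_n*x_n + a_{n+1} > 0.

    `a` holds one weight per attribute followed by a constant term,
    so len(a) must equal len(x) + 1.
    """
    if len(a) != len(x) + 1:
        raise ValueError("expected one coefficient per attribute plus a constant")
    # zip stops at the shorter sequence, so the constant a[-1] is added separately.
    return sum(ai * xi for ai, xi in zip(a, x)) + a[-1] > 0


def line_split_2d(x: Sequence[float], i: int, j: int,
                  slope: float, intercept: float) -> bool:
    """2D oblique test y > slope*x + intercept, using attribute i as x and j as y."""
    return x[j] > slope * x[i] + intercept


def axis_split(x: Sequence[float], i: int, threshold: float) -> bool:
    """Axis-parallel test x_i > threshold (a split of the form x = a)."""
    return x[i] > threshold
```

For a hypothetical point `[1.0, 2.0, 3.0]`, `oblique_split(point, [1.0, -1.0, 0.5, -0.2])` combines every attribute in one test, whereas `line_split_2d(point, 0, 1, 1.0, 0.5)` and `axis_split(point, 2, 3.5)` each involve at most two attributes and can be drawn directly in a 2D scatter plot.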
The other interesting result is that of C4.5 with
the satimage dataset: 85.2% accuracy with a very
large tree size. Here again there is one question we
can ask: how can such a decision tree be interpreted?
Would it not be better to have a smaller tree with a
lower accuracy? (We are not talking about an
over-fitting problem here.) An advantage of interactive
decision tree construction algorithms is that the user
can stop the decision tree construction whenever he
wishes. He only has to make the current tree node a
leaf instead of trying to divide it further
to get a better accuracy. Of course, this
can also be achieved with automatic algorithms: it
is the role of the very important but rarely
discussed parameter tuning. Most of the time, this
parameter tuning is a matter for the data mining or
data analysis expert.
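The effect of such stopping parameters can be sketched as follows. This is a deliberately small greedy tree builder with axis-parallel splits, written for illustration only: it is not C4.5, OC1 or CIAD, and the parameter names `max_depth` and `min_samples` are assumptions chosen for the example.

```python
from collections import Counter


def majority(labels):
    """Most frequent class label in a list."""
    return Counter(labels).most_common(1)[0][0]


def errors(labels):
    """Number of points not belonging to the majority class."""
    return len(labels) - max(Counter(labels).values())


def best_threshold(rows, labels):
    """Return (attribute, threshold, errors) for the split x[i] <= t that
    misclassifies the fewest training points, or None if no split exists."""
    best = None
    for i in range(len(rows[0])):
        # The largest value is skipped: it would leave the right side empty.
        for t in sorted({r[i] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[i] <= t]
            right = [y for r, y in zip(rows, labels) if r[i] > t]
            e = errors(left) + errors(right)
            if best is None or e < best[2]:
                best = (i, t, e)
    return best


def build(rows, labels, depth=0, max_depth=3, min_samples=2):
    """Grow a tree; a node becomes a leaf when it is pure or when a
    stopping parameter (max_depth, min_samples) is reached."""
    split = None
    if len(set(labels)) > 1 and depth < max_depth and len(rows) >= min_samples:
        split = best_threshold(rows, labels)
    if split is None:
        return {"leaf": majority(labels)}
    i, t, _ = split
    left = [(r, y) for r, y in zip(rows, labels) if r[i] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[i] > t]
    return {
        "attr": i,
        "thr": t,
        "left": build([r for r, _ in left], [y for _, y in left],
                      depth + 1, max_depth, min_samples),
        "right": build([r for r, _ in right], [y for _, y in right],
                       depth + 1, max_depth, min_samples),
    }


def count_leaves(node):
    """Tree size measured as the number of leaves."""
    if "leaf" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])
```

Lowering `max_depth` plays the same role as the user turning the current node into a leaf: the tree is smaller and easier to read, at the price of some misclassified points. The difference is that the automatic version requires choosing the parameter value in advance.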
These two examples illustrate some of the benefits
of the visual data mining approach. But this kind of
approach does not have only advantages, and several
problems must be solved before it becomes really
useful for the data expert. Among these problems
are:
- the data expert does not necessarily have enough
background in statistics, data analysis or data
mining to make the correct choices during the
KDD process. A simple example is finding the best
algorithm to use for a given data set and
problem. To address this problem it is
necessary to provide the user with help mechanisms
able to guide him in all the choices made in the
KDD process. These mechanisms must be able to
deal with new data sets or new algorithms and must
learn from the new results obtained.
- all the visual data mining algorithms are based on a
graphical representation of the data. The size of the
data sets treated is limited by the screen size and by
human perception capacities. How do we deal with
very large data sets containing at least a billion
n-dimensional data points, as automatic algorithms
already do (Poulet and Do, 2003)? One solution
could be to use a higher-level representation of the
data instead of the data themselves. This is the topic
addressed by symbolic data analysis (Bock and
Diday, 2000).
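As an illustration of what a higher-level representation can look like, the sketch below replaces each group of points by one interval per attribute, so that only one object per group has to be displayed. This is a minimal example under stated assumptions: the grouping key (for instance a class label or a cluster identifier) and the function name are hypothetical, and interval descriptions are only one of the representations studied in symbolic data analysis.

```python
from collections import defaultdict


def summarise_by_interval(points, groups):
    """Replace each group of n-dimensional points by per-attribute
    (min, max) intervals.

    points: sequence of equal-length numeric tuples
    groups: one group key per point (assumed given, e.g. a class label)
    """
    buckets = defaultdict(list)
    for p, g in zip(points, groups):
        buckets[g].append(p)
    # zip(*members) transposes the group so each `col` is one attribute.
    return {
        g: [(min(col), max(col)) for col in zip(*members)]
        for g, members in buckets.items()
    }
```

With this summary, the number of objects to draw depends on the number of groups rather than on the number of data points. Choosing groups that keep the intervals informative is a separate problem that the sketch does not address.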
6 CONCLUSION AND FUTURE WORK
All the tools presented in this paper have been
developed in C/C++ (on PC and SGI-O2) using only
open-source libraries. In this paper we have
presented work that tries to give visualisation a
more important role in the data mining
process. This can be achieved in several ways:
- in a cooperative approach, with visualisation and
automatic tools working together, for example to
improve the results or comprehensibility of
automatic algorithms with a graphical pre- or
post-processing step,
- by replacing the automatic algorithms usually used
with interactive ones, like the interactive decision tree
construction algorithm presented.
The most important point in this approach is that the
user of the system is the data specialist and no
longer the data mining or data analysis expert. This
has the following advantages:
- the comprehensibility of the constructed model and
the confidence in it are increased because the user has
participated in its creation,
- we can use the domain knowledge throughout the whole
process,
- we can use human pattern recognition
capabilities to overcome some of the computational
complexity.