Let us apply these results to the comparison of different algorithms, using the test results presented in Table 7 for Fisher's exact test. The results of the comparison are presented in Table 11.
From the analysis of this table we see that the statistical evidence is not strong enough to claim that algorithm A1 is necessarily equivalent to algorithm B1. The evidence is even weaker for the claimed equivalence of algorithms B1 and C1. In all other cases the evidence that the considered algorithms are equivalent is very strong. It is worth noting that under the classical statistical interpretation we would not reject the hypothesis of the equivalence of the compared algorithms in any of the considered cases.
Possibilistic comparisons are not necessary when the null and alternative hypotheses are, as in the particular cases considered in this paper, complementary. In such a case, strong evidence in favour of the null hypothesis automatically means weak support for its complementary alternative.
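In the notation of possibility theory this is simply the duality between the necessity measure N and the possibility measure Π; as a standard identity of the theory (not specific to this paper), for complementary hypotheses H1 = ¬H0 we have
\[
N(H_0) = 1 - \Pi(H_1),
\]
so a high necessity of the null hypothesis forces a low possibility of its alternative.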
Table 11: Possibilistic comparison of different algorithms.

Comparison    PD      PSD, NSD
A1 vs. B1     0.908   0
A1 vs. C1     1       0.836
B1 vs. C1     0.428   0
A2 vs. B2     1       0.816
A2 vs. C2     1       0.982
B2 vs. C2     1       0.602
In general, however, this need not be the case. Consider, for example, a test of the equivalence of a new classification algorithm against two alternatives representing the known results of other algorithms. We want to know which of those algorithms our new algorithm is similar to with respect to its efficiency. Consider, for example, the problem of the classification of wheat kernels described in (Charytanowicz et al., 2010). Two algorithms, namely QDA and CRT, have been used on large samples of data. The results of those experiments have been used to estimate the class probabilities presented in Table 12.
Table 12: Wheat kernels - probabilities of classes.

Alg.\Class    1       2       3       4
QDA         0.319   0.310   0.314   0.057
CRT         0.300   0.324   0.310   0.066
Test results for the new algorithm are described by the following vector: (29, 29, 32, 15). The comparison of this result with the probabilities obtained by the QDA algorithm, performed according to the methodology presented in the third section, gives a very small p-value equal to 0.002. A similar comparison with the probabilities obtained by the CRT algorithm also yields a very small p-value, equal to 0.018.
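These p-values can be approximately reproduced with an asymptotic chi-square goodness-of-fit test against the class probabilities of Table 12. The sketch below is one plausible computation, not the exact test of the third section: the asymptotic approximation gives p ≈ 0.002 for QDA and p ≈ 0.016 for CRT, so the reported 0.018 presumably comes from the exact version of the test.

```python
from scipy.stats import chisquare

observed = [29, 29, 32, 15]        # classification results of the new algorithm
n = sum(observed)                  # 105 classified kernels in total

# class probabilities taken from Table 12
class_probs = {
    "QDA": [0.319, 0.310, 0.314, 0.057],
    "CRT": [0.300, 0.324, 0.310, 0.066],
}

for name, probs in class_probs.items():
    # expected counts under the hypothesis that the new algorithm
    # behaves exactly like the reference algorithm
    stat, pval = chisquare(observed, f_exp=[n * p for p in probs])
    print(f"{name}: chi2 = {stat:.2f}, p-value = {pval:.4f}")
# prints approximately:
# QDA: chi2 = 14.60, p-value = 0.0022
# CRT: chi2 = 10.35, p-value = 0.0158
```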
Using (25)-(27) we can calculate possibilistic indices showing that our algorithm is closer to the CRT algorithm than to the QDA algorithm. The results are the following: PD = 1, PSD = 0.036, NSD = 0. The necessity measure that the new algorithm is more similar to CRT than to QDA is equal to zero. Thus, the obtained statistical data do not allow us to exclude that our algorithm is more similar to QDA than to CRT. However, the possibility indices show that it is fully possible (PD = 1) that the efficiency of the new algorithm is similar to the efficiency of both other algorithms, but only slightly possible (PSD = 0.036) that the new algorithm is more similar to CRT than to QDA.
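Formulas (25)-(27) are not reproduced in this section. As a generic illustration only, the sketch below implements the classical Dubois-Prade dominance indices between two fuzzy numbers, the family of indices on which possibilistic comparisons of this kind are typically built; the function name, the grid discretization, and the triangular membership functions in the usage example are assumptions of the sketch, not the paper's construction.

```python
import numpy as np

def dominance_indices(mu_a, mu_b):
    """Discretized Dubois-Prade indices for fuzzy numbers A and B whose
    membership values mu_a, mu_b are sampled on a common increasing grid:
    PD = Poss(A >= B), PSD = Poss(A > B), NSD = Nec(A > B)."""
    n = len(mu_a)
    # PD = sup_{x >= y} min(mu_A(x), mu_B(y))
    pd = max(min(mu_a[i], mu_b[j]) for i in range(n) for j in range(i + 1))
    # PSD = sup_x min(mu_A(x), 1 - sup_{y >= x} mu_B(y))
    psd = max(min(mu_a[i], 1 - max(mu_b[i:])) for i in range(n))
    # NSD = 1 - Poss(B >= A)   (possibility-necessity duality)
    nsd = 1 - max(min(mu_b[i], mu_a[j]) for i in range(n) for j in range(i + 1))
    return pd, psd, nsd

def triangular(x, a, b, c):
    """Triangular membership function with support [a, c] and mode b."""
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

# hypothetical usage: compare two triangular fuzzy numbers on [0, 1]
grid = np.linspace(0.0, 1.0, 201)
mu_a = triangular(grid, 0.3, 0.5, 0.7)
mu_b = triangular(grid, 0.2, 0.4, 0.6)
print(dominance_indices(list(mu_a), list(mu_b)))   # ~(1.0, 0.75, 0.25)
```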
The applicability of the proposed possibilistic measures is even greater when we drop the assumption that the 'expert' indicates only one 'true' class. This is always the case when the role of the 'expert' is played by a fuzzy clustering algorithm. In all such cases we have to use the methodology of fuzzy statistics, an overview of which can be found, e.g., in (Gil and Hryniewicz, 2009).
5 CONCLUSIONS
In this paper we have considered the problem of the evaluation and comparison of different classification algorithms. For this purpose we have applied the methodology of statistical tests for the multinomial distribution. We restricted our attention to the case of supervised classification, in which an external 'expert' evaluates the correctness of the classification. The results of the proposed statistical tests are interpreted using the possibilistic approach introduced in (Hryniewicz, 2006). This approach will be even more useful, or indeed indispensable, when we consider more complicated statistical tests and imprecise statistical data. We will face such problems when we adapt the methodology presented in this paper to the case of fuzzy classifiers.
The future development of the proposed methodology should concentrate on two general problems. First, we should compare the results of classification with 'better' alternatives; the meaning of the word 'better' in the considered context requires further investigation. The same can be said of fuzzy classifiers built using supervised and unsupervised learning.