Figure 1: Example of a coding matrix. The decomposition into binary problems can be represented by a matrix M ∈ {−1, 0, +1}^{k×l}, where k is the number of classes and l the number of induced binary problems. If M(c, b) = +1, the examples of class c are considered to be positive examples for the binary classification problem b. If M(c, b) = −1, they are considered negative. If M(c, b) = 0, the examples of class c are not used to train b.

                    binary classifiers
               b1    b2    b3    b4    b5    b6
classes   c1   +1    +1    +1    +1     0    −1
          c2    0    −1    −1    +1    +1    −1
          c3   −1    +1    −1     0    −1    −1
          c4   −1     0    +1    −1    −1    +1
approaches by taking the matrix from the larger set {−1, 0, +1}^{k×l} (Figure 1).
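To make the decomposition concrete, here is a minimal sketch, assuming integer class labels in {0, ..., k−1} and using the example matrix of Figure 1, of how one column of M induces a binary training set; the function and variable names are ours, not the paper's.

    import numpy as np

    # The example coding matrix of Figure 1: k = 4 classes, l = 6 binary problems.
    M = np.array([[+1, +1, +1, +1,  0, -1],
                  [ 0, -1, -1, +1, +1, -1],
                  [-1, +1, -1,  0, -1, -1],
                  [-1,  0, +1, -1, -1, +1]])

    def binary_training_set(X, y, M, b):
        # Relabel the multiclass data (X, y) for binary problem b:
        # examples of class c receive label M[c, b]; classes with
        # M[c, b] == 0 are excluded from this problem's training set.
        labels = M[y, b]
        keep = labels != 0
        return X[keep], labels[keep]

For column b = 4, for instance, class c1 is left out entirely, while c2 provides the positive examples and c3 and c4 the negative ones.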
Since its appearance, the significance of the coding strategy has been brought into question. The
ECOC method was originally motivated by error-
correcting principles, assuming that the learning task
can be modeled as a communication problem, in
which class information is transmitted over a
channel. From the perspective of error-correcting
theory, it is desirable that codewords are far from
each other. Allwein et al (2000) deduced a bound on
the generalization error that confirms this. However,
they noted that this may lead to difficult binary
problems. Guruswami and Sahai (1999) argued that
one reason why the powerful theorems from coding
theory cannot be directly applied to prove stronger
bounds on the performance of the ECOC approach is
that in the classification context errors made by
binary classifiers do not occur independently. Dekel
and Singer (2003) considered the fact that
predefined output codes ignore the complexity of the
induced binary problems, and proposed an approach
in which the set of classifiers and the code are found
concurrently. For our part, we still rely on the error-correcting properties of codes, but only as a point of departure for building our final coding matrix. As for the issues of correlation and of the difficulty of the binary tasks, we deal with both by empirically assessing the joint performance of the set of classifiers. Our approach thus guarantees the choice of informative and complementary binary classifiers.
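As a toy illustration of such an empirical assessment (the actual scheme is the subject of Section 2.1; everything below, including the choice of logistic regression as base learner and the accuracy-minus-correlation score, is our own assumption), a candidate matrix can be scored by combining the validation accuracy of each induced classifier with the pairwise correlation of their outputs:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def joint_score(X, y, M):
        # Reward matrices whose induced binary problems are individually
        # learnable (high validation accuracy) and mutually decorrelated.
        X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)
        preds, accs = [], []
        for b in range(M.shape[1]):
            tr = M[y_tr, b] != 0                 # classes taking part in b
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X_tr[tr], M[y_tr[tr], b])
            p = clf.predict(X_va)                # hard outputs in {-1, +1}
            va = M[y_va, b] != 0                 # examples with a defined target
            accs.append(np.mean(p[va] == M[y_va[va], b]))
            preds.append(p)
        corr = np.corrcoef(np.array(preds, dtype=float))
        redundancy = np.mean(np.abs(corr[np.triu_indices_from(corr, k=1)]))
        return np.mean(accs) - redundancy        # informative and complementary

A search over candidate matrices would then retain the highest-scoring one.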
Coupled with the issue of finding an appropriate decomposition, the other major question concerns the design of a suitable combining strategy for inferring the correct class from the set of binary outputs. Allwein et al (2000) recalled Hamming decoding and proposed loss-based decoding as an alternative for margin-based classifiers. However, this decoding paradigm does not serve our aim of recovering the probability of each class.
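For reference, generalized Hamming decoding, the baseline recalled by Allwein et al (2000), assigns the class whose row of M is closest to the vector of hard binary predictions, a zero entry contributing a distance of 1/2; a minimal sketch:

    import numpy as np

    def hamming_decode(M, f):
        # M: (k, l) coding matrix in {-1, 0, +1};
        # f: (l,) vector of hard binary predictions in {-1, +1}.
        # A zero entry of M contributes 1/2, since it carries no sign.
        d = np.sum((1 - M * f) / 2.0, axis=1)   # distance to each codeword
        return int(np.argmin(d))                # index of the predicted class

Loss-based decoding replaces the term (1 − M(r, s) f_s(x))/2 by L(M(r, s) f_s(x)), where L is the margin loss of the binary learner and f_s(x) its real-valued output; neither variant, however, produces the class probabilities we are after.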
In order to obtain probability estimates, Kong and Dietterich (1997) and Hastie and Tibshirani (1998) proposed combining methods for ECOC and all-pairs, respectively. Zadrozny (2001) extended the latter to the general matrices of Allwein et al (2000).
Nevertheless, in these works the design of a code for
a given multiclass problem is not considered and the
individual biclass classifiers are assumed to return
probabilities, which is not always the case. In order
to fuse the outputs of such a set of classifiers, a
calibration is needed first, as indicated by Zadrozny
and Elkan (2002). Our framework deals with all
these issues in an efficient and generic way.
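As an illustration of what such a calibration step can look like, the sketch below fits a Platt-style sigmoid to map raw classifier scores to probability estimates; this is one standard technique in the spirit of Zadrozny and Elkan (2002), with a regularized logistic fit standing in for Platt's original procedure.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def platt_calibrate(scores, labels):
        # Fit p(y = +1 | s) = 1 / (1 + exp(A*s + B)) on held-out
        # (score, label) pairs; return a score-to-probability mapping.
        lr = LogisticRegression()
        lr.fit(np.asarray(scores).reshape(-1, 1), labels)
        return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

    # Usage sketch: calibrate an SVM's decision values on a validation
    # split, then turn test-time decision values into probabilities.
    # cal = platt_calibrate(svm.decision_function(X_va), y_va)
    # probs = cal(svm.decision_function(X_te))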
In parallel with the work on decomposition into binary problems, the design of multiclass algorithms that treat all classes simultaneously has been addressed, among others, by Zou et al (2005). In particular, they suggested a multiclass SVM. However, it does not aim at recovering the probability of each class: its output is a vector of scores, which the calibration techniques proposed so far cannot handle effectively on limited data, owing to the curse of dimensionality. Besides, we prefer not to be bound to a particular classifier; indeed, our framework can, for instance, exploit SVMs and AdaBoost classifiers simultaneously: the system is free to select the most appropriate model according to the data and their properties.
1.2 Overview of the System
We present a multiclass classification framework which automatically fits any multiclass classification task, regardless of the nature and amount of data or the number of classes. We follow the approach of decomposing the problem into several biclass problems and then combining the biclass predictions. This choice is qualitatively motivated by two main aspects: the aim to recover probability estimates for each class from limited learning data, and the existence of high-performing biclass methods.
As a first main contribution, we propose a scheme to automatically learn an appropriate decomposition given training data and a user-defined measure of performance (Section 2.1); it avoids biclass classification problems that are too correlated, too difficult, or ill-suited to the particular multiclass task. The resulting biclass problems are solved by means of state-of-the-art classifiers, which are automatically optimized and calibrated.
Calibration (Section 2.2) allows the classifiers to
provide probability estimates, and thus makes it
possible to take into account their different