oADABOOST

An ADABOOST Variant for Ordinal Classiﬁcation

ao Costa

and Jaime S. Cardoso

Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal

INESC TEC and Faculdade de Engenharia, Universidade do Porto,

Rua Dr. Roberto Frias, n

378, 4200-465 Porto, Portugal

Keywords:

Supervised Learning, Ordinal Classiﬁcation, Ensemble Methods, ADABOOST.

Abstract:

Ordinal data classiﬁcation (ODC) has a wide range of applications in areas where human evaluation plays an

important role, ranging from psychology and medicine to information retrieval. In ODC the output variable

has a natural order; however, there is not a precise notion of the distance between classes. The Data Repli-

cation Method was proposed as tool for solving the ODC problem using a single binary classiﬁer. Due to

its characteristics, the Data Replication Method is straightforwardly mapped into methods that optimize the

decision function globally. However, the mapping process is not applicable when the methods construct the

decision function locally and iteratively, like decision trees and ADABOOST (with decision stumps). In this

paper we adapt the Data Replication Method for ADABOOST, by softening the constraints resulting from the

data replication process. Experimental comparison with state-of-the-art ADABOOST variants in synthetic and

real data show the advantages of our proposal.

1 INTRODUCTION

One of the most representative problems of super-

vised learning is classiﬁcation, consisting of the es-

timation of a mapping from the feature space into a

ﬁnite class space. Depending on the cardinality of

the output space, we are left with binary or multiclass

classiﬁcation problems. Finally, the presence or ab-

sence of a “natural” order among classes differenti-

ates nominal from ordinal problems.

A large number of real world classiﬁcation prob-

lems can be seen as ordinal tasks, that is, we want

to learn a function that is able to classify points in a

ﬁnite set of classes, which have an ordering relation

between them (e.g. we might want to classify a movie

as good, average or bad).

Note that, since only the order is known, one can-

not simply pick an arbitrary mapping from our classes

to numbers (e.g. bad 7→ 0, average 7→ 1, good 7→ 2)

and solve a regression problem, as our error function

would be highly dependent of that mapping. Also,

our classes might not have a valid numerical interpre-

tation, (e.g. is a good movie equivalent to two average

movies?) therefore choosing an appropriate mapping

is not trivial.

One solution to ordinal problems is to simply ig-

nore the ordering relation, treat them as nominal clas-

siﬁcation problems and use a learning algorithm de-

signed for this kind of tasks (e.g. C4.5 decision trees

and ADABOOST). A better alternative is to devise so-

lutions that take advantage of the order information

in the design of the classiﬁer. A possible approach is

to transform our dataset in such a way that we force

our learning algorithm to respect the ordering rela-

tion. One such transformation is the Data Replication

Method (Cardoso and da Costa, 2007), which trans-

forms an ordinal problem into a larger binary clas-

siﬁcation one. One of the limitations of the Data

Replication Method is that it cannot be immediately

applicable when the decision function is constructed

iteratively and locally, like in decision trees or AD-

ABOOST.

In this work, we present a new ADABOOST vari-

ant that performs its growth on the replicated feature

space, and therefore is able to use the ordering rela-

tion when training the weak classiﬁers.

1.1 Related Work

Frank and Hall presented a simple method that en-

ables standard classiﬁcation algorithms to make use

of ordering information in class attributes (Frank and

Costa J. and Cardoso J..

oAdaBoost - An AdaBoost Variant for Ordinal Classiﬁcation.

DOI: 10.5220/0005191600680076

In Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM-2015), pages 68-76

ISBN: 978-989-758-076-5

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

Hall, 2001). By applying it in conjunction with a deci-

sion tree learner, the authors show that it outperforms

the naive approach, which treats the class values as

an unordered set. Compared to special-purpose algo-

rithms for ordinal classiﬁcation, the method has the

advantage that it can be applied without any modi-

ﬁcation to the underlying learning scheme. The ra-

tionale encompasses using (K − 1) standard binary

classiﬁers to address the K-class ordinal data prob-

lem. Toward that end, the training of the i-th clas-

siﬁer is performed by converting the ordinal dataset

with classes C

, . . . , C

into a binary dataset, discrim-

inating C

, . . . , C

against C

i+1

, . . . , C

. To predict the

class value of an unseen instance, the (K − 1) outputs

are combined to produce a single estimation. Any bi-

nary classiﬁer can be used as the building block of

this scheme. Observe that the (K − 1) classiﬁers are

trained in an independent fashion. This independence

is likely to lead to intersecting boundaries, a topic to

which we will return further on in this paper.

The Data Replication Method (Cardoso and

da Costa, 2007) overcomes the limitations identiﬁed

above by building all the boundaries at once. That

guarantees that the boundaries of the classiﬁers will

never intersect. This method, however, has limitations

with methods that build the decision function itera-

tively (and greedily), and therefore cannot be easily

mapped to ADABOOST.

In the ensemble approach to ordinal data classi-

ﬁcation, although not directly related to our work, it

is worth mentioning the work introducing global con-

straints in the design of decision trees (Cardoso and

Sousa, 2010; Sousa and Cardoso, 2011). The method

consists on growing a tree (or an ensemble of trees)

and relabeling the leaves according to certain con-

straints. Therefore, the trees are still built without

taking the order into account and only post-processed

to satisfy ordinality constraints. Moreover, the post-

processing is very computationally demanding, only

possible in low dimensional input spaces. More re-

cently, the combination of multiple orthogonal direc-

tions has been suggested to boost the performance of

a base classiﬁer (Sun et al., 2014). Sequentially, mul-

tiple orthogonal directions are found; these different

directions are combined in a ﬁnal stage.

There are also some boosting-related approaches

for ordinal ranking. For example, RankBoost (Fre-

und et al., 2003) approach is based on the pairwise

comparison perspective. Lin and Li proposed or-

dinal regression boosting (ORBoost) (Lin and Li,

2006), which is a special instance of the extended bi-

nary classiﬁcation perspective. The ensemble method

most in line with our work is ADABOOST.OR (Lin

and Li, 2009). This method uses a primal-dual ap-

proach to solve an ordinal problem both in the bi-

nary space and the ordinal space, by taking into ac-

count the order relation when updating the binary

point’s weights. However, ADABOOST.OR is more

constrained than our proposed approach; while AD-

ABOOST.OR is closer to a single ADABOOST instan-

tiated with an ordinal data classiﬁer, our approach is

closer to having multiple ADABOOST coupled in the

construction of the weak classiﬁer.

2 BACKGROUND

In this section we start by analysing the Frank and

Hall’s approach to ordinal classiﬁcation (Frank and

Hall, 2001), which facilitates the introduction of the

Data Replication Method (Cardoso and da Costa,

2007). The Data Replication Method is a framework

for ordinal data classiﬁcation that allows the applica-

tion of most binary classiﬁcation algorithms to ordinal

classiﬁcation and imposes a parallelism constraint on

the resulting boundaries. In the end, we summarize

the ADABOOST ensemble method, paving the way to

the presentation of the proposed adaptation of AD-

ABOOST to ordinal data.

2.1 Frank and Hall Method

Suppose we want to learn a function f : X → Y ,

where X is our feature space and Y =

{

, C

, ..., C

}

is our output space, where our labels are ordered ac-

cording to C

≺ C

≺ ... ≺ C

. Also, assume that we

have a dataset D = (D, f ), where D ⊆ X is our set of

examples and f : D → Y gives us the label of each

example.

The Frank and Hall method transforms the K class

ordinal problem into (K − 1) binary problems by cre-

ating (K − 1) datasets D

= (D, f

) where:

(x) =

(

−

if f (x)  C

if f (x)  C

Intuitively, learning a binary classiﬁer from each

of the D

datasets will create (K − 1) classiﬁers that

answer the questions “is the label of point x larger

than C

?”. This is to say, each classiﬁer will give us

an estimate of P( f (x)  C

Frank and Hall then propose that one ﬁnds the

P( f (x) = C

) using the usual rule:











1 − P( f (x)  C

) if k = 1

P( f (x)  C

k−1

) − P( f (x)  C

) if k ∈ [2, K − 1]

P( f (x)  C

K−1

) if k = K

oAdaBoost-AnAdaBoostVariantforOrdinalClassification

Even though our conversion from ordinal to bi-

nary guarantees that f

(x) = C

−

⇒ f

k+1

(x) = C

−

and

(x) = C

⇒ f

k−1

(x) = C

, those rules do not al-

ways hold for the learnt probabilities. In practice,

this means that it is possible that the combination

rule proposed by Frank and Hall returns a negative

probability for some classes. One solution to this

problem is setting that negative probabilities to zero.

Another possible way to combine our binary classi-

ﬁers is by a simple counting method:

f (x) = C

, with

i = 1 +

∑

K−1

k=1

[[ f

(x) = C

]]

. In either way, the main

conceptual problem is not addressed.

2.2 Consistency and Parallelism

One important concept in ordinal data classiﬁcation is

the idea of consistency with the ordinal setting (Car-

doso and Sousa, 2010). The idea behind it is intu-

itive: a small change in the input data should not lead

to a ‘big jump’ in the output decision (e.g. it is not

expected that a small change in a feature makes the

estimated product quality to go from “bad” to “good”

without going through “average”). One way to guar-

antee this restriction is by enforcing that there is no

intersection between decision boundaries, which can,

in turn, be guaranteed by enforcing our boundaries to

be parallel.

In Figure 1 it is possible to see two partitions of the

input space corresponding to two different classiﬁers,

one without the parallelism constraint and another one

with it. In the ﬁrst one, a small variation around the

point marked with

can make an abrupt jump be-

tween two very distinct classes (“black” and “white”),

while on the second classiﬁer that is not possible.

It can be seen that the method proposed by Frank

and Hall does not respect this concept, and therefore

can lead to problematic classiﬁers.

2.3 The Data Replication Method

The Data Replication Method uses the idea behind the

Frank and Hall method in a different way: instead of

simply creating (K −1) datasets, it extends the feature

space so that all points from the (K − 1) replicas are

present on the same dataset.

Assume e

as a vector composed of (K − 2) ze-

ros and a vector e

as a vector composed by (K − 3)

zeros and a positive constant (e.g. 1) on the q-th posi-

tion (e.g. if K = 5, e

= [0, 1, 0]). We then transform

each point x ∈ D into (K −1) points z

= (x, e

). This

does allows us to train a single binary classiﬁer on the

whole data and then combine the results in the same

[[·]] is the indicator function. [[·]] is 1 if the inner condi-

tion is true, 0 otherwise.

0 1 2 3 4

(a)

0 1 2 3 4

(b)

Figure 1: Comparison of a (a) non-consistent multiclass

classiﬁer vs. (b) a multiclass classiﬁer with parallelism con-

straints

fashion as previously presented. Also, this replication

guarantees that some algorithms such as Suport Vec-

tor Machines (SVMs) and Artiﬁcial Neural Networks

(ANNs) will always produce parallel boundaries on

the ﬁnal ordinal space. A simple example of the Data

Replication Method can be seen in Figure 2. Note

that: a) x ∈ R while z ∈ R

; b) a single binary clas-

siﬁer is designed in the extended dataset to solve the

multiclass problem in the original space. The inter-

section of the binary boundary with each of the K − 1

replicas provide the necessary K −1 boundaries in the

original space. Further details of the method can be

found on the original paper (Cardoso and da Costa,

2007).

2.4 ADABOOST

ADABOOST is a boosting algorithm introduced by

Freund and Schapire (Freund and Schapire, 1995).

Boosting algorithms are a part of a big set of machine

learning techniques called ensemble methods which

general idea is to use several models to classify obser-

vations and combine them together to obtain a classi-

ICPRAM2015-InternationalConferenceonPatternRecognitionApplicationsandMethods

−4 −2 0 2 4

−1

−0.5

0.5

(a)

−4 −2 0 2 4

0.2

0.4

0.6

0.8

(b)

−4 −2 0 2 4

0.2

0.4

0.6

0.8

(c)

−4 −2 0 2 4

−1

−0.5

0.5

(d)

Figure 2: Toy example of the Data Replication Method. (a)

Original Problem. (b) Problem on the replicated space. (c)

Solution to the problem on the replicated space. (d) Result

on the original space.

ﬁer with a predictive performance superior than any

of its constituents. ADABOOST uses a weak learner

to classify observations. A weak learner is deﬁned as

a classiﬁer which is only slightly correlated with the

true data labels. In the case of binary prediction, a

weak learner is a classiﬁer which is only slightly bet-

ter than throwing a coin and deciding an object’s class

according to the trial’s result.

During each iteration, the algorithm trains a weak

learner

(x) using an iteratively determined distribu-

tion and selects the weak hypothesis minimizing the

expected error rate. After selecting the best weak hy-

pothesis

for the distribution D

, the observations

correctly identiﬁed by

are weighted less than

those misclassiﬁed, so that the algorithm will, when

ﬁtting a new weak hypothesis to D

t+1

in the next it-

eration, select one such rule which identiﬁes better

those observations that its predecessor failed. The

output of the ADABOOST algorithm is a ﬁnal or com-

bined hypothesis

is simply the sign of a

weighted combination of the weak hypothesis, i.e.,

H is a weighted majority rule of the weak classiﬁers,

(x) = sign



∑

t=1

(x)



3 oADABOOST

While the Data Replication Method (DRM) has al-

ready been instantiated with SVMs, ANNs and Kernel

Discriminant Analysis (Cardoso and da Costa, 2007;

Cardoso et al., 2012), its mapping to ADABOOST

is not trivial, as illustrated in Figure 3 for an AD-

ABOOST with a decision stump. This is due to the

fact that a decision stump can only make cuts at one

attribute at a time, and therefore, as it can be imag-

ined, there are only two possible types of cuts in our

data replicated space:

• If the cut happens on one of the original attributes,

then the cuts on each replica will be on the same

position (see Figure 3(a)).

• If the cut happens on one of the new attributes,

then the cut will represent a constant factor on the

original space (see Figure 3(b)).

Note that the difﬁculties of instantiating the DRM

with ADABOOST remain true for any weak learner

that use a single attribute. Since in each iteration

a weak learner is designed without strong ties with

the other weak learners, the strong classiﬁer resulting

from the boosting process may not possess the desired

property of consistency.

In here we propose a soft DRM, where the non in-

tersecting constraint is imposed on each iteration. Al-

though the ﬁnal strong classiﬁer may not possess that

property, intuitively, the ﬁnal model is biased towards

consistent models.

In order to exploit the parallelism constraint that

is usually imposed by the Data Replication Method,

instead of training one weak classiﬁer, we can train

(K − 1) weak classiﬁers (one for each replica), where

we force them to use the same attribute but with dif-

ferent thresholds. This will in turn guarantee that our

cuts are parallel. This makes our soft DRM a hybrid

of the original DRM and the Frank and Hall method.

We apply this idea to ADABOOST by indepen-

dently boosting each replica while forcing that, at

oAdaBoost-AnAdaBoostVariantforOrdinalClassification

0.5

(a)

0.5

(b)

Figure 3: Problems with decision stumps and the data repli-

cation model. (a) Cut on one of the new attributes. (b) Cut

on one of the original attributes.

every iteration, all weak classiﬁers use the same at-

tribute:

1. As before, we create (K −1) binary replicas of our

dataset;

2. Initialize the weights such as the sum of the

weights of each replica is 1;

3. Coupled selection of the weak classiﬁer between

all replicas;

(a) For a decision stump, pick the best attribute

(possibly with different decision thresholds)

att

based on a combination of the errors of each

replica;

4. Calculate the weight of each observation j in each

replica k at iteration i;

5. If a replica has an error superior to 50%, stop

boosting that replica;

6. The training stops if every replica has an error su-

perior to 50% or after a set number of iterations.

Note that the (K −1) binary ADABOOST are tied only

in step 3, which can be considered as part of build-

ing the weak learner.Also note the error of each weak

classiﬁer is computed independently for each replica,

resulting also in an independent update of the weights

and in independent estimation of the contribution of

the binary weak classiﬁer to the ﬁnal binary strong

classiﬁer. Moreover, the boosting process can stop

earlier in one replica than in the others. Nothing of

this is true for ADABOOST.OR.

When using the decision stump as weak classiﬁer,

all (K − 1) replicas of ADABOOST are constrained

to use the same attribute but with (likely) different

thresholds. The attribute is chosen to minimize a

function (in our case we selected the average but the

median or the maximum were also sensible options)

of the individual (K − 1) errors:

argmin

att

K − 1

K−1

∑

k=1

Err

(att)

The individual error Err

(att) is the smallest mis-

classiﬁcation error (by optimizing the threshold) in

replica k when using attribute att.

A more detailed explanation of this algorithm can

be seen on Algorithm 1.

4 EXPERIMENTAL STUDY

In order to compare the performance of the various

ADABOOST variants, we performed experiments on

both artiﬁcial and real-world datasets. All variants

were implemented on top of Weka’s ADABOOST.M1

implementation. The results were obtained by per-

forming 10 experiments using 10-fold cross valida-

tion, with the number of iterations limited to 100. The

statistical signiﬁcance analysis was done in Weka’s

experimenter interface using a corrected paired t-test,

with a conﬁdence of 0.05. The error metrics used

were the Mean Error Rate (MER) and the Mean Ab-

solute Error (MAE)

4.1 Ordinal Datasets

In our experiments we use two synthetic datasets and

six real datasets. A more detailed description of each

dataset is presented in Table 1.

The synthetic datasets are the following:

1. 1000 points x = (x

, x

)

were randomly gen-

erated in the unit square [0, 1] × [0, 1] ∈ R

according to the uniform distribution. The

class was assigned to each point accord-

ing to its distance to (0.5, 0.5): f (x) =



6 × ((x

− 0.5)

+ (x

− 0.5)

)



We calculated the absolute error in the following way:

If our predicted class is C

and the correct class is C

, then

the absolute error is |c − p|. Weka comes with an imple-

mentation of MAE for classiﬁcation that is different from

the one we presented and is not suited for ordinal tasks, and

therefore should not be used to reproduce our results.

ICPRAM2015-InternationalConferenceonPatternRecognitionApplicationsandMethods

Data: A dataset D = (D, f ) with N elements

Result: A classiﬁer

Replicate the dataset in K − 1 binary replicas D

= (D

, f

)

Initialize the example weights as w

;

Initialize active

= true; i := 1;

while ¬EndingCondition do

forall the att ∈ Attributes do

forall the k : 0 ≤ k < K − 1 ∧ active

k,att

= train(D

, att);

error

k,att

∑

j=1

[[

k,att

) 6= f (d

)]];

end

att

= argmin

att

combination(error

k,att

) ; // The attribute to split on

forall the k : 0 ≤ k < K − 1 ∧ active

att

;

error

= error

att

;

if error

> 0.5 then

active

:= f alse; // Stop boosting this replica

else

= 0.5log(

error

1−error

) ; // Classifier weight

forall the j do

:= w

(α

)

1−[[

)6= f

)]]

; // Updates the example weights

end

:= F

∪





; // Binary ensemble for replica k

end

Normalize the weights so that ∀k :

∑

= 1;

i := i + 1

end

forall the k do

(x) = sign(

∑

(x)) ; // Binary classifier for replica k

end

= getOrdinalClassifier(

, ...,

K−1

) ; // Combines the binary results.

Algorithm 1: oADABOOST with decision stump.

2. 1000 points from a dataset commonly used for

evaluating ordinal data methods (Cardoso and

da Costa, 2007) (this dataset is also shown on Fig-

ure 4).

In the Balance-Scale dataset each example rep-

resenting a balanced scale tipped to the right, to

the left or balanced, based on the relation between

the left distance, right distance, left weight and

right weight. It is available on the UCI reposi-

tory (https://archive.ics.uci.edu/ml/). The Arie Ben

David datasets are available on the MLData Repos-

itory (https://mldata.org/) and consist of examples

classiﬁed on a ordinal scale according to subjective

judgements (e.g. the degree of ﬁtness of a candidate

to a certain job). The BCCT dataset, encompassing

1144 observations, expresses the aesthetic evaluation

0 0.2 0.4

0.6

0.8 1

0.2

0.4

0.6

0.8

Figure 4: Synthetic dataset.

oAdaBoost-AnAdaBoostVariantforOrdinalClassification

Table 1: List of considered datasets. The number of attributes excludes the class attribute and the SWD dataset has an unused

label.

Name Points Attributes Labels Class Distribution

Synthetic 1 (Circle) 1000 2 3 [523, 413, 63]

Synthetic 2 (Non-monotonic) 5000 2 5 [115, 296, 225, 229, 135]

Arie Ben David ERA 1000 3 9 [92, 142, 181, 172, 158, 118, 88, 31, 18]

Arie Ben David ESL 488 4 9 [2, 12, 38, 100, 116, 135, 62, 19, 4]

Arie Ben David LEV 1000 4 5 [93, 280, 403, 197, 27]

Arie Ben David SWD 1000 10 4(5) [32, 352, 399, 217]

Balance-Scale 625 4 3 [288, 49, 288]

BCCT 1144 30 4 [160, 592, 272, 120]

of Breast Cancer Conservative Treatment (Cardoso

and Cardoso, 2007; Cardoso and Sousa, 2011). For

each patient submitted to BCCT, 30 measurements

were recorded, capturing visible skin alterations or

changes in breast volume or shape. The aesthetic

outcome of the treatment for each and every patient

was classiﬁed in one of the four categories: Excellent,

Good, Fair and Poor.

4.2 Results

In Table 2 we show a comparison of the follow-

ing ADABOOST variants, instantiated with Decision

Stumps, limited to 100 iterations:

oADABOOST Our ADABOOST variant.

ADABOOST.M1 One of the most common AD-

ABOOST variants with support for multiple

classes (Freund et al., 1996).

ADABOOST.M1W A small variant of the AD-

ABOOST.M1 algorithm designed to have better

performance on problems with many possible

classes (Eibl and Pfeiffer, 2002).

ADABOOST.OR A variant of the ADABOOST.M1

designed for ordinal classiﬁcation (Lin and Li,

2009). Since it needs an ordinal classiﬁer as the

weak learner, we use a decision stump that picks

(K − 1) cuts on the same attribute.

Frank and Hall (with ADABOOST) The method

proposed by Frank and Hall (Frank and Hall,

2001) instantiated with ADABOOST. Note that

in this method the (K − 1) weak learners are

independently designed, while in our proposed

oADABOOST method they are coupled.

Results seem to also indicate a slight superi-

ority of ADABOOST.M1W over ADABOOST.M1,

more clear in the datasets with more classes. AD-

ABOOST.OR seems superior to ADABOOST.M1W

and ADABOOST.M1, suggesting that the integra-

tion of the knowledge of the order in the design

brings performance advantages. It is also possible

to see that oADABOOST and Frank and Hall meth-

ods present the most favorable results when com-

pared to the other boosting algorithms under compari-

son. The performance difference between our method

oADABOOST and Frank and Hall instantiated with

ADABOOST seems negligible. However, note that

since oADABOOST is constrained to use the same at-

tribute in all (K − 1) weak learners (which is not the

case in the Frank and Hall method), it results in sim-

pler models without loss of performance.

(a) (b)

(c)

Figure 5: Boundaries generated by (a) oADABOOST, (b)

Frank and Hall (with ADABOOST) and (c) ADABOOST.OR

on our synthetic non-monotonic dataset.

Based on the boundaries of the various classi-

ﬁers trained on our synthetic non-monotonic dataset

(Figure 5), it appears that ADABOOST.OR has more

problems with noise, while both oADABOOST and

the Frank and Hall method have problems with non-

monotonicity.

ICPRAM2015-InternationalConferenceonPatternRecognitionApplicationsandMethods

Table 2: Comparison of oADABOOST with ADABOOST variants.

(a) Mean Error Rate: mean (standard deviation) of 10 experiments

Dataset oADABOOST ADABOOST.M1 ADABOOST.M1W ADABOOST.OR Frank and Hall

Circle 0.07(0.03) 0.40(0.03)• 0.55(0.01)• 0.16(0.04)• 0.07(0.03)

Non-mon. 0.66(0.03) 0.70(0.02)• 0.61(0.05)◦ 0.76(0.02)• 0.50(0.04)◦

ERA 0.75(0.04) 0.78(0.02) 0.78(0.04) 0.78(0.02) 0.73(0.05)

ESL 0.33(0.06) 0.57(0.03)• 0.47(0.06)• 0.45(0.05)• 0.33(0.06)

LEV 0.38(0.04) 0.58(0.03)• 0.42(0.05)• 0.50(0.04)• 0.38(0.05)

SWD 0.43(0.05) 0.48(0.04)• 0.48(0.05)• 0.48(0.04)• 0.43(0.05)

Balance 0.03(0.02) 0.28(0.04)• 0.08(0.02)• 0.17(0.08)• 0.04(0.02)

BCCT 0.13(0.03) 0.37(0.03)• 0.38(0.05)• 0.32(0.03)• 0.13(0.03)

(b) Mean Absolute Error: mean (standard deviation) of 10 experiments

Dataset oADABOOST ADABOOST.M1 ADABOOST.M1W ADABOOST.OR Frank and Hall

Circle 0.07(0.03) 0.44(0.03)• 0.55(0.01)• 0.16(0.04)• 0.07(0.03)

Non-Mon. 0.99(0.07) 1.30(0.08)• 1.19(0.14)• 1.03(0.04) 1.02(0.03)

ERA 1.24(0.10) 1.43(0.07)• 1.44(0.12)• 1.43(0.07)• 1.34(0.13)•

ESL 0.35(0.07) 0.73(0.06)• 0.56(0.08)• 0.51(0.07)• 0.35(0.07)

LEV 0.41(0.05) 0.71(0.03)• 0.46(0.06)• 0.57(0.05)• 0.42(0.06)

SWD 0.45(0.05) 0.50(0.04)• 0.54(0.06)• 0.50(0.04)• 0.46(0.05)

Balance 0.03(0.02) 0.49(0.09)• 0.08(0.02)• 0.18(0.09)• 0.04(0.02)

BCCT 0.13(0.03) 0.38(0.03)• 0.40(0.07)• 0.33(0.03)• 0.14(0.03)

◦, • statistically signiﬁcant improvement or degradation.

5 CONCLUSIONS

In this work we have presented a new variant of the

well known ADABOOST algorithm designed for or-

dinal classiﬁcation. In the proposed methodology,

(K − 1) binary ADABOOST are built in parallel, tied

in phase of designing the weak learner. In the end,

the (K − 1) strong binary classiﬁers are combined to

yield the multiclass model.

Based on our results, it appears that by enforc-

ing local constraints at each boosting iteration (in this

case, by enforcing local parallelism) and by working

on the replicated space, one can achieve better re-

sults on ordinal classiﬁcation tasks, when compared

to other ADABOOST variants instantiated with deci-

sion stumps. We plan now to extend these ideas to

other learning methodologies whose behaviour is sim-

ilar to ADABOOST (i.e. the ﬁnal classiﬁer is built

from a set of weak classiﬁers that only use one at-

tribute), such as decision trees (which should have

less issues with non-monotonic datasets, since the

recursive division of the space should lead to small

monotonic cells). We also plan to study the impact

of stronger restrictions on the set of weak classiﬁers

(e.g. enforce the splits to be ordered), as that should

lead to classiﬁers more similar to the ones generated

via the original DRM.

ACKNOWLEDGEMENTS

This work is ﬁnanced by the ERDF - European Re-

gional Development Fund through the COMPETE

Programme (operational programme for competitive-

ness) and by National Funds through the FCT within

project PTDC/SAU-ENB/114951/2009. The work

leading to these results has received funding from

the European Community’s Seventh Framework Pro-

gramme under grant agreement n FP7–600948. The

authors would also like to thank the Masters in Infor-

matics and Computing Engineering and Department

of Informatics Engineering of the Faculty of Engi-

neering of the University of Porto for supporting some

of the costs of this work.

REFERENCES

Cardoso, J. S. and Cardoso, M. J. (2007). Towards an intel-

ligent medical system for the aesthetic evaluation of

breast cancer conservative treatment. Artiﬁcial Intel-

ligence in Medicine, 40:115–126.

Cardoso, J. S. and da Costa, J. F. P. (2007). Learning to clas-

sify ordinal data: the data replication method. Journal

of Machine Learning Research, 8:1393–1429.

Cardoso, J. S. and Sousa, R. (2010). Classiﬁcation models

with global constraints for ordinal data. In Proceed-

ings of The Ninth International Conference on Ma-

chine Learning and Applications (ICMLA), pages 71–

77.

oAdaBoost-AnAdaBoostVariantforOrdinalClassification

Cardoso, J. S. and Sousa, R. (2011). Measuring the per-

formance of ordinal classiﬁcation. International Jour-

nal of Pattern Recognition and Artiﬁcial Intelligence,

25(8):1173–1195.

Cardoso, J. S., Sousa, R., and Domingues, I. (2012). Ordi-

nal data classiﬁcation using kernel discriminant anal-

ysis: A comparison of three approaches. In Pro-

ceedings of The Eleventh International Conference on

Machine Learning and Applications (ICMLA), pages

473–477.

Eibl, G. and Pfeiffer, K. P. (2002). How to make adaboost.

m1 work for weak base classiﬁers by changing only

one line of the code. In Machine Learning: ECML

2002, pages 72–83. Springer.

Frank, E. and Hall, M. (2001). A simple approach to ordinal

classiﬁcation. Springer.

Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003).

An efﬁcient boosting algorithm for combining prefer-

ences. Journal of Machine Learning Research, 4:933–

969.

Freund, Y. and Schapire, R. E. (1995). A decision-theoretic

generalization of on-line learning and an application

to boosting. In Proceedings of the Second European

Conference on Computational Learning Theory, Eu-

roCOLT ’95, pages 23–37. Springer-Verlag.

Freund, Y., Schapire, R. E., et al. (1996). Experiments with

a new boosting algorithm. In ICML, volume 96, pages

148–156.

Lin, H.-T. and Li, L. (2006). Large-margin thresholded en-

sembles for ordinal regression: Theory and practice.

In Balc

azar, J., Long, P., and Stephan, F., editors, Al-

gorithmic Learning Theory, volume 4264 of Lecture

Notes in Computer Science, pages 319–333.

Lin, H.-T. and Li, L. (2009). Combining ordinal pref-

erences by boosting. In Proceedings ECML/PKDD

2009 Workshop on Preference Learning, pages 69–83.

Sousa, R. and Cardoso, J. S. (2011). Ensemble of deci-

sion trees with global constraints for ordinal classiﬁ-

cation. In International Conference on Intelligent Sys-

tems Design and Applications (ISDA).

Sun, B.-Y., Wang, H.-L., Li, W.-B., Wang, H.-J., Li, J.,

and Du, Z.-Q. (2014). Constructing and combining

orthogonal projection vectors for ordinal regression.

Neural Processing Letters, pages 1–17.

ICPRAM2015-InternationalConferenceonPatternRecognitionApplicationsandMethods