Learning Ensembles in the Presence of Imbalanced Classes

Amal Saadallah¹, Nico Piatkowski¹, Felix Finkeldey², Petra Wiederkehr² and Katharina Morik¹
¹Artificial Intelligence Group, Department of Computer Science, TU Dortmund, Germany
²Virtual Machining Group, Department of Computer Science, TU Dortmund, Germany

Keywords: Class Imbalance, Ensemble, Classification.
Abstract:
Class imbalance occurs when data classes are not equally represented. Generally, it occurs when some classes
represent rare events, while the other classes represent the counterpart of these events. Rare events, especially
those that may have a negative impact, often require informed decision-making in a timely manner. However,
class imbalance is known to induce a learning bias towards majority classes which implies a poor detection of
minority classes. Thus, we propose a new ensemble method to handle class imbalance explicitly at training
time. In contrast to existing ensemble methods for class imbalance that use either data-driven or randomized
approaches for their construction, our method exploits both directions. On the one hand, ensemble members
are built from randomized subsets of training data. On the other hand, we construct different scenarios of
class imbalance for the unknown test data. An ensemble is built for each resulting scenario by combining
random sampling with the estimation of the relative importance of specific loss functions. Final predictions are
generated by a weighted average of each ensemble prediction. As opposed to existing methods, our approach
does not try to fix imbalanced data sets. Instead, we show how imbalanced data sets can make classification
easier, due to a limited range of true class frequencies. Our procedure promotes diversity among the ensemble
members and is not sensitive to specific parameter settings. An experimental demonstration shows that our
new method outperforms or is on par with state-of-the-art ensembles and class imbalance techniques.
1 INTRODUCTION
In many real-world situations, rare events and unusual
behaviors, such as process failure or instability in
machine engineering, rare diseases and bank frauds,
are usually represented with imbalanced data obser-
vations. In other words, one or more classes, usually
the ones that represent such events, are underrepre-
sented in the data set. This issue, known to the Data
Mining community as the class imbalance problem,
makes the detection of rare events a challenging task
(Galar et al., 2012; Haixiang et al., 2017).
In the case of binary classification with a class imbalance ratio of 1%, a trivial solution that always predicts the majority class will achieve an accuracy of 99%. Though the accuracy seems high, the solution is meaningless, since not a single instance of the minority class is classified correctly. Thus, in our work, we make the implicit assumption that the minority class has a higher cost than the majority class (Zhou, 2012).
Several machine learning approaches have been
proposed over the past decades to handle the class
imbalance problem, most of which are based on re-
sampling techniques, cost sensitive learning, and en-
semble methods (Galar et al., 2012; Haixiang et al.,
2017).
In this paper, we address the problem of class im-
balance via an ensemble method. In general, ensem-
ble methods train multiple classifiers and combine
their predictions to solve the same classification task¹.
Ensembles are known to deliver a higher quality than
each single ensemble member (Hansen and Salamon,
1990) and they provide state-of-the-art results in var-
ious real-world tasks (Galar et al., 2012).
Ensemble methods are not designed to work with
imbalanced data sets since they do not take class
imbalance into account during the ensemble con-
struction. However, they have shown successful re-
sults when applied to this task through the combi-
nation of various processing techniques at a data-
level (e.g. random sampling, feature selection, cost-
sensitive learning methods) with different ensemble
methods (Galar et al., 2012; Haixiang et al., 2017).
¹ While ensembles may indeed be applied to regression problems as well, we focus in this work on classification tasks.
Saadallah, A., Piatkowski, N., Finkeldey, F., Wiederkehr, P. and Morik, K. Learning Ensembles in the Presence of Imbalanced Classes. DOI: 10.5220/0007681508660873. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2019), pages 866-873. ISBN: 978-989-758-351-3. Copyright © 2019 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved.
Nevertheless, to the best of our knowledge, none of these techniques allows the inclusion of explicit (estimated) knowledge about the class imbalance in the distribution.
In this work, we explain how to address class im-
balance directly by a new ensemble construction that
exploits both data-driven and randomized approaches.
On the one hand, ensemble members are built from
randomized subsets of the training data. On the other
hand, we use the class ratio from the training data to
construct different scenarios of class imbalance in the
unknown test set. The random sampling is then car-
ried out for each scenario. This procedure promotes
diversity among the ensemble members and subdi-
vides the ensemble into multiple weighted stages. As
opposed to existing methods, our approach does not
try to fix imbalanced data sets. Instead, we show how
imbalanced data sets can make classification easier,
due to a limited range of true class frequencies.
2 REVIEW ON EXISTING
TECHNIQUES FOR HANDLING
IMBALANCED DATA SETS
Several works have been proposed in the literature to
address the class imbalance problem over the last
decades. Resampling methods and cost-sensitive
learning are the two main strategies that have been
employed for imbalanced learning.
Resampling strategies rebalance the data
set in order to mitigate the effect of the bias of ma-
chine learning algorithms towards majority classes
which results in poor generalization and unaccept-
able error rates on minority classes (Japkowicz, 2000;
Chawla et al., 2002). Resampling methods are adapt-
able preprocessing techniques as they are independent
of the selected classifier. They fall into three main
families with regard to the method used for balancing the class distribution:
Over-sampling methods: aim at balancing the class distribution through the creation of new minority class samples, either by random duplication (Japkowicz, 2000) or by synthetic creation. SMOTE (synthetic minority oversampling technique) (Chawla et al., 2002) artificially generates synthetic instances from the minority class by random linear interpolation between a minority sample and one of its k nearest neighbours. Many approaches
based on SMOTE were introduced in the literature
(Gao et al., 2012; Zhang and Li, 2014). Gao et al.
(Gao et al., 2012) employed a Parzen-window kernel
function to estimate the probability density function
of the minority class, from which synthetic instances
are generated as additional training data to rebalance
the class distribution. Zhang et al. (Zhang and Li,
2014) introduced RWO-sampling which is a random
walk oversampling method to balance the class distri-
bution by generating synthetic instances via random
walks through the training data.
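The interpolation step at the core of SMOTE can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the reference implementation: neighbours are found by brute force, `k` is fixed, and the function name is our own.

```python
import math
import random

def smote_sample(minority, k=3, seed=0):
    """One synthetic instance: linear interpolation between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # brute-force k nearest neighbours of `base` within the minority class
    neighbours = sorted((p for p in minority if p is not base),
                        key=lambda p: math.dist(p, base))[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()                      # interpolation factor in [0, 1)
    return tuple(b + lam * (n - b) for b, n in zip(base, neighbour))

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(smote_sample(minority))  # a point on a segment between two minority samples
```

Because the synthetic point lies on a segment between two existing minority samples, it stays inside the convex hull of the minority class.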
Under-sampling methods: aim at balancing the class
distribution through the random elimination of sam-
ples from majority classes. Batista et al. (Batista
et al., 2000) proposed a more sophisticated under-
sampling technique by classifying samples from the
majority class into three main categories: safe, borderline, and noise. Only safe majority class instances and
all minority class instances are considered for training
the classifier.
Hybrid methods: are a combination of over-sampling
and under-sampling methods (Peng and Yao, 2010;
Cateni et al., 2014). AdaOUBoost (Peng and
Yao, 2010) adaptively oversamples minority class in-
stances and undersamples majority class ones to build
different classifiers. The classifiers are combined ac-
cording to their accuracy to create the final prediction.
Cateni et al. (Cateni et al., 2014) address the class
imbalance problem by combining an oversampling and a similarity-based under-sampling technique.
Solving the class imbalance problem is also possi-
ble through algorithmic modifications of existing ma-
chine learning classifiers or ensemble constructions.
Several works reported in (Haixiang et al., 2017) em-
ploy various modifications to existing learning meth-
ods, e.g. support vector machines, k-nearest neigh-
bors, or neural networks. Modifications can be in-
troduced by enhancing the discriminatory power of
the classifiers using kernel transformation to increase
the separability of the original training space (Gao
et al., 2016). They can also be formulated by converting the loss functions to penalize errors on minority samples more strongly (Kim et al., 2016).
Combining classifiers in ensemble frameworks is
also another common approach in the class imbal-
ance literature (Galar et al., 2012; Haixiang et al.,
2017). Within ensemble-based classifiers, we can dis-
tinguish four main families. The first family includes
resampling-based ensembles. An ensemble of classi-
fiers is created after training base classifiers on bal-
anced data sets obtained with a resampling technique
(Sun et al., 2015; Tian et al., 2011). In the second
family, the ensemble is built based on boosting af-
ter applying some data resampling strategy. This ap-
proach adds a bias toward the minority class to the
weight distribution used to train the next classifier at
each iteration. Most of the proposed methods inside
this family are based on the first practical boosting algorithm, AdaBoost, proposed by Freund and Schapire (Freund et al., 1996). Many extensions of AdaBoost were employed in the context of class imbalance, such as AdaBoost.M1 and AdaBoost.M2 (Freund and Schapire, 1997; Schapire and Singer, 1999).
SMOTEBoost (Chawla et al., 2003) is a combina-
tion of SMOTE and AdaBoost.M2, where synthetic
instances are introduced before the reweighting step
in AdaBoost.M2. The synthetic instances have the
same weight as the instances in the original dataset,
while the weights attributed to the originals are up-
dated according to a pseudo-loss function. RUSBoost
(Seiffert et al., 2010) is another example of boosting combined with random sampling; in contrast to SMOTEBoost, it uses random under-sampling to eliminate instances from the majority class in each
iteration. Within the third family, we find Bagging-
based ensembles, which are known for their simplic-
ity and good generalization ability (Galar et al., 2012).
The key factor in these methods is how resampling for class imbalance is combined with the construction of each bootstrap replica, in order to obtain a useful classifier in each iteration while maintaining diversity. Many approaches have been developed using bagging ensembles (Galar et al., 2012), such as
UnderBagging (Barandela et al., 2003) and OverBag-
ging (Galar et al., 2012). SMOTEBagging (Wang and
Yao, 2009) is also an example of bagging ensembles,
where minority class instances are created through a
combination of over-sampling and SMOTE and the
set of majority class instances is bootstrapped in each
iteration in order to form a more diverse ensemble.
3 GENERAL METHODOLOGY
Here, input data is denoted as a multi-set D = {(x^(i), y^(i))}_{1≤i≤N} which consists of |D| = N data points x^(i) and their corresponding labels y^(i). We assume that each pair (x^(i), y^(i)) is an independent realization of a random variable (X, Y) which follows some arbitrary but fixed probability measure P. Each random variable has its own state space, which is denoted by a calligraphic version of the random variable's letter, e.g., 𝒳 denotes the state space of X. In this paper, we will use 𝒳 = ℝ^d for some d ∈ ℕ and binary class labels, i.e., 𝒴 = {0, 1}. However, our method works with any state spaces which are supported by the underlying classifiers. Generic realizations of random variables are denoted by lowercase boldface letters, like x. The symbol 𝟙_{expression} is the indicator function that evaluates to 1 if and only if the expression is true. We denote empirical estimates by a tilde symbol, e.g., Ẽ[X] is the expected value of X estimated from data, while E[X] is the true expectation. To simplify notation, we use P(Y = y | X = x) = P(y | x) whenever the corresponding random variables can be inferred from the context.
3.1 Motivation
To understand the intuition behind our approach, suppose we learn a model M from a data set D to solve a classification task. Choosing M = P results in the Bayes optimal classification ŷ = argmax_{y∈𝒴} P(y | x), where ŷ is the predicted class label. P is unknown, but we may indeed try to learn it from data, resulting in the estimate P̃(y | x) = P̃(y, x)/P̃(x) ∝ P̃(y, x). However, since our data set D is finite, there will always be a discrepancy between the true P and its empirical counterpart P̃. Due to this deviation from the true distribution, our classifier is likely to behave erroneously. If we knew the estimation error ε(y, x) = P(y, x) − P̃(y, x) for each possible data point (y, x) ∈ 𝒴 × 𝒳, we could correct our model.
3.2 Class Frequency Correction for Binary Classification

Now, assume 𝒴 = {0, 1} and consider that we estimate P̃ from an arbitrary but fixed data set D. Without any prior knowledge about ε, we must distinguish the following three cases (it suffices to consider only one class due to the symmetry of the binary classification task): (1) P(Y = 1) > P̃(Y = 1), (2) P(Y = 1) ≈ P̃(Y = 1), and (3) P(Y = 1) < P̃(Y = 1). To prepare our classifier for these situations, we construct three new data sets D^>_Val, D^≈_Val, D^<_Val, one for each of the above scenarios. In the first case, the class y = 1 is actually more likely than the data suggests; hence ε > 0, and we should create a subsample D^>_Val from D such that P̃^>(Y = 1) = P̃(Y = 1) + ε, where P̃^> is the class distribution in D^>_Val. We refer to this process as rebalancing. The third case is symmetrical: we have ε < 0, and we subsample D^<_Val from D. Finally, for case (2), we subsample the set D^≈_Val via stratified proportionate allocation: the class distribution will approximately match the class distribution in D, and ε ≈ 0.
The general procedure works as follows: Split all data which is available for training into a training set and a validation set. Then, subsample the validation set to generate the three specific sets D^>_Val, D^≈_Val, D^<_Val. For each case µ ∈ {>, ≈, <}, we learn a model m^µ on the training data and use the data in the validation set D^µ_Val to refine the learning process. This procedure yields three families of models, namely M^>, M^≈ and M^<.
M^µ may contain one or many models m^µ, depending on the chosen optimization procedure, e.g., repeating the subsampling process multiple times or optimizing different quality metrics q. The resulting models will later be combined to form an ensemble, but first, we shed some light on how to choose the probability offset α = |ε|, which is required for our method.
3.3 Class Imbalance and the Probability
Offset
At first glance, choosing the probability offset α might seem infeasible. How should one even guess which α ∈ (0, 1) is appropriate? Surprisingly, it turns out that class imbalance simplifies this problem! In fact, the stronger the class imbalance, the tighter is the range of reasonable probability offsets. Let us formalize this result.
Lemma 1 (Probability of Probability Offsets). Let D be a data set with empirical class distribution P̃(y). We denote the class ratio w.r.t. the minority class by r = min{P(Y = 0)/P(Y = 1), P(Y = 1)/P(Y = 0)}. There exist c, c₀ such that for all α > 0, the probability of the event

    |P̃(Y = 0) − P(Y = 0)| ≥ α

is upper bounded by

    exp(−α²c / (r + αc₀) + ln 2).
Proof. For a sequence of independent random variables Z_1, Z_2, ..., Z_N with zero mean and |Z_i| ≤ K, we know from the Bernstein inequality (cf. Theorem 7.30 in (Foucart and Rauhut, 2013)) that

    P(|Σ_{i=1}^N Z_i| ≥ ε) ≤ 2 exp(−(ε²/2) / (Σ_{i=1}^N V[Z_i] + Kε/3)).

Consider the random data set D = {(X^(i), Y^(i))}_{1≤i≤N}. Using the above inequality with ε = |D|α and Z_i = 𝟙_{Y^(i)=1} − E[𝟙_{Y^(i)=1}], and observing that E[𝟙_{Y^(i)=1}] = P(Y = 1), yields

    P(|P̃(Y = 1) − P(Y = 1)| ≥ α) ≤ 2 exp(−(αN)²/2 / (Σ_{i=1}^N V[Z_i] + K(αN)/3)),

where the inequality on the left-hand side has been divided by |D| = N, and V[Z_i] is the variance of Z_i. Since V[R + c] = V[R] for any random variable R, we have V[Z_i] = V[𝟙_{Y^(i)=1}]. Since 𝟙_{Y^(i)=1} is a Bernoulli random variable, its variance is P(Y = 0)P(Y = 1), and thus V[Z_i] = P(Y = 0)P(Y = 1). By construction, r is an upper bound on the variance, i.e., V[Z_i] ≤ r. Finally, noticing that |Z_i| ≤ 1 proves the lemma with c = N/2 and c₀ = 1/3.
The lemma tells us that the probability of a large deviation α decreases exponentially fast as a function of α, damped by the class imbalance, which is measured by r. Let us consider as an example the case where P(Y = 1) = 0.9, i.e., 10% of all data points belong to the minority class. By applying Lemma 1 to this scenario while keeping all other quantities fixed, we see that a deviation of more than 5% happens in at most 0.011285% of all data sets. A deviation of at least 2.5% occurs for at most 14.615% of all data sets. Thus, we see that large α values (say, > 5%) are very unlikely in the presence of class imbalance.
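The figures in this example can be checked numerically. The sketch below plugs the constants from the proof (c = N/2, c₀ = 1/3) into the bound of Lemma 1; the sample size N = 1000 is our assumption, chosen because it reproduces the percentages quoted above.

```python
import math

def deviation_bound(alpha, n, p_minority):
    """Lemma 1: upper bound on P(|P~(Y=0) - P(Y=0)| >= alpha)."""
    p1 = 1.0 - p_minority
    r = min(p_minority / p1, p1 / p_minority)  # class ratio w.r.t. minority class
    c, c0 = n / 2.0, 1.0 / 3.0                 # constants from the proof
    return math.exp(-alpha**2 * c / (r + alpha * c0) + math.log(2))

# 10% minority class (P(Y=1) = 0.9); N = 1000 is our assumption
print(deviation_bound(0.05, 1000, 0.1))    # ~0.000113, i.e. 0.011285% of data sets
print(deviation_bound(0.025, 1000, 0.1))   # ~0.1462,   i.e. 14.615% of data sets
```

Note how the bound tightens as p_minority shrinks: stronger imbalance means a smaller r and hence a smaller probability of large offsets.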
3.4 The Ensemble Framework
The procedure that we described at the beginning of this section yields three diverse types of classifiers M^>, M^≈ and M^<, each optimized with respect to a different validation set. The resulting classifiers are combined into an ensemble which is prepared to handle different scenarios of class discrepancy and different quality metrics q. The ensemble weights could be considered as unobserved random variables, estimated via expectation-maximization (Soltanmohammadi et al., 2015). However, such techniques assume the existence of a context and require the classifier accuracy to be Lipschitz continuous. In contrast, we propose to interpret µ ∈ {>, ≈, <} and q as random variables. Let m(x) = ŷ be the prediction of a single model; the outcome of the ensemble is

    E[m(x)] = Σ_{µ∈{>,≈,<}} P(µ) Σ_{q∈Q} P(q | µ) Σ_{t=1}^T P(t | µ, q) m^µ_{q,t}(x),   (1)

where Q is a set of quality metrics, e.g., Q = {precision, recall}, and T is the number of models that we have generated for each combination of µ and q. Our experimental results will show that small values of T suffice. The values of P(µ), P(q | µ), and P(t | µ, q) control how the specific models m^µ_{q,t} influence the final outcome. The probability P(µ) expresses our knowledge on how the class distribution in the training set differs from the true one. According to Lemma 1, all directions are equally likely; our experiments validate this. The probabilities P(q | µ) can be interpreted as the importance of the quality measure q for the actual prediction task. As such, the importance depends on D, Q, or even on µ, and cannot be derived in general. Finally, P(t | µ, q) are the classical (normalized) ensemble weights. Since each weak model depends on µ, q, and t, the outcome may be interpreted as a mixture of mixtures of experts.
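A minimal sketch of the combination rule in Eq. (1) may look as follows. The constant dummy classifiers, the dictionary layout, and the concrete weight values are our illustrative assumptions; in the actual framework each m^µ_{q,t} is a trained model.

```python
# ConstModel is a dummy stand-in for a trained ensemble member m^mu_{q,t}.
class ConstModel:
    def __init__(self, label):
        self.label = label
    def predict(self, x):
        return self.label          # a real member would classify x here

def ensemble_predict(x, models, p_mu, p_q, p_t):
    """models[mu][q] is a list of T members; weights follow Eq. (1)."""
    score = sum(p_mu[mu] * p_q[mu][q] * p_t[(mu, q)][t] * m.predict(x)
                for mu, family in models.items()
                for q, members in family.items()
                for t, m in enumerate(members))
    return int(score >= 0.5)       # threshold the ensemble output at 1/2

mus, metrics = ('>', '~', '<'), ('precision', 'recall')
models = {mu: {q: [ConstModel(1)] for q in metrics} for mu in mus}
p_mu = {mu: 1 / 3 for mu in mus}                             # P(mu) = 1/3
p_q = {mu: {'recall': 0.1, 'precision': 0.9} for mu in mus}  # e.g. P~(Y=1) = 0.1
p_t = {(mu, q): [1.0] for mu in mus for q in metrics}        # T = 1
print(ensemble_predict(None, models, p_mu, p_q, p_t))  # all members vote 1 -> 1
```

Because the weights P(µ), P(q | µ) and P(t | µ, q) are each normalized, the score is a convex combination of member outputs and thresholding it at 1/2 yields a valid binary prediction.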
4 EXPERIMENTAL
DEMONSTRATION
We conduct an experimental evaluation to assess the
behavior of our proposed method. Our experiments
are driven by the following questions: (Q1) Do
the empirical results confirm our theoretical findings
about the choice of α to design the class imbalance
scenarios? (Q2) How does our ensemble construction
perform, compared to state-of-the-art methods? To
this end, we consider six case studies.
4.1 Case Studies
The first four sets, namely, “Pima”, “Yeast”, “Glass”
and “Haberman” are benchmark data sets, which are
publicly available on the UCI repository (Lichman,
The “Welding” and “Milling” data sets describe real-world industrial processes, where the resulting manufactured pieces have to be classified into {ok, not-ok}, given a set of process features.
Table 1: Summary description of the imbalanced data sets
used in the experimental study.
Data-set #Ex. #Atts. (min% : maj%)
Pima 768 8 (34.90 : 65.10)
Yeast 1484 8 (10.99 : 89.01)
Glass 214 10 (07.98 : 92.02)
Haberman 306 3 (26.56 : 73.44)
Welding 809 5 (09.00 : 91.00)
Milling 5700 4 (11.00 : 89.00)
Table 1 summarizes the properties of the selected data sets, namely the number of examples (#Ex.), the number of attributes (#Atts.), and the percentage of examples in the minority and majority class (min% : maj%).
4.2 Experimental Setup
All experiments are cross-validated and conducted using the R software. The data sets of the six case studies are split into a training, a validation and a test set for each cross-validation fold: 60% of the data is used for training, while the remaining 40% is equally split between validation and testing. The baseline classifier in this work is the Decision Tree (DT) (Breiman et al., 1984). Our framework is compared to three baseline models, namely the Decision Tree (Breiman et al., 1984), Random Forest (RF) (Breiman, 2001), and the Gradient Boosting Machine (GBM) (Friedman, 2001). The parameters of both GBM and RF are tuned using a grid search over a set of input parameters. In addition to the plain baseline methods, our method is also compared to various state-of-the-art methods for handling class imbalance. The names of the methods and the corresponding literature references are detailed in Table 2. An additional ensemble of DTs combined with random under-sampling is also added for comparison, denoted by “Ens”. The number of trees in this ensemble is set to 6 (the maximum number of trees used in our proposed method).
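The split described above can be sketched as follows. This is a plain-Python illustration of stratified proportionate allocation under our own assumptions (single split, no folds); the paper itself uses R with cross-validation.

```python
import random

def stratified_split(data, labels, fracs=(0.6, 0.2, 0.2), seed=0):
    """Split into train/validation/test while preserving class proportions
    (stratified proportionate allocation)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(data, labels):
        by_class.setdefault(y, []).append((x, y))
    train, val, test = [], [], []
    for items in by_class.values():
        rng.shuffle(items)                    # random order within each class
        n_train = round(fracs[0] * len(items))
        n_val = round(fracs[1] * len(items))
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test

data, labels = list(range(100)), [1] * 10 + [0] * 90   # ~10% minority class
train, val, test = stratified_split(data, labels)
print(len(train), len(val), len(test))                 # 60 20 20
```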
Our Method. Three validation sets D^>_Val, D^≈_Val, D^<_Val are designed from the validation set to represent each of the three possible scenarios of class imbalance. We investigate two sets of quality metrics Q: {precision, recall} and {F1-score}. For simplicity, the number of trees T per quality measure is set to 1 in these experiments, which implies P(t | µ, q) = 1. So for each scenario, we have one tree for each component of Q. For example, for the first setting of Q = {precision, recall}, we have two trees: one aiming to maximize the recall and the other aiming to maximize the precision. In the first setting, P(recall | µ) is set to P̃(Y = 1), with P(precision | µ) = 1 − P(recall | µ). The rationale behind this choice is to interpret the frequency of the minority class as a proxy for the importance of the recall, where we estimate the frequencies on the training data. Since in a realistic application of our method each of the three cases in {>, ≈, <} is equally likely, we set P(µ) = 1/3, which is a generic choice in the absence of any prior knowledge about the class proportions in the test set. Predictions are performed by thresholding the output of the ensemble (Eq. 1) at 1/2.
4.3 Results
Table 2 reports descriptive statistics of the experimental results on the F1-score for both the majority and the minority class, per method and case study.
4.3.1 Impact of α
Experiments varying α on the six data sets confirm our theoretical findings in Lemma 1, which indicate that large α values are unlikely. In fact, the maximum α value depends on the particular data set. E.g., on the Glass data, the frequency of the minority class is 7.98%; thus, we cannot set α above this value. In contrast, the frequency of the minority class of the Pima data is 34.9%, which implies that we can drive α up to 30% without problems. On “Yeast”, “Haberman”, and “Milling”, the AUC is almost constant w.r.t. α. On “Welding”, we can observe a clear decrease in AUC for an increasing α. On
Table 2: Experimental results for our six case studies. Abbreviations: DT = decision tree (Breiman et al., 1984), GBM = gradient boosting machine (Friedman, 2001), RF = random forest (Breiman, 2001), DT/GBM/RF+U = DT/GBM/RF + under-sampling (Japkowicz, 2000), DT/GBM/RF+O = DT/GBM/RF + over-sampling (Japkowicz, 2000), DT/GBM/RF+S = DT/GBM/RF + SMOTE (Chawla et al., 2002), DT/RF+UBA = DT/RF + UnderBagging (Barandela et al., 2003), DT/RF+AD = DT/RF + AdaBoost.M2 (Schapire and Singer, 1999), DT/RF+SBA = DT/RF + SMOTEBagging (Wang and Yao, 2009), DT/RF+SBO = DT/RF + SMOTEBoost (Chawla et al., 2003), DT/RF+RUS = DT/RF + RUSBoost (Seiffert et al., 2010), PRO1 = our ensemble using precision and recall, PRO2 = our ensemble using F1-score, Min. = F1 for minority class, Maj. = F1 for majority class. Each cell reports Min. / Maj. as mean ± standard deviation.

Method | Welding (Min. / Maj.) | Pima (Min. / Maj.) | Glass (Min. / Maj.) | Haberman (Min. / Maj.) | Milling (Min. / Maj.) | Yeast (Min. / Maj.)
DT | 20.34±13.43 / 93.18±1.9 | 54.97±6.26 / 81.19±3.91 | 6.67±11.55 / 93.52±2.36 | 29.47±15.95 / 80.51±6.76 | 18.94±10.32 / 87.69±3.01 | 67.59±11.64 / 97.1±0.56
GBM | 11.57±4.02 / 94.98±0.59 | 59.22±3.3 / 83.09±3.6 | 0.0±0.0 / 93.97±1.71 | 32.96±13.97 / 81.53±2.58 | 12.78±12.51 / 90.42±3.57 | 56.15±18.43 / 96.78±0.99
RF | 12.83±9.95 / 95.53±0.87 | 62.4±2.21 / 85.02±1.79 | 6.67±11.55 / 93.89±2.47 | 21.65±19.89 / 77.85±9.82 | 5.13±8.88 / 90.85±2.02 | 67.46±11.05 / 97.58±0.29
DT+U | 24.17±12.51 / 88.43±6.22 | 56.81±2.42 / 76.77±2.76 | 13.47±11.84 / 89.55±6.46 | 28.59±17.9 / 71.62±10.95 | 26.86±4.32 / 83.75±7.75 | 63.66±23.14 / 94.9±3.37
GBM+U | 24.6±7.73 / 93.69±1.84 | 62.6±1.86 / 80.56±3.07 | 6.06±10.5 / 92.37±2.15 | 37.33±5.33 / 74.26±6.6 | 18.91±6.27 / 81.51±6.33 | 72.57±8.89 / 97.06±0.88
RF+U | 23.29±10.33 / 94.61±1.94 | 62.96±4.87 / 79.91±2.82 | 5.56±9.62 / 91.53±3.8 | 28.38±24.6 / 78.45±9.81 | 26.19±2.06 / 85.65±5.13 | 71.33±20.06 / 96.65±2.4
DT+O | 18.72±8.94 / 90.02±1.66 | 59.6±5.09 / 72.32±3.98 | 14.07±12.24 / 93.08±2.45 | 35.47±12.99 / 72.27±2.63 | 22.33±11.54 / 74.55±2.65 | 65.54±18.32 / 95.93±1.95
GBM+O | 15.43±11.22 / 94.75±0.66 | 59.04±2.96 / 80.28±3.19 | 0.0±0.0 / 93.22±1.75 | 35.81±13.26 / 78.37±2.14 | 23.02±3.44 / 79.62±1.82 | 64.23±10.46 / 96.65±1.03
RF+O | 13.69±6.03 / 94.83±1.08 | 64.77±4.59 / 82.82±2.25 | 0.0±0.0 / 93.97±1.71 | 22.8±20.35 / 82.88±1.89 | 31.11±10.18 / 85.52±3.21 | 66.72±17.23 / 97.15±0.78
DT+S | 19.27±3.74 / 87.86±1.37 | 36.07±5.38 / 81.78±2.02 | 18.14±3.31 / 0.0±0.0 | 34.02±1.4 / 56.45±20.48 | 21.67±20.21 / 85.15±3.38 | 77.01±7.38 / 97.48±0.65
GBM+S | 11.85±6.29 / 94.77±1.36 | 57.53±5.04 / 82.41±0.62 | 13.01±12.28 / 60.58±28.81 | 40.49±9.53 / 49.84±25.36 | 6.67±11.55 / 90.51±1.26 | 61.63±21.63 / 96.8±1.35
RF+S | 0.0±0.0 / 95.22±0.83 | 57.33±1.32 / 81.57±5.51 | 21.9±7.19 / 57.8±9.21 | 24.2±15.89 / 68.55±10.76 | 8.33±14.43 / 90.87±2.91 | 73.0±13.48 / 97.53±0.62
Ens | 14.45±1.84 / 8.0±2.88 | 21.23±36.78 / 81.22±2.66 | 0.0±0.0 / 93.97±1.71 | 43.89±4.85 / 80.47±5.2 | 14.56±14.29 / 45.07±36.02 | 71.05±13.6 / 96.85±0.67
DT+UBA | 25.02±4.46 / 91.87±2.28 | 68.32±3.58 / 82.31±2.32 | 0.0±0.0 / 94.73±1.11 | 42.21±9.33 / 69.27±6.4 | 26.57±5.47 / 79.28±4.87 | 57.95±19.71 / 93.46±2.38
DT+AD | 7.39±9.22 / 94.79±0.93 | 64.07±0.28 / 84.64±1.36 | 7.41±12.83 / 93.17±1.36 | 40.61±9.2 / 79.62±6.9 | 14.7±15.43 / 87.02±3.43 | 65.62±10.21 / 97.01±0.66
DT+SBA | 22.11±2.54 / 71.87±5.22 | 65.28±3.61 / 84.3±1.5 | 29.76±15.02 / 90.86±1.59 | 42.05±7.2 / 77.87±5.95 | 29.67±6.98 / 58.0±3.82 | 77.6±6.49 / 97.6±0.53
DT+RUS | 15.17±10.5 / 93.67±1.48 | 63.25±2.76 / 78.8±1.99 | 7.14±12.37 / 63.52±55.02 | 38.76±10.58 / 57.84±29.71 | 22.65±12.99 / 83.34±1.35 | 64.5±14.72 / 95.2±2.25
DT+SBO | 16.47±2.49 / 0.0±0.0 | 62.23±3.3 / 74.8±1.39 | 14.39±12.92 / 89.47±4.23 | 42.39±8.75 / 77.56±5.83 | 22.65±12.99 / 83.34±1.35 | 68.57±11.06 / 97.09±0.57
RF+UBA | 12.63±18.91 / 95.47±1.1 | 65.2±4.6 / 78.65±1.67 | 23.06±4.12 / 59.93±17.86 | 41.01±6.66 / 61.44±24.54 | 23.53±11.76 / 86.3±2.12 | 60.41±20.0 / 93.9±2.74
RF+AD | 5.9±6.84 / 95.41±0.93 | 61.62±0.8 / 84.36±1.89 | 8.33±14.43 / 94.31±1.32 | 32.8±18.72 / 80.71±4.31 | 14.7±15.43 / 87.02±3.43 | 69.37±20.63 / 96.41±1.58
RF+SBA | 24.62±4.67 / 76.19±4.18 | 62.75±4.03 / 84.64±1.06 | 13.46±12.61 / 91.48±3.58 | 27.51±24.04 / 81.02±5.74 | 28.33±7.11 / 49.95±14.79 | 71.59±14.17 / 97.3±1.04
RF+RUS | 8.66±11.21 / 95.41±1.14 | 67.27±3.55 / 81.41±0.76 | 21.8±3.43 / 83.04±6.79 | 40.25±7.21 / 64.79±18.8 | 16.73±14.58 / 86.54±1.66 | 64.01±17.16 / 95.23±1.65
RF+SBO | 16.47±2.49 / 0.0±0.0 | 63.15±6.92 / 78.23±3.84 | 21.67±20.21 / 90.59±2.51 | 35.5±19.59 / 80.2±5.67 | 14.86±12.93 / 84.86±1.73 | 69.91±12.65 / 97.21±0.98
PRO1 | 38.65±4.09 / 94.17±1.17 | 67.82±2.07 / 83.55±1.99 | 26.19±2.06 / 93.86±0.66 | 46.32±4.17 / 78.79±6.43 | 42.21±14.73 / 87.92±6.56 | 83.37±10.0 / 98.5±0.71
PRO2 | 37.84±5.77 / 94.18±1.19 | 67.93±1.35 / 84.08±3.54 | 26.46±3.67 / 93.85±1.34 | 47.47±6.14 / 78.89±5.9 | 42.77±14.46 / 86.81±8.47 | 82.37±18.89 / 98.61±0.96
“Glass” and “Pima”, changing α induces some jitter in the AUC. However, there is neither an increasing nor a decreasing trend; all results have the same order of magnitude. We hence conclude that small α values suffice and that our method is not overly sensitive to the specific choice of α. The answer to question Q1 is thus in the affirmative, and we fix α = 3% for the remaining experiments, i.e., no optimization over α is performed.
4.3.2 Classification Performance
The results of our large comparison of classification methods are shown in Table 2. In most real-world applications, the F1-score of the minority class (“Min.”) is of exceptional importance. Not surprisingly, in almost all cases, the baseline methods DT, RF and GBM are outperformed by methods for imbalanced data. There is no clear winner among the state-of-the-art under-sampling, over-sampling, and SMOTE techniques; which strategy delivers the best minority F1-score depends on the particular data set. It is thus especially remarkable that our method achieves the best and second-best F1-score for the minority class, depending on the choice of quality measures Q, on five of six data sets. Only decision tree UnderBagging achieves a better minority F1-score on the “Pima” data. Nevertheless, note that our proposed method exhibits a lower standard error; thus, a “typical” run of our method might outperform DT+UBA. In terms of the majority (“Maj.”) F1-score, our method is outperformed in 5 of 6 cases. However, in two cases, the best result is delivered by a plain random forest without any technique to handle class imbalance, whose corresponding minority F1-score is thus far from optimal. The other three cases are led by RF + over-sampling, RF + SMOTE, and RF + AdaBoost, all three variants of the random forest. However, the decline in majority F1-score of our proposed method is below 3 percent in four of five cases. Assuming that the F1-score on the minority class is the most important measure, we answer Q2 in the affirmative.
5 CONCLUSION
In this paper, we presented a new ensemble method
for classification in the presence of imbalanced
classes. We started by reviewing the state-of-the-art
in the area of classification with imbalanced classes.
Real data sets are always finite which leads to a cor-
ruption of empirical class frequencies. Moreover, we
proved that these corruptions are unlikely to be large
in case of class imbalance and proposed a method
to correct for small corruptions. The correction is
performed by generating specialized validation sets
which correspond to different scenarios. Each vali-
dation set may then be used to induce an ensemble
of classifiers. We discussed how the classifiers for
different scenarios can be combined into an ensem-
ble and proposed different choices for the ensemble
weights. In an experimental demonstration, we val-
idated our theoretical findings, and showed that our
method outperforms several state-of-the-art methods
in terms of F1-score. Since our insights about class
imbalance and erroneous empirical class frequencies
are completely new, our work may serve as the basis
for multiple new research directions.
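The combination step summarized above, i.e., merging the outputs of per-scenario ensembles via a weighted average, can be sketched as follows. This is a hypothetical illustration under simplifying assumptions (three scenarios, two classes, placeholder weights such as validation-set quality scores), not the exact construction from the paper:

```python
def combine_scenarios(scenario_probs, weights):
    """Weighted average of class-probability vectors, one per scenario.

    scenario_probs: scenario_probs[s][c] = P(class c | ensemble of scenario s)
    weights: one non-negative weight per scenario (e.g. a validation quality
             measure Q); they are normalized by their sum below.
    """
    total = sum(weights)
    n_classes = len(scenario_probs[0])
    # Weighted average probability for each class across scenarios.
    avg = [
        sum(w * p[c] for w, p in zip(weights, scenario_probs)) / total
        for c in range(n_classes)
    ]
    # Final prediction: class with the highest averaged probability.
    return avg.index(max(avg))

# Three imbalance scenarios, two classes; weights reflect scenario quality.
probs = [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
weights = [0.5, 0.3, 0.2]
print(combine_scenarios(probs, weights))  # prints 0
```

Note that although two of the three scenario ensembles favor class 1, the higher weight of the first scenario tips the weighted average toward class 0, showing why the choice of ensemble weights matters.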
REFERENCES
Barandela, R., Valdovinos, R., and Sánchez, J. (2003). New
applications of ensembles of classifiers. Pattern Anal-
ysis & Applications, 6(3):245–256.
Batista, G. E., Carvalho, A. C., and Monard, M. C. (2000).
Applying one-sided selection to unbalanced datasets.
In Proc. of MICAI, pages 315–325. Springer.
Breiman, L. (2001). Random forests. Machine Learning,
45(1):5–32.
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A.
(1984). Classification and regression trees. CRC
press.
Cateni, S., Colla, V., and Vannucci, M. (2014). A method
for resampling imbalanced datasets in binary classifi-
cation tasks for real-world problems. Neurocomput-
ing, 135:32–41.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,
W. P. (2002). SMOTE: synthetic minority over-
sampling technique. Journal of Artificial Intelligence
Research, 16:321–357.
Chawla, N. V., Lazarevic, A., Hall, L. O., and Bowyer,
K. W. (2003). SMOTEBoost: Improving prediction of
the minority class in boosting. In Proc. of the 7th
PKDD, pages 107–119.
Foucart, S. and Rauhut, H. (2013). A Mathematical Intro-
duction to Compressive Sensing. Springer New York.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic
generalization of on-line learning and an application
to boosting. Journal of Computer and System Sci-
ences, 55(1):119–139.
Freund, Y., Schapire, R. E., et al. (1996). Experiments with
a new boosting algorithm. In Proc. of ICML 1996,
volume 96, pages 148–156.
Friedman, J. H. (2001). Greedy function approximation: A
gradient boosting machine. The Annals of Statistics,
29(5):1189–1232.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H.,
and Herrera, F. (2012). A review on ensembles for
the class imbalance problem: bagging-, boosting-, and
hybrid-based approaches. IEEE Transactions on Sys-
tems, Man, and Cybernetics, Part C (Applications and
Reviews), 42(4):463–484.
Gao, M., Hong, X., Chen, S., and Harris, C. J. (2012).
Probability density function estimation based over-
sampling for imbalanced two-class problems. In
Proc. of the 2012 International Joint Conference on
Neural Networks (IJCNN), pages 1–8. IEEE.
Gao, X., Chen, Z., Tang, S., Zhang, Y., and Li, J. (2016).
Adaptive weighted imbalance learning with applica-
tion to abnormal activity recognition. Neurocomput-
ing, 173:1927–1935.
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue,
H., and Bing, G. (2017). Learning from class-
imbalanced data: Review of methods and applica-
tions. Expert Systems with Applications, 73:220–239.
Hansen, L. K. and Salamon, P. (1990). Neural network en-
sembles. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 12(10):993–1001.
Japkowicz, N. (2000). The class imbalance problem: Sig-
nificance and strategies. In Proc. of the Int. Conf. on
Artificial Intelligence.
Kim, S., Kim, H., and Namkoong, Y. (2016). Ordinal clas-
sification of imbalanced data with application in emer-
gency and disaster information services. IEEE Intelli-
gent Systems, 31(5):50–56.
Lichman, M. (2013). UCI machine learning repository.
Peng, Y. and Yao, J. (2010). AdaOUBoost: Adaptive over-
sampling and under-sampling to boost the concept
learning in large scale imbalanced data sets. In Proc.
of the International Conference on Multimedia Infor-
mation Retrieval, pages 111–118. ACM.
Schapire, R. E. and Singer, Y. (1999). Improved boosting al-
gorithms using confidence-rated predictions. Machine
Learning, 37(3):297–336.
Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., and Napoli-
tano, A. (2010). RUSBoost: A hybrid approach to al-
leviating class imbalance. IEEE Transactions on Sys-
tems, Man, and Cybernetics - Part A: Systems and Hu-
mans, 40(1):185–197.
Soltanmohammadi, E., Naraghi-Pour, M., and van der
Schaar, M. (2015). Context-based unsupervised data
fusion for decision making. In Proc. of the 32nd
ICML, pages 2076–2084.
Sun, Z., Song, Q., Zhu, X., Sun, H., Xu, B., and Zhou,
Y. (2015). A novel ensemble method for classifying
imbalanced data. Pattern Recognition, 48(5):1623–
1637.
Tian, J., Gu, H., and Liu, W. (2011). Imbalanced classifica-
tion using support vector machine ensemble. Neural
Computing and Applications, 20(2):203–209.
Wang, S. and Yao, X. (2009). Diversity analysis on im-
balanced data sets by using ensemble models. In
Proc. of the IEEE Symposium on Computational In-
telligence and Data Mining (CIDM 2009), pages
324–331. IEEE.
Zhang, H. and Li, M. (2014). RWO-Sampling: A random
walk over-sampling approach to imbalanced data clas-
sification. Information Fusion, 20:99–116.
Zhou, Z.-H. (2012). Ensemble Methods: Foundations and
Algorithms. Machine Learning & Pattern Recogni-
tion. CRC Press.