Balanced Sampling Method for Imbalanced Big Data Using AdaBoost
Hong Gu and Tao Song
School of Control Science and Engineering, Dalian University of Technology, Dalian, China
Keywords:
Big data, Class Imbalance Learning, Sampling, Boosting, Bagging, Parallel Algorithms.
Abstract:
With the arrival of the era of big data, processing large volumes of data at much faster rates has become
more urgent and has attracted increasing attention. Furthermore, many real-world data applications present
severe class distribution skews, and the underrepresented classes are usually the ones of concern to researchers.
Variants of the boosting algorithm have been developed to cope with the class imbalance problem. However,
due to the inherent sequential nature of boosting, these methods cannot be directly applied to efficiently
handle large-scale data. In this paper, we propose a new parallelized version of boosting, AdaBoost.Balance,
to deal with imbalanced big data. It adopts a new balanced sampling method which combines undersampling
with oversampling and can be computed simultaneously by multiple computing nodes to construct a final
ensemble classifier. Consequently, it is easily implemented on big data parallel processing platforms such as
the MapReduce framework.
1 INTRODUCTION
Big data is a dynamic notion: data whose large quantity and complexity make it difficult to capture, store, manage, search, share, transfer, mine, analyze and visualize using existing data management tools, data processing applications and data mining approaches (Manyika et al., 2011). It is generated in many scientific and real-world areas such as genomics, proteomics, bioinformatics (Greene et al., 2014), telecommunications, health care, and the pharmaceutical and financial industries. Big data is characterized by volume, velocity, variety, veracity, validity, volatility, variability and value (Laney, 2001; del Río et al., 2014). Consequently, it is vital to develop faster and more suitable algorithms to efficiently deal with such massive amounts of data.
What makes this challenge even more difficult is that most of these big data also exhibit imbalanced class distributions; in other words, the number of examples representing one class is much lower than the number representing the other classes. Moreover, the minority class is usually the main interest and the more important one, i.e., there is an implicit assumption in class imbalance learning that misclassifying the minority class carries a higher cost than misclassifying the majority class. The imbalanced learning problem has become an important issue in the field of machine learning, and a large number of techniques have been developed to address it (Chawla et al., 2004; He and Garcia, 2009). Depending on how they deal with class imbalance, these methods can be categorized into three groups: algorithm level approaches, data level techniques and cost-sensitive methods (del Río et al., 2014; Galar et al., 2012). The algorithm level approaches create entirely new algorithms or modify existing methods to pay more attention to positive samples (minority class) than to negative samples (majority class). Data level techniques consist of a preprocessing step that modifies an imbalanced data set in order to provide a balanced distribution, usually by oversampling methods, undersampling methods or hybrid methods that combine the two (Estabrooks et al., 2004). Cost-sensitive methods combine ideas from both the data and algorithm level approaches by using cost matrices to account for the costs of misclassification; the usual practice is to assign higher misclassification costs to instances of the minority class and to minimize the overall cost (Zadrozny et al., 2003; Sun et al., 2007).
In addition to these methods, a large number of
techniques based on ensemble learning have been
developed to solve the problem of class imbalance
(Galar et al., 2012). Ensemble methods train multiple weak learners and then combine them into a strong learner by combination methods such as averaging, voting and learning, and it is well known that an ensemble is usually significantly more accurate than a single learner. Ensemble diversity (the
difference among the individual learners) is neces-
sary and plays an important role in ensemble learning,
since there would be no performance improvement if
identical individual learners were combined. Boost-
ing (Schapire, 1990; Freund and Schapire, 1995) and
Bagging (Breiman, 1996) are the most representative
among the ensemble learning algorithms and have al-
ready achieved great success in many real-world ap-
plications. The boosting family of algorithms grew out of the proof of the weak learnability theorem (Schapire, 1990), i.e., of the conjectured equivalence of strong and weak learnability (Kearns and Valiant, 1994). Bagging contains two key elements: bootstrap and aggregation. It adopts bootstrap sampling
(Efron and Tibshirani, 1994) to generate the data sub-
sets for training different base learners and applies
voting or averaging for aggregating the outputs of the
base learners.
In general, the members of the boosting family are sequential ensemble methods, where the base learners are generated sequentially. The basic motivation of sequential methods is to exploit the dependence between the base learners: the later learners focus more on the mistakes made by the former ones. In contrast, the base learners of parallel ensemble methods, represented by bagging, are generated in parallel. The basic motivation of parallel ensemble methods is to exploit the independence between the base learners, since the error can be reduced dramatically by combining independent base learners (Zhou, 2012). The parallel architecture of bagging makes it easier to achieve fast, scalable and parallel implementations using the MapReduce framework (Dean and Ghemawat, 2008).
MapReduce is a programming model and an associated implementation for processing and generating large data sets. It abstracts the computation into two phases: Map divides the original set into independent subsets and distributes them to worker nodes; Reduce combines the partial solutions to form the final output. However, class imbalance problems persist when MapReduce is applied directly to imbalanced big data, because the subsets obtained in the Map phase still present severe class distribution skews.
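To make the two phases concrete, the following minimal Python sketch (our illustration only, not the paper's implementation; a real deployment would use Hadoop or Spark) emulates a Map phase that partitions the data across worker processes and a Reduce phase that aggregates their partial results:

```python
# Minimal single-machine sketch of the Map/Reduce idea: partition, compute
# per-worker partial results in parallel, then combine them.
from multiprocessing import Pool

import numpy as np

def map_phase(data, n_workers):
    """Split the data set into independent subsets, one per worker."""
    return np.array_split(data, n_workers)

def work_on_subset(subset):
    """Placeholder for the per-worker computation (e.g., fitting a classifier)."""
    return subset.mean(axis=0)          # a trivial partial result

def reduce_phase(partial_results):
    """Combine the partial solutions into the final output."""
    return np.mean(partial_results, axis=0)

if __name__ == "__main__":
    data = np.random.rand(10_000, 5)
    subsets = map_phase(data, n_workers=4)
    with Pool(4) as pool:
        partials = pool.map(work_on_subset, subsets)
    print(reduce_phase(partials))
```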
In this work, we present a new technique which combines bagging and boosting to deal with two-class imbalanced big data, where there is a positive (minority) class with the lowest number of instances and a negative (majority) class with the highest number of instances. Specifically, a balanced resampling method is applied in the data preprocessing phase. It combines random oversampling with random undersampling or SMOTE (Chawla et al., 2011) to divide the original imbalanced set into independent balanced subsets, and it takes into account the relationship between the imbalance ratio (IR, the size of the majority class divided by that of the minority class) and the available computing resources (the number of computing nodes). The purpose of considering this relationship is to fully exploit the useful information of the majority class and to increase the ensemble diversity through SMOTE under a given computing budget. Boosting is then run on these independent balanced subsets in parallel. Finally, a single ensemble is obtained which is an ensemble of ensembles. In brief, we use bagging as the main ensemble learning method and train each bag using boosting. Consequently, our method combines the advantages of both boosting and bagging: boosting mainly reduces bias, while bagging mainly reduces variance (Maclin and Opitz, 2011). The experimental results indicate that our method can effectively deal with the class imbalance problem. In addition, due to its parallel nature, our method is easy to implement on the MapReduce platform.
2 MATERIALS AND METHODS
Our method is described below in terms of the following aspects: the AdaBoost algorithm, the Balance-Sampling algorithm, the AdaBoost.Balance algorithm (including AdaBoost.BSR and AdaBoost.BSS), the databases, and the performance evaluation criteria.
2.1 AdaBoost
The AdaBoost (Freund and Schapire, 1995) algorithm is a representative of the boosting family which resolves many practical issues of the early boosting algorithms. The input of AdaBoost is a training set of $n$ labeled samples $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in X$ ($X$ represents the instance space) and $y_i \in Y$ ($Y = \{-1, +1\}$ is the set of labels). AdaBoost repeatedly calls a given weak learner (base classifier); its main idea is to maintain a weight distribution over the training set. The weak learners only have to be slightly better than a random guess. The weight of sample $(x_i, y_i)$ in the $t$-th iteration is denoted $p_i^t$ ($t = 1, \ldots, T$, where $T$ is the number of iterations). Initially, the weights of all the examples are set equal ($1/n$), but the weights of the misclassified instances increase in each iteration, forcing the weak learner to focus on the difficult examples in the training set. The task of the weak learner is to find a suitable weak hypothesis $h_t : X \to [0, 1]$ based on the distribution $\vec{P}^t$. When coping with the two-class classification problem, the range of $h_t$ is restricted to the two values $\{-1, +1\}$, and the learner's task is to minimize the error $\varepsilon_t = \Pr_{i \sim \vec{P}^t}[h_t(x_i) \neq y_i]$. Once $h_t$ is obtained, AdaBoost selects a parameter $\alpha_t \in \mathbb{R}$ which intuitively measures the correctness, or the degree of importance, of $h_t$, i.e., a learner with a lower error rate gets a higher weight. The final hypothesis $H$ is obtained after $T$ cycles by combining the $T$ weak hypotheses through a weighted majority vote. Typically, AdaBoost sets $\alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t}$ for the two-class classification problem. In general, AdaBoost is an ensemble learning method which iteratively induces a strong classifier from a pool of weak hypotheses.
Algorithm 1: AdaBoost.
Input: Training set of $n$ labeled samples $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$;
    number of iterations $T$;
    weak learning algorithm WeakLearn
Output: The final classifier $H$
1: Initialize $w_i^1 = \frac{1}{n}$, $i = 1, \ldots, n$
2: for $t = 1, \ldots, T$ do
3:     Set the distribution $\vec{P}^t = \vec{w}^t / \sum_{i=1}^{n} w_i^t$
4:     Call WeakLearn, providing it with the distribution $\vec{P}^t$; get back a hypothesis $h_t : X \to [0, 1]$
5:     Calculate the error of $h_t$: $\varepsilon_t = \sum_{i=1}^{n} p_i^t \, |h_t(x_i) - y_i|$
6:     Set $\beta_t = (1 - \varepsilon_t) / \varepsilon_t$
7:     Set the weight of $h_t$: $\alpha_t = \frac{1}{2} \log \beta_t$
8:     Set the new weight vector: $w_i^{t+1} = w_i^t \, \beta_t^{\,1 - |h_t(x_i) - y_i|}$
9: end for
10: Set the ensemble's threshold: $\theta = \frac{1}{2} \sum_{t=1}^{T} \alpha_t$
11: Return $H(x) = \operatorname{sgn}\left( \sum_{t=1}^{T} \alpha_t h_t(x) - \theta \right)$
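As a concrete illustration, here is a minimal Python sketch of the loop above. The decision-stump WeakLearn (via scikit-learn), the restriction to ±1-valued hypotheses (under which the threshold $\theta$ vanishes) and the equivalent exponential re-weighting rule $w_i \leftarrow w_i e^{-\alpha_t y_i h_t(x_i)}$, which increases the weights of misclassified examples as described in the text, are our assumptions rather than the paper's prescriptions:

```python
# Minimal AdaBoost sketch following Algorithm 1, assuming the standard
# two-class specialization: h_t(x), y_i in {-1, +1}, decision stumps as
# WeakLearn, and theta = 0 for ±1-valued weak hypotheses.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=10):
    n = len(y)
    p = np.full(n, 1.0 / n)                 # line 1: uniform initial weights
    hypotheses, alphas = [], []
    for _ in range(T):                      # lines 2-9
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=p)        # line 4: WeakLearn under P^t
        pred = h.predict(X)
        eps = p[pred != y].sum()            # line 5: weighted error
        if eps == 0 or eps >= 0.5:          # degenerate weak learner; stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)   # lines 6-7
        p *= np.exp(-alpha * y * pred)      # line 8: up-weight mistakes
        p /= p.sum()                        # line 3: renormalize
        hypotheses.append(h)
        alphas.append(alpha)
    def H(X_new):                           # line 11 (theta = 0 here)
        scores = sum(a * h.predict(X_new) for a, h in zip(alphas, hypotheses))
        return np.sign(scores)
    return H
```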
2.2 The Balance-Sampling Algorithm
The purpose of using sampling methods in class imbalance problems is to modify an imbalanced data set by some mechanism so as to provide a balanced distribution for training a classifier. Among the sampling methods, random undersampling randomly removes majority class examples, or equivalently randomly selects a subset of the original majority class set whose size equals that of the minority class. Random oversampling randomly selects examples from the original minority class set and augments the original set by replicating the selected examples and adding them to it until its size equals that of the majority class set. However, randomly removing majority class examples may lose useful information, while randomly duplicating minority class examples may increase the risk of overfitting. To alleviate these problems, many advanced resampling methods have been developed. For example, the one-sided selection method (Kubat et al., 1997) selectively removes majority class examples in order to keep the more informative ones, and SMOTE reduces the risk of overfitting by generating synthetic examples, rather than exact copies, through random interpolation between a minority class example and one of its neighbors from the same class. In addition, it has been empirically shown that neither oversampling nor undersampling alone is always the best choice, and a combined strategy can be useful and effective (Estabrooks et al., 2004). Consequently, we introduce a way to combine the two resampling methods.
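To make the interpolation step concrete, the following minimal sketch (our simplification of the SMOTE idea, not the full algorithm of Chawla et al., 2011) synthesizes new minority examples:

```python
# Minimal SMOTE-style synthesis: each synthetic point lies on the segment
# between a minority example and one of its k nearest minority neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(P, n_synthetic, k=5, rng=np.random.default_rng(0)):
    """P: 2-D numpy array of minority class examples (len(P) > k)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(P)  # +1: each point is its own neighbor
    _, idx = nn.kneighbors(P)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(P))                     # pick a minority example
        j = rng.choice(idx[i][1:])                   # one of its k minority neighbors
        gap = rng.random()                           # interpolation factor in [0, 1)
        synthetic.append(P[i] + gap * (P[j] - P[i]))
    return np.array(synthetic)
```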
The proposed Balance-Sampling algorithm combines random oversampling with random undersampling or SMOTE to divide the original imbalanced set into independent balanced subsets. It determines the size of the majority class example subset for each computing worker from the ratio of the imbalance ratio $R$ to the number of computing workers $M$. When $\frac{R}{M} \geq 1$, the size of the majority class example subset for each computing worker is selected as $|N_k| = \frac{|N|}{M}$, and the size of the minority class example subset is set to $|P_k| = |N_k|$. Otherwise, the size of the majority class example subset is selected as $|N_k| = \frac{|N|}{R}$, and the size of the minority class example subset is the same as that of the original minority class example set. In all cases, the union of all the majority class example subsets contains as many majority class examples as possible, i.e., the Balance-Sampling algorithm can fully exploit the useful information of the majority class.
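As a worked example with our own numbers (matching the mf-zernike data set in Table 1): for $|N| = 1800$, $|P| = 200$ and $M = 4$ workers, we have $R = 9$ and $\frac{R}{M} = 2.25 \geq 1$, so each worker receives $|N_k| = \frac{1800}{4} = 450$ randomly undersampled majority examples, and the minority class is oversampled (by replication or SMOTE) from 200 up to $|P_k| = 450$. With $M = 16$ workers instead, $\frac{R}{M} < 1$, so each worker would receive $|N_k| = \frac{1800}{9} = 200$ majority examples together with the full minority set, $|P_k| = |P| = 200$.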
In lines 5 and 11 of Algorithm 2, random undersampling is used to select majority class examples from the original set $N$ for $N_k$ until the size $|N_k|$ is reached. According to the oversampling method used in line 6, the Balance-Sampling algorithm is instantiated as BSR (Balance-Sampling by Random oversampling) or BSS (Balance-Sampling by the SMOTE algorithm).
2.3 The AdaBoost.Balance Algorithm
The AdaBoost.Balance algorithm is built on the AdaBoost algorithm and the Balance-Sampling algorithm. It is an ensemble of ensembles: the Balance-Sampling algorithm first takes the place of bootstrap sampling, AdaBoost is then applied to train a base learner on each of the obtained balanced subsets, and finally weighted majority voting aggregates the outputs of the base learners. Consequently, our method combines the advantages of both boosting and bagging: boosting mainly reduces bias while bagging mainly reduces variance. Moreover, because AdaBoost.Balance takes bagging as the main ensemble learning method, it has the same parallel nature as bagging, which makes AdaBoost.Balance easy to implement on the MapReduce platform.
In line 1 of Algorithm 3, the AdaBoost.Balance algorithm is denoted AdaBoost.BSR when using the BSR method and AdaBoost.BSS when adopting the BSS method.
Algorithm 2: The Balance-Sampling algorithm.
Input: The training set $D = \{P, N\}$, where $P$ is the set of minority class examples and $N$ is the set of majority class examples;
    number of computing workers $M$
Output: Balanced training sets $D_{n_1}^1, \ldots, D_{n_M}^M$
1: Calculate the imbalance ratio (IR): $R = \frac{|N|}{|P|}$
2: if $\frac{R}{M} \geq 1$ then
3:     for $k = 1, \ldots, M$ do
4:         $|N_k| = \frac{|N|}{M}$, $|P_k| = |N_k|$
5:         Undersample $N_k$ from $N$
6:         Oversample $P_k$ from $P$
7:     end for
8: else
9:     for $k = 1, \ldots, M$ do
10:        $|N_k| = \frac{|N|}{R}$, $|P_k| = |P|$
11:        Undersample $N_k$ from $N$
12:        $P_k = P$
13:    end for
14: end if
15: Return $D_{n_1}^1, \ldots, D_{n_M}^M$, where $D_{n_k}^k = \{P_k, N_k\}$ and $n_k = |P_k| + |N_k|$
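A minimal Python sketch of Algorithm 2 follows (ours; the array inputs and the helper name balance_sampling are assumptions, and replacing the random oversampling in the first branch with the smote routine sketched earlier yields the BSS variant):

```python
# Sketch of the Balance-Sampling algorithm (the BSR variant). P and N are
# 2-D numpy arrays of minority and majority examples; for BSS, replace the
# random oversampling below with np.vstack([P, smote(P, n_k - len(P))]).
import numpy as np

def balance_sampling(P, N, M, rng=np.random.default_rng(0)):
    R = len(N) / len(P)                  # line 1: imbalance ratio
    subsets = []
    for _ in range(M):                   # one balanced bag per worker
        if R / M >= 1:                   # lines 2-7: enough majority data
            n_k = len(N) // M
            N_k = N[rng.choice(len(N), n_k, replace=False)]  # undersample N
            P_k = P[rng.choice(len(P), n_k, replace=True)]   # oversample P
        else:                            # lines 8-13: |N|/R equals |P|
            N_k = N[rng.choice(len(N), len(P), replace=False)]
            P_k = P                      # keep the full minority set
        subsets.append((P_k, N_k))
    return subsets
```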
Algorithm 3: The AdaBoost.Balance algorithm.
Input: Training set $D = \{P, N\}$;
    number of boosting iterations $T$;
    number of computing workers $M$;
    weak learning algorithm WeakLearn
Output: The final classifier $H$
1: $D_{n_1}^1, \ldots, D_{n_M}^M \leftarrow$ Balance-Sampling($D$)
2: for $k = 1, \ldots, M$ do
3:     $H_k(x) \leftarrow$ AdaBoost($D_{n_k}^k$, $T$)
4: end for
5: Return $H(x) = \operatorname{sgn}\left( \sum_{k=1}^{M} \sum_{t=1}^{T} \alpha_t^k h_t^k(x) - \sum_{k=1}^{M} \theta_k \right)$
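Given the balance_sampling and adaboost sketches above, Algorithm 3 reduces to a short driver. Here we assume a variant adaboost_score that returns the raw weighted sum $\sum_t \alpha_t h_t(x)$ instead of its sign (for ±1-valued weak hypotheses the thresholds $\theta_k$ vanish); in deployment, each loop iteration would run on its own worker node:

```python
# Sketch of Algorithm 3 on top of the balance_sampling and adaboost sketches
# above; adaboost_score is our assumed variant returning sum_t alpha_t*h_t(x).
import numpy as np

def adaboost_balance(P, N, M=4, T=10):
    scorers = []
    for P_k, N_k in balance_sampling(P, N, M):   # line 1: balanced bags
        X = np.vstack([P_k, N_k])
        y = np.concatenate([np.ones(len(P_k)), -np.ones(len(N_k))])
        scorers.append(adaboost_score(X, y, T))  # line 3: boost each bag
    # line 5: weighted majority vote over all M*T weak hypotheses
    return lambda X_new: np.sign(sum(s(X_new) for s in scorers))
```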
3 EXPERIMENTS
In this section, we demonstrate the performance of
our proposed algorithms in term of AUC (Area Under
the receiver operating characteristic Curve) (Bradley,
1997). Eleven UCI data sets (Blake and Merz, 1998)
have been used to characterize the efficiency of our
methods compared with other methods.
3.1 Performance Evaluation Criteria
It is now well known that traditional evaluation criteria such as sensitivity, precision, recall, specificity and accuracy are no longer appropriate for evaluating the performance of algorithms under class imbalance, because the minority class is dominated by the majority class. For example, in credit card fraud detection it is meaningless to report a high accuracy on a data set whose imbalance ratio is 100, i.e., where the number of majority class examples (legitimate transactions) is 100 times that of the minority class examples (fraudulent transactions). An algorithm that tries to maximize accuracy can reach 99% simply by predicting every example as the majority class; although the accuracy seems high, the predictor is useless since no fraudulent transaction will be detected.
Consequently, we adopt the AUC as the performance evaluation measure, which has proved to be a reliable measure for class imbalance problems (Fawcett, 2004). The AUC is the area under the receiver operating characteristic (ROC) curve, where the ROC curve depicts the trade-off between the benefit ($tpr$, the true positive rate) and the cost ($fpr$, the false positive rate). It provides a single measure of a method's performance: the higher the AUC, the better the method.
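For reference, the AUC can also be computed directly from classifier scores via the rank (Mann-Whitney) statistic, without plotting the ROC curve; a minimal sketch of ours, ignoring tied scores for simplicity:

```python
# AUC via the Mann-Whitney identity: the AUC equals the probability that a
# randomly chosen positive example is scored above a randomly chosen
# negative one. Ties are ignored for simplicity.
import numpy as np

def auc(scores, labels):                 # labels in {-1, +1}
    pos = scores[labels == +1]
    neg = scores[labels == -1]
    # count positive/negative pairs that are ranked correctly
    correct = (pos[:, None] > neg[None, :]).sum()
    return correct / (len(pos) * len(neg))
```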
3.2 Implementation and Parameters
Information about the 11 UCI data sets is summarized in Table 1: the number of attributes (#Attribute), the number of examples (Size), the numbers of minority and majority class examples (#Min, #Maj), and the imbalance ratio (IR). The data sets in Table 1 are ordered by ascending imbalance ratio. The original multi-class UCI data sets were converted into two-class imbalanced problems by merging one or more classes into the minority class and merging the remaining classes into the majority class.
BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms
192
Table 1: Summary Description of the Imbalanced
Databases. #Attribute is the Number of Attributes, Size
is the Total Number of Examples, #Min and #Maj are the
Number of Minority Class Examples and Majority Class
Examples Respectively, and IR is the Imbalance Ratio.
Database #Attribute Size #Min #Maj IR
wdbc 30 569 212 357 1.7
ionosphere 33 351 126 225 1.8
pima 8 768 268 500 1.9
car 6 1728 518 1210 2.3
haberman 3 306 81 225 2.8
wpbc 33 198 47 151 3.2
mf-zernike 47 2000 200 1800 9.0
mf-morph 6 2000 200 1800 9.0
mf-kar 64 2000 200 1800 9.0
balance 5 625 49 576 11.8
abalone 8 4177 261 3916 15.0
We compared the performance of our proposed algorithms with standard AdaBoost and EasyEnsemble (Liu et al., 2009) in terms of AUC. The EasyEnsemble algorithm is also an ensemble of ensembles: it uses bagging as the main ensemble learning method and trains each bag using AdaBoost, but it applies only random undersampling to deal with the class imbalance problem. The number of boosting iterations T was set to 10 and the number of computing workers (i.e., the number of bags) M was set to 4. We adopted the decision tree (Mitchell, 1997) as the weak learning algorithm. For each data set, we performed 10-fold cross-validation for all methods: each data set was split into 10 folds, each containing 10% of the examples; for each fold, the algorithms were trained on the examples in the remaining nine folds and then tested on the current fold. The mean and standard deviation of the AUC of each method were calculated from the 10 results of this cross-validation process.
3.3 Performance of the Different Methods
We ran four methods (AdaBoost, EasyEnsemble, AdaBoost.BSR and AdaBoost.BSS) on a single machine to compare their performance on class imbalance problems. The mean and standard deviation of the AUC of each method are summarized in Table 2; the highest average AUC on each data set is marked with an asterisk. Table 2 shows that EasyEnsemble, AdaBoost.BSR and AdaBoost.BSS, which are specifically designed to deal with class imbalance, achieve better performance than standard AdaBoost. The proposed AdaBoost.BSR and AdaBoost.BSS obtain the highest average AUC on all the imbalanced data sets except ionosphere. Moreover, AdaBoost.BSS, which combines random undersampling with the SMOTE algorithm to cope with class imbalance, shows increasingly superior performance as the imbalance ratio grows. This is probably because the division mechanism of the Balance-Sampling algorithm fully exploits the useful information of the majority class, while SMOTE further increases the ensemble diversity.
Table 2: AUC of the Compared Methods (mean ± standard deviation). The highest average AUC on each data set is marked with an asterisk.

Database     AdaBoost           EasyEnsemble       AdaBoost.BSR       AdaBoost.BSS
wdbc         0.9892 ± 0.0077    0.9914 ± 0.0083    0.9954 ± 0.0066*   0.9870 ± 0.0134
ionosphere   0.9453 ± 0.0359    0.9739 ± 0.0240*   0.9716 ± 0.0235    0.9709 ± 0.0398
pima         0.7514 ± 0.0293    0.8181 ± 0.0268    0.8339 ± 0.0362*   0.8156 ± 0.0499
car          0.9999 ± 0.0002    0.9999 ± 0.0002    0.9999 ± 0.0002    0.9999 ± 0.0002
haberman     0.6617 ± 0.0901    0.6571 ± 0.1484    0.6812 ± 0.0842*   0.6614 ± 0.0975
wpbc         0.6247 ± 0.1338    0.7387 ± 0.1739    0.7480 ± 0.0861    0.7573 ± 0.1388*
mf-zernike   0.7957 ± 0.0389    0.9086 ± 0.0174    0.9011 ± 0.0213    0.9113 ± 0.0206*
mf-morph     0.8824 ± 0.0350    0.9143 ± 0.0170    0.9216 ± 0.0159    0.9258 ± 0.0202*
mf-kar       0.9915 ± 0.0119    0.9949 ± 0.0046    0.9956 ± 0.0039    0.9957 ± 0.0054*
balance      0.6162 ± 0.1161    0.5779 ± 0.1399    0.6417 ± 0.0875    0.6707 ± 0.0873*
abalone      0.8134 ± 0.0432    0.8728 ± 0.0314    0.8769 ± 0.0239    0.8815 ± 0.0259*
4 CONCLUSION
We have developed two parallel ensemble methods (AdaBoost.BSR and AdaBoost.BSS) to deal with imbalanced big data. We use bagging as the main ensemble learning method and train each bag using AdaBoost, thereby combining the advantages of both boosting and bagging. In addition, the proposed Balance-Sampling algorithm, which combines random undersampling with random oversampling or the SMOTE algorithm, is able to fully exploit the useful information of the majority class and increases the ensemble diversity. The experimental results indicate that our methods can effectively deal with class imbalance problems, especially on data sets with higher imbalance ratios. Due to the algorithms' parallel structure, our methods are easy to implement in parallel. In the future, we plan to run the proposed methods on imbalanced big data sets in a parallel distributed MapReduce framework to test their efficiency.
REFERENCES
Blake, C. and Merz, C. J. (1998). UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/mlrepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 55.
Bradley, A. P. (1997). The use of the area under the
roc curve in the evaluation of machine learning algo-
rithms. Pattern recognition, 30(7):1145–1159.
Breiman, L. (1996). Bagging predictors. Machine learning,
24(2):123–140.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,
W. P. (2011). Smote: synthetic minority over-
sampling technique. arXiv preprint arXiv:1106.1813.
Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Edi-
torial: special issue on learning from imbalanced data
sets. ACM SIGKDD Explorations Newsletter, 6(1):1–
6.
Dean, J. and Ghemawat, S. (2008). Mapreduce: simplified
data processing on large clusters. Communications of
the ACM, 51(1):107–113.
del Río, S., López, V., Benítez, J. M., and Herrera, F. (2014). On the use of mapreduce for imbalanced big data using random forest. Information Sciences.
Efron, B. and Tibshirani, R. J. (1994). An introduction to
the bootstrap, volume 57. CRC press.
Estabrooks, A., Jo, T., and Japkowicz, N. (2004). A multi-
ple resampling method for learning from imbalanced
data sets. Computational Intelligence, 20(1):18–36.
Fawcett, T. (2004). Roc graphs: Notes and practical consid-
erations for researchers. Machine learning, 31:1–38.
Freund, Y. and Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37. Springer.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H.,
and Herrera, F. (2012). A review on ensembles for
the class imbalance problem: bagging-, boosting-, and
hybrid-based approaches. Systems, Man, and Cyber-
netics, Part C: Applications and Reviews, IEEE Trans-
actions on, 42(4):463–484.
Greene, C. S., Tan, J., Ung, M., Moore, J. H., and Cheng, C.
(2014). Big data bioinformatics. Journal of cellular
physiology.
He, H. and Garcia, E. A. (2009). Learning from imbalanced
data. Knowledge and Data Engineering, IEEE Trans-
actions on, 21(9):1263–1284.
Kearns, M. and Valiant, L. (1994). Cryptographic lim-
itations on learning boolean formulae and finite au-
tomata. Journal of the ACM (JACM), 41(1):67–95.
Kubat, M., Matwin, S., et al. (1997). Addressing the curse
of imbalanced training sets: one-sided selection. In
ICML, volume 97, pages 179–186. Nashville, USA.
Laney, D. (2001). 3d data management: Controlling data
volume, velocity and variety. META Group Research
Note, 6.
Liu, X.-Y., Wu, J., and Zhou, Z.-H. (2009). Exploratory
undersampling for class-imbalance learning. Systems,
Man, and Cybernetics, Part B: Cybernetics, IEEE
Transactions on, 39(2):539–550.
Maclin, R. and Opitz, D. (2011). Popular ensem-
ble methods: An empirical study. arXiv preprint
arXiv:1106.0257.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R.,
Roxburgh, C., and Byers, A. H. (2011). Big data: The
next frontier for innovation, competition, and produc-
tivity. Technical report, McKinsey Global Institute.
Mitchell, T. M. (1997). Machine learning. 1997. Burr
Ridge, IL: McGraw Hill, 45.
Schapire, R. E. (1990). The strength of weak learnability.
Machine learning, 5(2):197–227.
Sun, Y., Kamel, M. S., Wong, A. K., and Wang, Y. (2007).
Cost-sensitive boosting for classification of imbal-
anced data. Pattern Recognition, 40(12):3358–3378.
Zadrozny, B., Langford, J., and Abe, N. (2003). Cost-
sensitive learning by cost-proportionate example
weighting. In Data Mining, 2003. ICDM 2003. Third
IEEE International Conference on, pages 435–442.
IEEE.
Zhou, Z.-H. (2012). Ensemble methods: foundations and
algorithms. CRC Press.
BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms
194