learning mechanisms to generate several classifiers
which employ voting to provide a classification.
5 CONCLUSIONS
Starting from the observation that, when dealing with IDS, there is no winning strategy for all data sets (neither in terms of sampling nor of algorithm), special attention should be paid to the particularities of the data at hand. In doing so, one should consider a wider context, taking several factors into account simultaneously: the imbalance rate together with other data-related meta-features, the algorithms, and their associated parameters.
Our experiments show that, for an imbalanced problem, the IR can be used in conjunction with the data set dimensionality and the IAR factor to select the classifier that best fits the situation. Moreover, a good metric for assessing the performance of the resulting model is important; again, it should be chosen based on the particularities of the problem and of the goal established for it.
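As a minimal sketch of why the metric matters on imbalanced data, the snippet below contrasts plain accuracy with balanced accuracy (the mean of per-class recalls) on a degenerate classifier that always predicts the majority class. The function names and the toy data are illustrative, not taken from the paper:

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR: majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; unlike plain accuracy, it is not
    inflated by always predicting the majority class."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy example: 90 majority samples, 10 minority samples (IR = 9.0),
# and a classifier that always predicts the majority class.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(imbalance_ratio(y_true))            # 9.0
print(plain)                              # 0.9 -- looks good, but misleading
print(balanced_accuracy(y_true, y_pred))  # 0.5 -- exposes the failure
```

Plain accuracy reports 0.9 even though the minority class is never detected, while balanced accuracy drops to chance level (0.5), which is why metric choice should track the imbalance of the data.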
When starting an evaluation, we should begin with the imbalanced data set and the MLP, since it proved to be the best classifier in every category we evaluated on imbalanced data sets. If training the MLP takes too long, the second-best choice is either the C4.5 decision tree (without pruning, which becomes preferable as the IR increases) or NB. In terms of evaluation metrics, the choice should be based on the particularities of the data (i.e. the imbalance), but also on the goal of the classification process (are we dealing with a cost-sensitive classification, or are all errors equally serious?).
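The rule of thumb above can be encoded as a small selection heuristic. This is a hypothetical sketch: the function name and the IR threshold are illustrative choices of ours, not values given by the experiments:

```python
def recommend_classifier(ir, mlp_time_budget_ok):
    """Hypothetical encoding of the recommendation: start with an MLP
    on the imbalanced data; if MLP training is too slow, fall back to
    C4.5 (unpruned at high IR) or Naive Bayes. The threshold of 4 is
    purely illustrative, not a value reported in the paper."""
    if mlp_time_budget_ok:
        return "MLP"
    if ir > 4:  # illustrative cut-off for "high" imbalance
        return "C4.5 (unpruned)"
    return "C4.5 (pruned) or Naive Bayes"

print(recommend_classifier(2.0, mlp_time_budget_ok=True))   # MLP
print(recommend_classifier(9.0, mlp_time_budget_ok=False))  # C4.5 (unpruned)
```

In practice, such a heuristic would be refined with the other meta-features mentioned above (dimensionality, IAR) rather than IR alone.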
ACKNOWLEDGEMENTS
The work for this paper has been supported by research grant no. 12080/2008 – SEArCH, funded by the Ministry of Education, Research and Innovation.
A COMPREHENSIVE STUDY OF THE EFFECT OF CLASS IMBALANCE ON THE PERFORMANCE OF
CLASSIFIERS