Impact on Bayesian Networks Classifiers
When Learning from Imbalanced Datasets
M. Julia Flores and José A. Gámez
Computing Systems Department – SIMD group (I3A), University of Castilla - La Mancha, Campus, 02071, Albacete, Spain
Keywords:
Bayesian Networks, Supervised Classification, Data Mining, Imbalanced Datasets, Naive Bayes.
Abstract:
In this paper we present a study on the behaviour of some representative Bayesian Network Classifiers
(BNCs) when the dataset they are learned from presents imbalanced data, that is, there are far fewer cases
labelled with a particular class value than with the other one (assuming binary classification problems). This
is a typical source of trouble in some datasets, and the development of more robust techniques is currently
very important. In this study we have selected a benchmark of 129 imbalanced datasets and performed an
analytical study focusing on BNCs. Our results show good performance of these classifiers, which outperform
decision trees (C4.5). Finally, an algorithm to improve the performance of any BNC is also given: we have
carried out experiments showing how oversampling of the minority class can be used to achieve a desired
value of the imbalance ratio (IR), defined as the number of cases of the majority class divided by the number
of cases of the minority class. From this work we conclude that BNCs show very good performance on
imbalanced datasets, and that our proposal enhances their results on those datasets where they performed
poorly.
1 INTRODUCTION
Supervised learning constructs models from externally
supplied instances to produce general hypotheses,
which then make predictions about future instances.
In other words, the goal of supervised learning is to
build a concise model of the distribution of class la-
bels in terms of the predictive attributes (features). There
exist many families of such models, known as clas-
sifiers, for example decision trees, support vector ma-
chines, artificial neural networks or rule systems. Any
kind of classifier is always used to assign class la-
bels to a set of testing instances where the values of
the predictor features are known, but the value of the
class label is unknown. In fact, a classifier is needed
to provide the correct class label for future or unseen
cases, for example determining whether a manufactured piece
is defective based on a set of features extracted from
visual information (an image taken by a camera), or
determining whether an e-mail is spam based on the
attributes that can be extracted from the correspond-
ing information (subject, sender, etc.).
Most classifier learning algorithms assume a rela-
tively balanced distribution (Breiman, 1998), so when
there is an under-represented class this poses a seri-
ous problem for them, and they generally show poor
generalisation performance on the minority class. In
real applications, the class imbalance scenario appears
more frequently than expected. For example, in med-
ical diagnosis the disease cases are usually quite rare
in the global population (where most of the people do
not suffer from the related disease). However, the
key task here is to detect which people have this dis-
ease. In this case, a classifier that provides a higher
identification rate on the disease category would pro-
vide better performance. Other fields of application
where imbalanced data naturally arise are fraud de-
tection, monitoring, text categorization, risk manage-
ment, etc. The problem arises in so many real applications that
it can be considered one of the top problems in data
mining today (Lopez et al., 2013).
Bayesian Network classifiers (BNCs) are
Bayesian Network (BN) models (Korb and Nichol-
son, 2010) specifically tailored for classification
tasks. There is a wide range of existing models that
vary in complexity and efficiency. All of them have in
common the ability to deal with uncertainty in a very
natural way, at the same time providing a descriptive
environment. In this work, we will focus on the
family of semi-naive (Kononenko, 1991) Bayesian
classifiers (Naive Bayes, AODE, TAN, kDB, etc.).
The capability of a BN to express relationships,
dependencies and independences between variables
(features in this case) is given by its associated graph
(qualitative part), and these relationships are also
modelled by a second, quantitative element that
completes the BN: the probability distributions. In our opinion,
there are two main characteristics that have made
BNCs so popular: they provide predictions in terms
of probabilities (that could be interpreted as weights)
and they are easily and intuitively interpreted by non-
experts (thanks to the underlying graph). Empirically,
BNCs have also been successfully applied in many applica-
tion areas (Flores et al., 2012) such as Computing,
Robotics, Medicine, Healthcare, Finance, Banking
and Environmental Science.
This paper is organized as follows. In Section 2,
we review some previous work related to the current
one, and discuss the novelty of our approach. In Sec-
tion 3, we present the first part of our experiments,
where we analyse the behaviour of some BNCs in a
benchmark of imbalanced datasets. Section 4 presents
the final experimentation of the paper, which leads to
the algorithm we propose; some conclusions drawn from
these results are also given. Finally,
Section 5 provides a general discussion and future re-
search lines.
2 RELATED WORK
In (Lopez et al., 2013), the authors present a very interest-
ing study identifying the intrinsic characteristics that
affect performance when applying supervised classification
models on imbalanced datasets. In that work the
authors pre-select a set of 66 datasets (a subset
of those in Table 1). The work shows that the Imbal-
ance Ratio (IR) is, of course, a very important factor
when working with imbalanced datasets; however, the
performance of classifiers cannot be expressed as a
simple (linear) function of this measure.
They show how other aspects can also have an influence, such
as the presence of small disjuncts, the lack of density
in the training data, the overlapping between classes,
the identification of noisy data, the significance of the
borderline instances, and the dataset shift between the
training and test distributions. The problem with
these other measures is that obtaining them is not an
easy task, and even when approximate values can be
computed, they imply a high computational cost. Besides,
they depend on the particular problem to solve, while
we are interested in developing general techniques ap-
plicable to any dataset. That is why we will first
study the behaviour with respect to the IR value, in
combination with other graphical tools and plots.
The novelty of the current work is that it is
uniquely focused on the behaviour of BNCs, since
most of the works related to imbalanced datasets are
devoted to other kinds of models; for example, (Lopez
et al., 2013) uses decision trees (C4.5), Support Vec-
tor Machines (SVMs) and the k-Nearest Neighbours
(kNN) model, which belongs to the family of instance-
based learning. On the other hand, (Sun et al., 2007)
applies two kinds of systems: again the C4.5 decision
tree, and an associative classification system called
HPWR (High-order Pattern and Weight-of-evidence
Rule based classifier). We consider that the study of the
behaviour of BNCs with imbalanced data constitutes an
open research line, and studying how to approach this
problem is the main aim of this paper.
Another related work where BNCs are used is
(Wasikowski and wen Chen, 2010), but it is not ap-
plicable to the available datasets, since the number of
attributes in our problems is too low and the num-
ber of instances is comparatively small, so it does not
make sense to apply feature subset selection. In that
work the authors use datasets which are not origi-
nally imbalanced, whereas we wanted to apply BNCs to
real imbalanced datasets.
3 ANALYTICAL EXPERIMENTS
Here we will show the basis for our study and an ini-
tial experimental set-up, analysing its results.
3.1 Selected BNCs
The classification task consists of assigning one category c_i, i.e. a value of the class variable
C = {c_1, ..., c_k}, to a new object ~e, which is defined by the assignment of a set of values,
~e = (a_1, a_2, ..., a_n), to the attributes A_1, ..., A_n. In the probabilistic case, this
task can be accomplished in an exact way by applying Bayes' theorem (Equation 1). How-
ever, since working with joint probability distributions is usually unmanageable, simpler models based
on factorizations are normally used for this problem. In this work we apply four computationally efficient
paradigms: Naive Bayes (NB), kDB, TAN and AODE. Our experiments will also show that imbalanced prob-
lems do not benefit from more complex BNCs.

p(c | ~e) = p(c) p(~e | c) / p(~e).    (1)
These models, like any BN, are represented by a Di-
rected Acyclic Graph, whose nodes represent variables
and whose edges (by their presence or absence) imply the rela-
tionships between them, which can be read off using the d-separation
concept (Korb and Nicholson, 2010).

Figure 1: Examples of 4-node network structures for the BN classifiers: (a) Naive Bayes, (b) TAN/KDB1, (c) AODE.

For example, from the structure of NB (Figure 1.(a)), when
ImpactonBayesianNetworksClassifiersWhenLearningfromImbalancedDatasets
383
C
A
1
A
2
A
3
A
4
C
A
1
A
2
A
3
A
4
C
A
3
A
1
A
2
A
4
(a) Naive Bayes (b) TAN/KDB1 (c) AODE
Figure 1: Examples of 4-nodes network structure for the BN classifiers: NB, TAN and KDB1, AODE.
the value of the class is known (observed), all the
other variables (called attributes or features) remain
independent. When we work with a general BN, any
variable can be linked to any other (as long as the
graph remains acyclic). Even though there are algo-
rithms for learning general BNs (Korb and Nicholson, 2010,
Part II), they are much slower and more complex than
those for semi-Naive classifiers, where the structure
is fixed or at least constrained. Notice that any learn-
ing algorithm needs to discover the graph structure
first and then perform parametric learning of the
probability distributions: for any variable X it stores
P(X | pa(X)), where pa(X) denotes the parents of X. In particular,
kDB allows k parents apart from the class, TAN learns
a tree over the attributes and then links the class to all of them
(Figure 1.(b)), and AODE uses one model per attribute,
like the one in Figure 1.(c), and averages over them.
Most of these models require discrete data, so
when the values are numeric a previous discretization
task has to be performed. In the case of Naive Bayes,
the continuous model assumes conditional Gaussian
distributions.
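To make the factorization behind these models concrete, below is a minimal sketch (in Python, written only for illustration; it is not the implementation used in the paper's experiments) of a discrete Naive Bayes classifier that computes p(c | ~e) ∝ p(c) ∏_i p(a_i | c) with Laplace smoothing, following Equation 1 and the structure of Figure 1.(a):

```python
from collections import Counter, defaultdict

class SimpleDiscreteNB:
    """Minimal discrete Naive Bayes: p(c|e) proportional to p(c) * prod_i p(a_i|c)."""

    def fit(self, X, y, alpha=1.0):
        self.alpha = alpha
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        self.n = len(y)
        # counts[(j, c)][v] = times attribute j takes value v given class c
        self.counts = defaultdict(Counter)
        self.values = defaultdict(set)          # observed values per attribute
        for xi, c in zip(X, y):
            for j, v in enumerate(xi):
                self.counts[(j, c)][v] += 1
                self.values[j].add(v)
        return self

    def predict_proba(self, xi):
        scores = {}
        for c in self.classes:
            p = self.class_counts[c] / self.n   # prior p(c)
            for j, v in enumerate(xi):          # likelihood terms p(a_j | c), Laplace-smoothed
                num = self.counts[(j, c)][v] + self.alpha
                den = self.class_counts[c] + self.alpha * len(self.values[j])
                p *= num / den
            scores[c] = p
        z = sum(scores.values())                # normalize, i.e. divide by p(~e)
        return {c: s / z for c, s in scores.items()}

# Tiny usage example with two attributes and a binary class
nb = SimpleDiscreteNB().fit([["a", "x"], ["a", "y"], ["b", "y"]], ["+", "-", "-"])
print(nb.predict_proba(["a", "y"]))
```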
3.2 Datasets for the Experimentation
We have used the KEEL-dataset repository. KEEL
stands for Knowledge Extraction based on Evolu-
tionary Learning, and it is an open-source Java soft-
ware tool developed by six Spanish research groups.
In particular, we have worked with the datasets
in the imbalanced category, which can be found
at http://sci2s.ugr.es/keel/imbalanced.php. Among
them, we have omitted those which do not have a bi-
nary class, because multi-valued classes make
the problem different (Sun, 2007), and we want to first
provide a rigorous study of the binary case. In fact,
any problem could be transformed into a binary one
using a minority-class-vs-all approach; this can be
tackled in future research. Table 1 shows the names
of the datasets we used in our experiments, together with their
basic information:
- number of instances (#Inst)
- number of features or attributes (#att)
- imbalance ratio (IR), defined as #M / #m, where #M represents the number of instances belonging to the majority class and #m those belonging to the minority class. Note that #M + #m = #Inst.
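As a tiny illustration (a hypothetical helper, not part of the paper's tooling), the IR of a binary dataset can be computed directly from its class labels:

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR = #M / #m: majority-class count divided by minority-class count."""
    counts = Counter(labels)
    assert len(counts) == 2, "IR as used here assumes a binary class"
    major, minor = sorted(counts.values(), reverse=True)
    return major / minor

# e.g. 1200 majority cases and 100 minority cases -> IR = 12.0
print(imbalance_ratio(["+"] * 100 + ["-"] * 1200))
```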
3.3 Evaluating Models
We must indicate that we will use the Area
Under the ROC Curve (AUC) as the evaluation measure for the classi-
fiers in all our experiments. We selected this measure
because it has been shown to assess performance properly
when the instances are imbalanced with respect to the class labels,
providing a single number summarising a classifier's per-
formance. For other applications, where the dataset
is supposed to be better distributed in terms of the
class labels, accuracy, i.e. the percentage of correctly classified
instances, is a standard measure. However, when classifying
under the class imbalance problem, accuracy is no longer a proper
measure, since the rare class has very little impact on
accuracy compared to the majority class. For in-
stance, if the minority class has a presence of only 1%
in the training data, a simple strategy is to pre-
dict the prevalent class label for every example; it
achieves a high accuracy of 99%. However, this mea-
surement is useless if the main concern is the
identification of the rare cases.
Most measures for evaluating classification per-
formance can be derived from the confusion matrix,
shown in Figure 2. In its cells, T stands for True, F
for False, P for Positive and N for Negative, giving
the four possible combinations. To grasp their meaning better,
consider FN, for example: it represents the False Negative cases,
i.e. those cases classified as negative whose actual
value is positive; they are called false because they imply an error.

Figure 2: Confusion matrix.
             Predicted +   Predicted -
Actual +     TP            FN
Actual -     FP            TN
From this table it is easy to obtain accuracy (in %), which is (TP + TN) / (TP + FN + FP + TN).
ICAART2015-InternationalConferenceonAgentsandArtificialIntelligence
384
Table 1: Imbalanced datasets used in this work (taken from KEEL repository).
nr dataset #Inst #att IR nr dataset #Inst #att IR
1 ecoli-0 vs 1 220 8 1.86 2 ecoli1 336 8 3.36
3 ecoli2 336 8 5.46 4 ecoli3 336 8 8.6
5 glass-0-1-2-3 vs 4-5-6 214 10 3.2 6 glass0 214 10 2.06
7 glass1 214 10 1.82 8 glass6 214 10 6.38
9 haberman 306 4 2.78 10 iris0 150 5 2.0
11 new-thyroid1 215 6 5.14 12 new-thyroid2 215 6 5.14
13 page-blocks0 5472 11 8.79 14 pima 768 9 1.87
15 segment0 2308 20 6.02 16 vehicle0 846 19 3.25
17 vehicle1 846 19 2.9 18 vehicle2 846 19 2.88
19 vehicle3 846 19 2.99 20 wisconsin 683 10 1.86
21 yeast1 1484 9 2.46 22 yeast3 1484 9 8.1
23 abalone19 4174 9 129.44 24 abalone9-18 731 9 16.4
25 ecoli-0-1-3-7 vs 2-6 281 8 39.14 26 ecoli4 336 8 15.8
27 glass-0-1-6 vs 2 192 10 10.29 28 glass-0-1-6 vs 5 184 10 19.44
29 glass2 214 10 11.59 30 glass4 214 10 15.46
31 glass5 214 10 22.78 32 page-blocks-1-3 vs 4 472 11 15.86
33 shuttle-c0-vs-c4 1829 10 13.87 34 shuttle-c2-vs-c4 129 10 20.5
35 vowel0 988 14 9.98 36 yeast-0-5-6-7-9 vs 4 528 9 9.35
37 yeast-1-2-8-9 vs 7 947 9 30.57 38 yeast-1-4-5-8 vs 7 693 9 22.1
39 yeast-1 vs 7 459 8 14.3 40 yeast-2 vs 4 514 9 9.08
41 yeast-2 vs 8 482 9 23.1 42 yeast4 1484 9 28.1
43 yeast5 1484 9 32.73 44 yeast6 1484 9 41.4
45 ecoli-0-1-4-6 vs 5 280 7 13.0 46 ecoli-0-1-4-7 vs 2-3-5-6 336 8 10.59
47 ecoli-0-1-4-7 vs 5-6 332 7 12.28 48 ecoli-0-1 vs 2-3-5 244 8 9.17
49 ecoli-0-1 vs 5 240 7 11.0 50 ecoli-0-2-3-4 vs 5 202 8 9.1
51 ecoli-0-2-6-7 vs 3-5 224 8 9.18 52 ecoli-0-3-4-6 vs 5 205 8 9.25
53 ecoli-0-3-4-7 vs 5-6 257 8 9.28 54 ecoli-0-3-4 vs 5 200 8 9.0
55 ecoli-0-4-6 vs 5 203 7 9.15 56 ecoli-0-6-7 vs 3-5 222 8 9.09
57 ecoli-0-6-7 vs 5 220 7 10.0 58 glass-0-1-4-6 vs 2 205 10 11.06
59 glass-0-1-5 vs 2 172 10 9.12 60 glass-0-4 vs 5 92 10 9.22
61 glass-0-6 vs 5 108 10 11.0 62 led7digit-0-2-4-5-6-7-8-9 vs 1 443 8 10.97
63 yeast-0-2-5-6 vs 3-7-8-9 1004 9 9.14 64 yeast-0-2-5-7-9 vs 3-6-8 1004 9 9.14
65 yeast-0-3-5-9 vs 7-8 506 9 9.12 66 abalone-17 vs 7-8-9-10 2338 9 39.31
67 abalone-19 vs 10-11-12-13 1622 9 49.69 68 abalone-20 vs 8-9-10 1916 9 72.69
69 abalone-21 vs 8 581 9 40.5 70 abalone-3 vs 11 502 9 32.47
71 car-good 1728 7 24.04 72 car-vgood 1728 7 25.58
73 dermatology-6 358 35 16.9 74 flare-F 1066 12 23.79
75 kddcup-buffer overflow vs back 2233 42 73.43 76 kddcup-guess passwd vs satan 1642 42 29.98
77 kddcup-land vs portsweep 1061 42 49.52 78 kddcup-land vs satan 1610 42 75.67
79 kddcup-rootkit-imap vs back 2225 42 100.14 80 kr-vs-k-one vs fifteen 2244 7 27.77
81 kr-vs-k-three vs eleven 2935 7 35.23 82 kr-vs-k-zero-one vs draw 2901 7 26.63
83 kr-vs-k-zero vs eight 1460 7 53.07 84 kr-vs-k-zero vs fifteen 2193 7 80.22
85 lymphography-normal-fibrosis 148 19 23.67 86 poker-8-9 vs 5 2075 11 82.0
87 poker-8-9 vs 6 1485 11 58.4 88 poker-8 vs 6 1477 11 85.88
89 poker-9 vs 7 244 11 29.5 90 shuttle-2 vs 5 3316 10 66.67
91 shuttle-6 vs 2-3 230 10 22.0 92 winequality-red-3 vs 5 691 12 68.1
93 winequality-red-4 1599 12 29.17 94 winequality-red-8 vs 6 656 12 35.44
95 winequality-red-8 vs 6-7 855 12 46.5 96 winequality-white-3-9 vs 5 1482 12 58.28
97 winequality-white-3 vs 7 900 12 44.0 98 winequality-white-9 vs 4 168 12 32.6
99 zoo-3 101 17 19.2 100 03subcl5-600-5-0-BI 600 3 5.0
101 03subcl5-600-5-30-BI 600 3 5.0 102 03subcl5-600-5-50-BI 600 3 5.0
103 03subcl5-600-5-60-BI 600 3 5.0 104 03subcl5-600-5-70-BI 600 3 5.0
105 03subcl5-800-7-0-BI 800 3 7.0 106 03subcl5-800-7-30-BI 800 3 7.0
107 03subcl5-800-7-50-BI 800 3 7.0 108 03subcl5-800-7-60-BI 800 3 7.0
109 03subcl5-800-7-70-BI 800 3 7.0 110 04clover5z-600-5-0-BI 600 3 5.0
111 04clover5z-600-5-30-BI 600 3 5.0 112 04clover5z-600-5-50-BI 600 3 5.0
113 04clover5z-600-5-60-BI 600 3 5.0 114 04clover5z-600-5-70-BI 600 3 5.0
115 04clover5z-800-7-0-BI 800 3 7.0 116 04clover5z-800-7-30-BI 800 3 7.0
117 04clover5z-800-7-50-BI 800 3 7.0 118 04clover5z-800-7-60-BI 800 3 7.0
119 04clover5z-800-7-70-BI 800 3 7.0 120 paw02a-600-5-0-BI 600 3 5.0
121 paw02a-600-5-30-BI 600 3 5.0 122 paw02a-600-5-50-BI 600 3 5.0
123 paw02a-600-5-60-BI 600 3 5.0 124 paw02a-600-5-70-BI 600 3 5.0
125 paw02a-800-7-0-BI 800 3 7.0 126 paw02a-800-7-30-BI 800 3 7.0
127 paw02a-800-7-50-BI 800 3 7.0 128 paw02a-800-7-60-BI 800 3 7.0
129 paw02a-800-7-70-BI 800 3 7.0
ImpactonBayesianNetworksClassifiersWhenLearningfromImbalancedDatasets
385
As we commented before, the problem with accuracy is that it cannot capture
the differences between errors. One measure able to
deal with this difference is the F-measure, which com-
bines precision (TP / (TP + FP)) and recall (TP / (TP + FN)).
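As a quick illustration (a minimal sketch with hypothetical helper names, independent of any particular toolkit), these quantities follow directly from the four cells of the confusion matrix; the last line reproduces the 1%-minority example above:

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return accuracy, precision, recall, f_measure

# Always predicting the majority class with 1% minority: 99% accuracy,
# but precision, recall and F-measure on the minority (positive) class are 0.
print(confusion_metrics(tp=0, fn=10, fp=0, tn=990))
```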
AUC was the measure reported in the works we
reviewed on imbalanced datasets, and it has been
shown to work properly for measuring performance
on this kind of problem (Huang and Ling,
2005). When classifiers assign a probabilistic score
to their predictions, the class prediction can be changed by
varying the score threshold. Each threshold value
generates a pair of measurements of False Positive
Rate (FPR) and True Positive Rate (TPR). By link-
ing these measurements, with FPR on the X-axis and
TPR on the Y-axis, a ROC graph is plotted. This plot
is called the ROC curve, and it gives a good summary
of the performance of a classification model. AUC is
the area under this curve, with 1 being the best value and
0.5 the worst (that of a random classifier).
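For reference, a minimal sketch of how AUC can be obtained from probabilistic scores, here using scikit-learn's roc_auc_score as an assumed convenience (the paper's experiments were carried out with other software):

```python
from sklearn.metrics import roc_auc_score

# True labels (1 = minority/positive class) and predicted probabilities p(class=1 | e)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.35, 0.70, 0.55]

# roc_auc_score sweeps the decision threshold internally, building the
# (FPR, TPR) pairs of the ROC curve and integrating the area under it.
print(roc_auc_score(y_true, y_score))   # 1.0 here: every positive outranks every negative
```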
3.4 Impact of IR
The first natural experiment is to replicate those
in (Lopez et al., 2013), where the authors show how
IR impacts the evaluation of classifiers. We evaluate the
classifiers measuring AUC with 5-fold stratified Cross
Validation (CV); this CV is also used when discretiza-
tion is performed beforehand. We used supervised dis-
cretization, in particular the method of (Fayyad and Irani,
1993). We discarded trying alternative techniques,
since it has been shown for BNCs that the discretiza-
tion method does not have an impact when the number
of datasets is significantly large (Flores et al., 2011).
We have carried out a set of tests and the conclu-
sion is similar: we cannot find a pattern of behaviour
for any range of IR, and the results can be poor both
for slightly and for highly imbalanced data. In Figure 3 we
show a plot with the AUC obtained for train and test
(averaged over the 5 folds), only for NB (continuous and
discretized); this tendency is similar for the other
BNCs we tested. The plot shows IR up to 40, where
most of the datasets are concentrated, so that it can be seen better;
the behaviour for larger IR values is also similar,
with ups and downs and non-linear with respect to IR.
Notice that two datasets may have the
same IR value (e.g. datasets nr. 59 and 65);
in those cases the AUC shown is the average of the
obtained measures.
From Figure 3 we can also see that the test and
train values are close, so we can corroborate that we are
not producing severe overfitting, probably thanks to CV.
Initially NB seems to perform better in its continuous
version, but this is not always true: for example, when
IR is between 5 and 8 (approx.) the discrete version
outperforms the continuous model, and the same happens
in the interval [13, 15]. These results are not conclu-
sive; the comparison among classifiers will be done in
the next subsection.

Figure 3: The x-axis shows IR and the y-axis the AUC value for test
and train, for Naive Bayes (G)aussian and (D)iscrete.
3.5 Comparison Among Classifiers
In order to compare all classifiers, we have discarded
the summary view of the previous analysis, where
each IR point could concentrate several datasets. In
this case we are going to use radial plots, where each
angle (from 0 to 2π radians) represents a dataset and
the length of the plotted point indicates the AUC ob-
tained. To find the correspondence between the datasets
in Table 1 and these plots, we indicate that they are
plotted anticlockwise (ACW), starting at the three o'clock position,
and the circle is drawn ACW, finishing at dataset nr. 129.
Notice that AUC is at most 1, so this is the radius of the plot.
From now on, we will just plot AUC values for the test sets (in fact, the average
of the 5 folds of CV), so that we can focus on
the performance of all the classifiers. In Figure 4, we
show a comparison among NB, AODE and C4.5.
We found this analysis quite informative for our
work, and Figure 4 succeeds in summarising all the
values at a single glance. Firstly, this figure confirms
that BNCs perform much better than C4.5, which was
one of the models chosen in previous relevant papers
dealing with imbalanced datasets. It is remarkable
how C4.5 reaches the 0.5 value (the worst result) for so
many datasets. In some cases (angles from 0 to π/4 radi-
ans) it obtains slightly better results, but the general com-
parison indicates this is clearly the worst model, since
all the other circular lines mostly enclose it when we
look at the whole graph. With respect to the BNCs
plotted here, there is not a clear winner: in some areas
NBG seems to win, but in other parts it is clearly im-
proved upon by AODE and NBD, its own discrete version.

Figure 4: Radial plot comparing NB, AODE, NBD and C4.5; each angle represents a distinct
dataset, from those listed in Table 1.
We wanted to extend the comparison to more com-
plex BNCs that allow relationships among attributes.
We originally chose AODE, as the literature shows it to be
the best semi-Naive Bayes classifier (Webb et al.,
2005). However, this test seemed useful to decide
whether or not to discard other models, which can capture more de-
pendencies but which are also slower to learn. We can
appreciate in Figure 5 that the differences when eval-
uating the chosen semi-Naive models (AODE, kDB
with k ∈ {2,3,4} and TAN) are almost insignificant.
The only area where we can perceive some differ-
ences is in some datasets between π and 15π/4, which
means less than 6 datasets, about 5% of the total sample.
What is the explanation for this result? In our view,
including more complexity, in the form of more de-
pendencies, does not provide better results because of
the class imbalance problem. It makes sense that sim-
pler models will perform better: they provide simi-
lar AUC values while being simpler and faster to learn.
That is the reason why, for our subsequent
experiments, we select only AODE and kDB with low values
of k (2 and 3). On the other hand, we can see how
Naive Bayes is sometimes outperformed by all the
other BNCs; this is due to the fact that the conditional
independences it naively assumes do not usually hold
in real problems.

Figure 5: Radial plot comparing the semi-Naive classifiers AODE, kDB (k = 2, 3, 4), NBD and TAN.
4 OUR PROPOSAL: IMPROVING IR WITH RE-SAMPLING METHODS
We propose a general method that could be applied
to any dataset, enhancing the overall performance of
the classification task. It is clear that IR affects clas-
sification, yet not in a linear way: the problems arise
when the numbers of instances of each class are
clearly imbalanced. So, we try to re-balance the datasets
and see how this affects the results.
4.1 Smoted Re-balance
When dealing with the imbalanced class problem, we
can use data-level or algorithm-level solutions. We
are going to use the most widely known method in
the first direction, generally called re-sampling. One
technique is to under-sample the majority class
instances; on the opposite side, we can over-sample
the minority class cases. We chose this second possi-
bility, since the real datasets available to us do not
contain many cases.
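Both data-level options are easy to sketch; the snippet below uses the imbalanced-learn package as an assumed stand-in for whatever re-sampling implementation one prefers (the paper itself only relies on SMOTE, described next):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# A toy imbalanced dataset: 200 majority (class 0) vs 20 minority (class 1) cases
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(2, 1, (20, 3))])
y = np.array([0] * 200 + [1] * 20)

# Option 1: under-sample the majority class down to the minority size
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Option 2: over-sample the minority class with synthetic SMOTE cases (our choice)
X_o, y_o = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

print(np.bincount(y_u), np.bincount(y_o))   # [20 20] and [200 200]
```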
4.2 Description of the Experiment
We performed an experiment with the purpose of
analysing how re-balancing the dataset can improve
classification. For every dataset, we have its initial IR,
which we denote IR_0. Then, we apply SMOTE (Chawla
et al., 2002) to get a better balance (hence a smaller ratio),
using target values IR' = 1, ..., 4, since we have tested with
these four values. There is a key point here: when
using SMOTE together with 5-fold CV, we only apply
oversampling to the instances in the training set (the
corresponding 4 folds), but we test on the original
fold, where obviously SMOTE is not applied, so
that we can report a fair evaluation. When applying
SMOTE we have selected the default value (5) for the
number of nearest neighbours used to interpolate val-
ues. Tests indicated this is the most appropriate value.
SMOTE also needs to know the percentage of
samples belonging to the minority class to be cre-
ated. Since we want to compensate IR to produce the
distinct IR' values, this "% of instances to add" can be
obtained as shown in Equation 2. For example, sup-
pose we have a dataset with 1200 cases of the ma-
jority class (#M) and 100 of the minority class (#m),
which yields IR_0 = 12. If we want to oversample the
minority class cases so that we reach IR' = 4, we
obtain 100 × (1200 / (4 × 100) − 1) = 200. That will produce
300 (original 100 + generated 200) cases for the mi-
nority class, and IR' = 1200 / 300 = 4, which was our
target smoted IR.

perc = 100 × ( #M / (IR' × #m) − 1 )    (2)
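A small sketch of Equation 2 with the worked example from the text (the helper name is hypothetical):

```python
def smote_percentage(n_majority, n_minority, target_ir):
    """Equation 2: % of minority instances to create so that IR drops to target_ir."""
    perc = 100.0 * (n_majority / (target_ir * n_minority) - 1.0)
    return max(perc, 0.0)   # no oversampling needed if IR is already <= target_ir

# Worked example from the text: #M = 1200, #m = 100, IR_0 = 12, target IR' = 4
print(smote_percentage(1200, 100, 4))   # 200.0 -> 200 new cases, minority grows to 300
```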
So the experiment we have done in this stage can
be summarised as below:
IR's[ ] ← {1, 2, 3, 4}
Classifiers[ ] ← {NBD, AODE, 2DB, 3DB}
Data[ ] ← {d_1, ..., d_n}        // the datasets in Table 1
for i ← 1, ..., 4 do
    m ← Classifiers[i]           // m is a model
    for j ← 1, ..., n do
        d ← Data[j]
        auc[j][0] ← CVORIGINAL(m, d)
        for k ← 1, ..., 4 do
            p ← OBTAINPERCFROMIR(IR's[k])
            auc[j][k] ← CVSMOTEDTRFOLDS(m, d, p)
        end for
    end for
end for
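For clarity, the per-fold step of CVSMOTEDTRFOLDS (oversampling only the training folds, then testing on the untouched fold) could look like the sketch below. It assumes scikit-learn and imbalanced-learn, and uses Gaussian Naive Bayes merely as a stand-in for the BNCs evaluated in the paper, whose experiments were run with different software:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB      # stand-in for the BNCs of the paper
from imblearn.over_sampling import SMOTE

def cv_smoted_train_folds(model, X, y, target_ir, n_splits=5, seed=0):
    """5-fold stratified CV, oversampling only the training folds down to target_ir.

    X, y are numpy arrays; the minority class is assumed to be labelled 1 and the
    dataset's original IR is assumed to be larger than target_ir.
    """
    aucs = []
    for train_idx, test_idx in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # sampling_strategy is the desired minority/majority ratio, i.e. 1 / IR'
        sm = SMOTE(sampling_strategy=1.0 / target_ir, k_neighbors=5, random_state=seed)
        X_tr, y_tr = sm.fit_resample(X_tr, y_tr)
        model.fit(X_tr, y_tr)
        # test on the untouched fold, scoring the minority (positive) class probability
        scores = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))
```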
The output of this experiment is shown per classi-
fier, just to see whether this smoted re-balance produces bet-
ter results and to what extent these results depend on IR'.

In the light of the results (Figure 6), this smoted
technique produces clearly better results for all clas-
sifiers. For space reasons, we do not show all the
classifiers, but we remark that the same tendency re-
peats for every classifier. There seems to be one particu-
lar dataset where using IR' = 3 or 4 performs slightly
worse than using the original IR; this is less than 1%
of our datasets. Furthermore, when IR' is 1 or 2, we
obtain much better results on all the datasets.

Figure 6: For each classifier, datasets are placed on the x-axis in ascending order of their original AUC value;
the y-axis (range 0.5 to 1) shows the AUC obtained with the original data (Orig) and with IR' = 1, ..., 4.
Top left: NB Discrete; top right: AODE; bottom left: 2DB; bottom right: 3DB.
4.3 Conclusions from the Results
From the previous results, it is evident that the use
of smoted re-balancing provides considerably better re-
sults for all datasets. Experimentation seems to rec-
ommend IR' = 1 or 2 in particular. The decision about
this value can be based on the importance of spending
less time (IR' = 2 will create fewer new minority
cases). However, time is not an issue for small datasets:
on average (over all tested models), evaluating the
four new IR' values for one dataset takes around one
third of a second on a relatively old computer (Java
on an Intel(R) Core(TM)2 Duo CPU, 2.66 GHz, 2 GB
of RAM), and AODE is faster than this average.
The most important conclusions from our explo-
ration of BNCs on this representative set of 129 im-
balanced datasets can be summarised as follows:
- All BNCs performed much better on imbalanced data than decision trees (C4.5), one of the models tested in previous papers.
- Among BNCs, Naive Bayes does not perform badly, but the conditional independence assumption makes the model too simple and it yields lower performance on some datasets.
- AODE seems to provide the best results, since more complex semi-Naive Bayes models, such as kDB or TAN, do not obtain better results in terms of AUC.
- Our proposal, based on smoted re-balancing, improves on the original results very significantly.
5 FINAL DISCUSSION
In this work we have performed a study on how
the Imbalance Ratio affects Bayesian Net-
work models used for classification, when the dataset from
which the model is learned presents imbalance be-
tween classes. As seen in the introduction, this prob-
lem is quite frequent in certain fields. We have seen
that the relation between IR and performance is not trivial
and cannot be captured by a linear function, since many
other factors intrinsic to the dataset can have an effect, as dis-
cussed in (Lopez et al., 2013). However, working
with this IR in combination with over-sampling tech-
niques such as SMOTE can produce a considerable gain in
classification performance. So, the two main con-
tributions of this paper are the use of BNCs for imbal-
anced datasets, together with an analysis of their perfor-
mance on a public benchmark of 129 datasets, and
the proposal of a new algorithm in which we recom-
mend using AODE as classifier and IR' = 1 or 2,
but which can be parametrised with other BNCs and
values and will work properly depending on the user's
preferences.
As future work there are two possible lines: inves-
tigating sophisticated boosting techniques focused on
imbalanced datasets that use cost functions (Sun et al.,
2007), probably with some adaptation to BNCs, and
also the use of oversampling methods other than
ICAART2015-InternationalConferenceonAgentsandArtificialIntelligence
388
0.5
0.6
0.7
0.8
0.9
1
0 20 40 60 80 100 120 140
Orig
IR1
IR2
IR3
IR4
0.5
0.6
0.7
0.8
0.9
1
0 20 40 60 80 100 120 140
Orig
IR1
IR2
IR3
IR4
0.5
0.6
0.7
0.8
0.9
1
0 20 40 60 80 100 120 140
Orig
IR1
IR2
IR3
IR4
0.5
0.6
0.7
0.8
0.9
1
0 20 40 60 80 100 120 140
Orig
IR1
IR2
IR3
IR4
Figure 6: For each classifier, datasets are set in the x-axis in an ascending order with respect to original AUC value. Top left:
NB Discrete; Top right: AODE; Bottom left: 2DB; Bottom right: 3DB. y-axis range is 0.5 to 1.
SMOTE. Finally, we aim at studying algorithms
to learn BNCs tailored for imbalanced data. The ex-
tension to multi-class problems could also be useful.
ACKNOWLEDGEMENTS
This work has been partially funded by FEDER
funds and the Spanish Government (MINECO) un-
der projects TIN2010-20900-C04-03 and TIN2013-
46638-C3-3-P.
REFERENCES
Breiman, L. (1998). Arcing classifiers. Annals of Statistics,
26:801–823.
Chawla, N., Bowyer, K., Hall, L. O., and Kegelmeyer, W. P.
(2002). SMOTE: Synthetic minority over-sampling
technique. Journal of Artificial Intelligence Research
(JAIR), 16:321–357.
Fayyad, U. M. and Irani, K. B. (1993). Multi-interval dis-
cretization of continuous valued attributes for classi-
fication learning. In Thirteenth International Joint
Conference on Artificial Intelligence, volume 2, pages
1022–1027. Morgan Kaufmann Publishers.
Flores, M. J., Gámez, J. A., and Martínez, A. M. (2012).
Supervised classification with Bayesian networks: A
review on models and applications, chapter 5, pages
72–102. IGI Global.
Flores, M. J., Gámez, J. A., Martínez, A. M., and Puerta,
J. M. (2011). Handling numeric attributes when com-
paring Bayesian network classifiers: does the dis-
cretization method matter? Applied Intelligence,
34(3):372–385.
Huang, J. and Ling, C. X. (2005). Using AUC and accuracy
in evaluating learning algorithms. IEEE Transactions
on Knowledge and Data Engineering, 17(3):299–310.
Kononenko, I. (1991). Semi-naive bayesian classifier. In
Machine Learning EWSL-91, volume 482 of Lecture
Notes in Computer Science, pages 206–219.
Korb, K. B. and Nicholson, A. E. (2010). Bayesian artificial
intelligence. Chapman & Hall/CRC, 2nd edition.
Lopez, V., Fernandez, A., Garcia, S., Palade, V., and Her-
rera, F. (2013). An insight into classification with im-
balanced data: Empirical results and current trends on
using data intrinsic characteristics. Information Sci-
ences, 250:113–141.
Sun, Y. (2007). Cost-sensitive boosting for classification of
imbalanced data. PhD thesis, Department of Electri-
cal and Computer Engineering, University of Water-
loo.
Sun, Y., Kamel, M. S., Wong, A. K. C., and Wang, Y.
(2007). Cost-sensitive boosting for classification of
imbalanced data. Pattern Recognition, 40(12):3358–
3378.
Wasikowski, M. and wen Chen, X. (2010). Combating the
small sample class imbalance problem using feature
selection. Knowledge and Data Engineering, IEEE
Transactions on, 22(10):1388–1400.
Webb, G. I., Boughton, J. R., and Wang, Z. (2005). Not so
naive bayes: Aggregating one-dependence estimators.
Machine Learning, 58(1):5–24.
ImpactonBayesianNetworksClassifiersWhenLearningfromImbalancedDatasets
389