in ensembles. In Proceedings of the 23rd International
Conference on Enterprise Information Systems, pages
652–659. INSTICC, SciTePress.
Hagan, M. T., Demuth, H. B., and Beale, M. (1997). Neural
network design. PWS Publishing Co.
Hart, P. E., Stork, D. G., and Duda, R. O. (2000). Pattern classification. Wiley, Hoboken.
Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12.
Haykin, S. (2004). Neural networks: A comprehensive foundation.
Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., and Schölkopf, B. (1998). Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18–28.
Hoo, Z. H., Candlish, J., and Teare, D. (2017). What is an ROC curve? Emergency Medicine Journal, 34(6):357–359.
Hsu, H. and Lachenbruch, P. A. (2014). Paired t test. Wiley StatsRef: Statistics Reference Online.
Ishibuchi, H., Nakashima, T., and Nii, M. (2004). Classification and modeling with linguistic information granules: Advanced approaches to linguistic data mining. Springer Science & Business Media.
Japkowicz, N. (2000). The class imbalance problem: Significance and strategies. In Proc. of the Int'l Conf. on Artificial Intelligence, volume 56. Citeseer.
Japkowicz, N. and Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'95, pages 1137–1143, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Kourou, K., Exarchos, T. P., Exarchos, K. P., Karamouzis, M. V., and Fotiadis, D. I. (2015). Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal, 13:8–17.
Kuncheva, L. I. and Whitaker, C. J. (2003). Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207.
Lemaître, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1):559–563.
Liashchynskyi, P. and Liashchynskyi, P. (2019). Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv preprint arXiv:1912.06059.
Merz, C. J. (1999). Using correspondence analysis to combine classifiers. Machine Learning, 36(1-2):33–58.
Mosavi, A., Ozturk, P., and Chau, K.-w. (2018). Flood prediction using machine learning models: Literature review. Water, 10(11):1536.
Muliono, R., Lubis, J. H., and Khairina, N. (2020). Analysis k-nearest neighbor algorithm for improving prediction student graduation time. Sinkron: Jurnal dan Penelitian Teknik Informatika, 4(2):42–46.
Myles, A. J., Feudale, R. N., Liu, Y., Woody, N. A., and Brown, S. D. (2004). An introduction to decision tree modeling. Journal of Chemometrics: A Journal of the Chemometrics Society, 18(6):275–285.
Opitz, D. and Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169–198.
Oreski, G. and Oreski, S. (2014). An experimental comparison of classification algorithm performances for highly imbalanced datasets. In Central European Conference on Information and Intelligent Systems, page 4. Faculty of Organization and Informatics Varazdin.
Park, B. and Bae, J. K. (2015). Using machine learning algorithms for housing price prediction: The case of Fairfax County, Virginia housing data. Expert Systems with Applications, 42(6):2928–2934.
Polat, K., Yosunkaya, Ş., and Güneş, S. (2008). Comparison of different classifier algorithms on the automated detection of obstructive sleep apnea syndrome. Journal of Medical Systems, 32(3):243–250.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1):81–106.
Quinlan, J. R. (1996). Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1, pages 725–730.
Saqlain, M., Jargalsaikhan, B., and Lee, J. Y. (2019). A voting ensemble classifier for wafer map defect patterns identification in semiconductor manufacturing. IEEE Transactions on Semiconductor Manufacturing, 32(2):171–182.
Schapire, R. E. (2013). Explaining AdaBoost. In Empirical Inference, pages 37–52. Springer.
Steinwart, I. and Christmann, A. (2008). Support vector
machines. Springer Science & Business Media.
Thabtah, F., Hammoud, S., Kamalov, F., and Gonsalves, A. (2020). Data imbalance in classification: Experimental evaluation. Information Sciences, 513:429–441.
Visa, S., Ramsay, B., Ralescu, A. L., and Van Der Knaap,
E. (2011). Confusion matrix-based feature selection.
MAICS, 710:120–127.
Yap, B. W., Abd Rani, K., Abd Rahman, H. A., Fong, S., Khairudin, Z., and Abdullah, N. N. (2014). An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013), pages 13–22. Springer.
Zhang, H. (2005). Exploring conditions for the optimality of naive Bayes. International Journal of Pattern Recognition and Artificial Intelligence, 19(02):183–198.