performance for image classification tasks and evaluates various strategies, including oversampling. They caution that oversampling alone may not suffice to address class imbalance in CNNs because of the risk of overfitting, where models memorize the training data and perform poorly on unseen data. Furthermore, oversampling can generate unrealistic and redundant samples, making inefficient use of computational resources.
Several studies propose modifications to oversampling techniques to mitigate these issues. Rodríguez-Torres et al. (Rodríguez-Torres et al., 2022) introduce Large-scale Random Oversampling (LRO) to address class imbalance in large datasets. Comparisons with other oversampling methods, such as SMOTE and Borderline-SMOTE, show that LRO achieves higher accuracy and F1-score while remaining computationally efficient. The study also highlights SMOTE's limitations, including limited sample diversity and sensitivity to noise.
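To make this kind of comparison concrete, the following is a minimal sketch, assuming scikit-learn and imbalanced-learn, that benchmarks the oversampling methods named above on a synthetic imbalanced dataset. LRO is not available in imbalanced-learn, so plain random oversampling (`RandomOverSampler`) is used here as an illustrative stand-in; the dataset, classifier, and metric choices are our own assumptions, not those of the cited study.

```python
# Sketch: compare oversampling strategies on an imbalanced toy dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE

# Toy dataset with a 5% minority class.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

samplers = {
    "RandomOverSampler": RandomOverSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=0),
}
for name, sampler in samplers.items():
    # Oversample the training split only; the test split stays untouched.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    print(f"{name}: F1 = {f1_score(y_test, clf.predict(X_test)):.3f}")
```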
Overall, the literature highlights the limitations and challenges of oversampling and SMOTE for addressing imbalanced data in machine learning, and suggests alternative approaches and modifications to address these issues. The articles presented cover various aspects of oversampling and SMOTE, including overfitting, performance evaluation, large-dataset handling, multi-class imbalance, noise handling, and synthetic oversampling.
5 CONCLUSION AND PERSPECTIVES
In conclusion, oversampling is a valuable tool for improving machine learning model performance on imbalanced datasets. However, our research highlights the potential issues introduced by oversampling algorithms, particularly in the quality of the synthetic minority-class data, which can lead models to learn to predict noise rather than underlying patterns. To address these concerns, we have proposed a novel evaluation method that assesses and quantifies both the effectiveness of oversampling techniques and their potential to introduce detectable noise. By evaluating a model's ability to differentiate synthetic data from real data, we can identify potentially problematic oversampling methods and select the most suitable ones for a given dataset, ultimately enhancing model accuracy and generalizability (Boudegzdame et al., 2024). This approach also helps determine whether oversampling is suitable for balancing a given dataset.
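As an illustration of this idea (a minimal sketch, not the exact protocol of Boudegzdame et al., 2024), one can train a discriminator to separate real minority samples from SMOTE-generated ones and report its cross-validated ROC AUC: a score near 0.5 suggests the synthetic data is hard to detect, while a score near 1.0 flags an oversampler that leaves a detectable signature. The helper name `detectability_auc` and all dataset and classifier choices below are our own assumptions.

```python
# Sketch: quantify how detectable synthetic minority samples are.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE

def detectability_auc(X_real, X_syn, n_folds=5):
    """Cross-validated AUC of a real-vs-synthetic discriminator (hypothetical helper)."""
    X = np.vstack([X_real, X_syn])
    y = np.r_[np.zeros(len(X_real)), np.ones(len(X_syn))]  # 0 = real, 1 = synthetic
    clf = RandomForestClassifier(random_state=0)
    return cross_val_score(clf, X, y, cv=n_folds, scoring="roc_auc").mean()

# Imbalanced toy dataset; class 1 is the 10% minority class.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_min = X[y == 1]
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
X_syn = X_res[len(X):]  # imbalanced-learn appends synthetic samples after the originals
print(f"real-vs-synthetic AUC: {detectability_auc(X_min, X_syn):.3f}")
```

Using ROC AUC rather than accuracy keeps the score largely insensitive to the ratio of real to synthetic samples, which varies with the amount of oversampling performed.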
The perspectives of this study are: 1) delimiting the exact perimeter of the problem we identified, in particular by testing other existing oversampling techniques, such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014); 2) improving the measure we proposed for quantifying the detectability of synthetic data, for instance for multi-class and/or multi-label classification; and 3) designing new oversampling methods that are resilient to this problem.
ACKNOWLEDGEMENTS
This work was partially funded by the French National Research Agency (ANR) through the ABiMed Project [grant number ANR-20-CE19-0017-02].
REFERENCES
Batista, G. E., Prati, R. C., and Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1):20–30.

Blagus, R. and Lusa, L. (2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 14:106.

Boudegzdame, N., Sedki, K., Tsopra, R., and Lamy, J.-B. (2024). An approach for improving oversampling by filtering out unrealistic synthetic data. In Proceedings of ICAART 2024.

Buda, M., Maki, A., and Mazurowski, M. A. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259.

Bunkhumpornpat, C., Sinapiromsaran, K., and Lursinsap, C. (2009). Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 475–482.

Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

Chen, C., Liaw, A., and Breiman, L. (2004). Using random forest to learn imbalanced data. Technical Report 110, University of California, Berkeley.

Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233–240.

Drummond, C. and Holte, R. (2003). C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Datasets.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874.