performance for image classification tasks and evaluates various strategies, including oversampling. They caution that oversampling alone may not suffice to address class imbalance in CNNs because of the risk of overfitting, where models memorize the training data and perform poorly on unseen data. Furthermore, oversampling can generate unrealistic and redundant samples, making inefficient use of computational resources.
Several studies propose modifications to oversampling techniques to mitigate these issues. Rodríguez-Torres et al. (Rodríguez-Torres et al., 2022) introduce Large-scale Random Oversampling (LRO) to address class imbalance in large datasets. Comparisons with other oversampling methods, such as SMOTE and Borderline-SMOTE, show that LRO achieves higher accuracy and F1-score while remaining computationally efficient. The study also highlights SMOTE's limitations, including limited sample diversity and sensitivity to noise.
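To make this kind of comparison concrete, the following is a minimal sketch, assuming scikit-learn and imbalanced-learn, that benchmarks the oversampling methods named above on a synthetic imbalanced dataset. LRO is not available in imbalanced-learn, so plain random oversampling (`RandomOverSampler`) is used here as an illustrative stand-in; the dataset, classifier, and metric choices are our own assumptions, not those of the cited study.

```python
# Sketch: compare oversampling strategies on an imbalanced toy dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE

# Toy dataset with a 5% minority class.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

samplers = {
    "RandomOverSampler": RandomOverSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=0),
}
for name, sampler in samplers.items():
    # Oversample the training split only; the test split stays untouched.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    print(f"{name}: F1 = {f1_score(y_test, clf.predict(X_test)):.3f}")
```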
Overall, the literature highlights the limitations and challenges of oversampling and SMOTE for addressing imbalanced data in machine learning, and suggests alternative approaches and modifications to address these issues. The articles presented cover various aspects of oversampling and SMOTE, including overfitting, performance evaluation, large-dataset handling, multi-class imbalance, noise handling, and synthetic oversampling.
5 CONCLUSION AND PERSPECTIVES
In conclusion, oversampling is a valuable tool for improving machine learning model performance on imbalanced datasets. However, our research highlights the potential issues introduced by oversampling algorithms, particularly in the quality of the synthetic minority-class data, which can lead models to learn to predict noise rather than underlying patterns. To address these concerns, we have proposed a novel evaluation method that assesses and quantifies both the effectiveness of oversampling techniques and their potential to introduce detectable noise. By evaluating a model's ability to differentiate synthetic data from real data, we can identify potentially problematic oversampling methods and select the most suitable ones for a given dataset, ultimately enhancing model accuracy and generalizability (Boudegzdame et al., 2024). This approach also helps determine whether oversampling is suitable for balancing a given dataset.
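As an illustration of this idea (a minimal sketch, not the exact protocol of Boudegzdame et al., 2024), one can train a discriminator to separate real minority samples from SMOTE-generated ones and report its cross-validated ROC AUC: a score near 0.5 suggests the synthetic data is hard to detect, while a score near 1.0 flags an oversampler that leaves a detectable signature. The helper name `detectability_auc` and all dataset and classifier choices below are our own assumptions.

```python
# Sketch: quantify how detectable synthetic minority samples are.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE

def detectability_auc(X_real, X_syn, n_folds=5):
    """Cross-validated AUC of a real-vs-synthetic discriminator (hypothetical helper)."""
    X = np.vstack([X_real, X_syn])
    y = np.r_[np.zeros(len(X_real)), np.ones(len(X_syn))]  # 0 = real, 1 = synthetic
    clf = RandomForestClassifier(random_state=0)
    return cross_val_score(clf, X, y, cv=n_folds, scoring="roc_auc").mean()

# Imbalanced toy dataset; class 1 is the 10% minority class.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_min = X[y == 1]
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
X_syn = X_res[len(X):]  # imbalanced-learn appends synthetic samples after the originals
print(f"real-vs-synthetic AUC: {detectability_auc(X_min, X_syn):.3f}")
```

Using ROC AUC rather than accuracy keeps the score largely insensitive to the ratio of real to synthetic samples, which varies with the amount of oversampling performed.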
The perspectives of this study are: 1) delimiting the exact perimeter of the problem we identified, in particular by testing other existing oversampling techniques, such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014); 2) improving the measure we proposed for quantifying the detectability of synthetic data, for instance for multi-class and/or multi-label classification; and 3) designing new oversampling methods that are resilient to this problem.
ACKNOWLEDGEMENTS
This work was partially funded by the French National Research Agency (ANR) through the ABiMed Project [grant number ANR-20-CE19-0017-02].
REFERENCES
Batista, G. E., Prati, R. C., and Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1):20–30.

Blagus, R. and Lusa, L. (2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 14:106.

Boudegzdame, N., Sedki, K., Tsopra, R., and Lamy, J.-B. (2024). An approach for improving oversampling by filtering out unrealistic synthetic data. In Proceedings of ICAART 2024.

Buda, M., Maki, A., and Mazurowski, M. A. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259.

Bunkhumpornpat, C., Sinapiromsaran, K., and Lursinsap, C. (2009). Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 475–482.

Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

Chen, C., Liaw, A., and Breiman, L. (2004). Using random forest to learn imbalanced data. Technical Report 110, University of California, Berkeley.

Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233–240.

Drummond, C. and Holte, R. (2003). C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Datasets.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874.