G-Means and 6 occasions based on PR-AUC.
MAHAKIL is the next best performer: it statistically
outperformed its comparable methods on 11 occasions
based on F1, 13 occasions based on G-Means and only
2 occasions based on PR-AUC. Both BL-SMOTE and
DB-SMOTE statistically outperformed their
comparable methods on only 4 occasions using F1.
DB-SMOTE performed considerably better when
evaluated using G-Means, outperforming its
comparable methods on 8 occasions. BL-SMOTE
comes last, as it statistically outperformed its
comparable methods on only 3 occasions based on
G-Means and on 6 occasions based on PR-AUC.
From Table 7, CDO is the best-performing
algorithm for 10 minority instances. It statistically
outperformed its comparable algorithms on 13
occasions based on F1, 14 occasions based on
G-Means and 7 occasions based on PR-AUC. Among
the remaining algorithms, MAHAKIL is the second-best
performer, statistically outperforming its comparable
methods on 12 occasions using F1, 14 occasions using
G-Means and 2 occasions using PR-AUC. DB-SMOTE
comes third, statistically outperforming its comparable
methods on 12 occasions using F1, 8 occasions using
G-Means and 2 occasions using PR-AUC. BL-SMOTE
comes last, as it barely outperformed the other methods
(2 occasions using F1, none on G-Means and 1 occasion
on PR-AUC).
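As a concrete reference for the three evaluation criteria used above, the sketch below computes F1, G-Means and a PR-AUC approximation (average precision) in plain Python. This is illustrative only, not the paper's evaluation code; the confusion-matrix counts and score lists are made-up examples.

```python
# Illustrative sketch (not the paper's evaluation code): the three
# criteria used above, computed from scratch in plain Python.

def f1_and_gmeans(tp, fp, fn, tn):
    """F1 and G-Means from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # sensitivity / true positive rate
    specificity = tn / (tn + fp)         # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    gmeans = (recall * specificity) ** 0.5
    return f1, gmeans

def average_precision(y_true, scores):
    """PR-AUC approximated as average precision over ranked scores."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    tp, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            ap += tp / rank              # precision at each positive hit
    return ap / sum(y_true)
```

For example, a classifier that finds 8 of 10 minority instances with 2 false positives among 90 majority instances gets F1 = 0.80 and G-Means ≈ 0.88; PR-AUC additionally rewards ranking minority instances ahead of majority ones.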
6 DISCUSSION
As shown in the statistical test results, CDO
outperforms MAHAKIL in most cases, and both CDO
and MAHAKIL perform markedly better than
BL-SMOTE and DB-SMOTE. This can be explained
by their ability to capture more information when
constructing the minority generation region: both
CDO and MAHAKIL consider the entire minority
class distribution and generate instances diversely
within the boundaries of the identified generation
region. In contrast, SMOTE-based methods typically
create synthetic instances using linear interpolation.
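The linear-interpolation step is simple to state: a synthetic instance is placed on the line segment joining a minority instance and one of its minority-class neighbours. A minimal sketch of this core step (illustrative, not the original SMOTE implementation; nearest-neighbour selection is omitted):

```python
import random

def smote_interpolate(x, neighbour, rng=None):
    """Place one synthetic point on the segment between x and neighbour."""
    rng = rng or random.Random()
    gap = rng.random()  # uniform gap in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]
```

Because every synthetic point lies on such a segment, SMOTE-based variants can only fill the space spanned by existing neighbour pairs, which is the limitation the diversity-driven methods above address.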
Evaluating the statistical significance of
CDO's performance, CDO outperforms MAHAKIL
as the minority instances become sparser. This is due
to the nature of the MAHAKIL algorithm, which only
performs well when the minority data distribution is
convex and when there is a sufficient number of
minority instances (Khorshidi & Aickelin, 2021). In
addition, the MAHAKIL algorithm does not consider
clusters within datasets, which results in a broader
generation region for minority instances and leads to
a higher false positive rate. The main reason for the
superiority of CDO over MAHAKIL in terms of
PR-AUC is that MAHAKIL generates synthetic
instances, even if few, in the majority space (see
Figure 1). This lowers precision, which is picked up
by PR-AUC.
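The cluster-then-diversify intuition behind CDO can be sketched schematically. The toy code below is an illustration under our own assumptions, not the authors' implementation: it confines generation to a minority cluster's bounding box and greedily keeps candidates that maximise the minimum distance to points already present, whereas the paper optimises its diversity objective with a genetic algorithm.

```python
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def generate_diverse(cluster, n_new, n_candidates=200, rng=None):
    """Toy stand-in for CDO's per-cluster generation (illustrative only):
    sample candidates inside the cluster's bounding box, then greedily
    select the most spread-out ones (farthest-point selection)."""
    rng = rng or random.Random(0)
    dims = list(zip(*cluster))                    # per-dimension values
    lo = [min(d) for d in dims]
    hi = [max(d) for d in dims]
    candidates = [[rng.uniform(l, h) for l, h in zip(lo, hi)]
                  for _ in range(n_candidates)]
    selected = []
    for _ in range(n_new):
        # keep the candidate farthest from everything retained so far
        best = max(candidates,
                   key=lambda c: min(dist(c, s) for s in selected + cluster))
        selected.append(best)
        candidates.remove(best)
    return selected
```

Restricting candidates to the cluster's bounding box mirrors why a cluster-aware generation region keeps synthetic points out of the majority space, while the farthest-point selection mimics the spread that the genetic algorithm optimises.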
7 CONCLUSIONS
In this study, our key objective is to design an
algorithm that generates diversified synthetic
instances within the minority class while considering
the distribution of the minority data space. We
incorporate diversity optimization, which optimises
both the similarity of synthetic instances to minority
instances and the diversity among them. The
proposed algorithm first utilises a clustering
technique to identify the boundaries for the
generation of minority instances and to preserve
similarity between minority instances. Subsequently,
diversity optimization is applied to promote diversity
within clusters. The proposed method, CDO, is
evaluated on 10 real-world datasets and achieves
statistically superior performance over its
comparable methods. Its superior performance can be
attributed to its ability to identify the minority space
for synthetic data generation and to obtain an optimal
spread of generated instances through the genetic
algorithm. The proposed algorithm is evaluated on
two-class imbalanced datasets; for future research,
we plan to extend CDO to address multi-class
imbalance problems.
REFERENCES
Ali, A., Shamsuddin, S. M., & Ralescu, A. L. (2013).
Classification with class imbalance problem. Int. J.
Advance Soft Compu. Appl, 5(3).
Bennin, K. E., Keung, J., Phannachitta, P., Monden, A., &
Mensah, S. (2017). MAHAKIL: Diversity based
oversampling approach to alleviate the class imbalance
issue in software defect prediction. IEEE Transactions
on Software Engineering, 44(6), 534-550.
Chawla, N. V. (2009). Data mining for imbalanced datasets:
An overview. Data mining and knowledge discovery
handbook, 875-886.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P.
(2002). SMOTE: synthetic minority over-sampling
technique. Journal of artificial intelligence research,
16, 321-357.