handles within-class imbalance. Further, the combination of cluster-based undersampling and SMOTE helps to reduce between-class imbalance without excessive sampling.
We were not able to establish a clear superiority
of one oversampling method over the other.
However, we were able to determine that the SCUT
method is a promising candidate for further
experimentation. Our results suggest that the SCUT algorithm is suitable for domains where the number of classes is high and the number of examples per class varies considerably. We intend to investigate this issue further. We also intend to extend our approach to very large datasets with extreme levels of imbalance, since our early results indicate that the SCUT approach may outperform undersampling-only techniques in such a setting. In this paper, the number of instances for each class was set to the mean value. The optimal strategy for fixing this number will be explored further, e.g. by sampling the instances directly from the distribution associated with the mixture of Gaussians obtained from the EM algorithm.
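As a rough illustration of this resampling scheme, the sketch below balances every class to the mean class size: larger classes are undersampled by drawing evenly from clusters (a crude random partition stands in here for the EM-fitted mixture of Gaussians used by SCUT), while smaller classes receive SMOTE-style interpolated examples. The function names and the clustering shortcut are our own simplifications, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like(X, n_new):
    # SMOTE-style oversampling, simplified: interpolate between a random
    # class member and another random class member (real SMOTE picks the
    # second point among k nearest neighbours).
    base = rng.integers(0, len(X), size=n_new)
    nbr = rng.integers(0, len(X), size=n_new)
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[nbr] - X[base])

def scut_sketch(X, y, n_clusters=2):
    # Resample every class to the mean class size.
    classes, counts = np.unique(y, return_counts=True)
    target = int(counts.mean())
    Xs, ys = [], []
    for c in classes:
        Xc = X[y == c]
        if len(Xc) > target:
            # Cluster-based undersampling: partition the class and draw
            # evenly from each part, so no region is wiped out entirely.
            parts = np.array_split(rng.permutation(len(Xc)), n_clusters)
            per_cluster = -(-target // n_clusters)  # ceil division
            keep = np.concatenate([p[:per_cluster] for p in parts])[:target]
            Xc = Xc[keep]
        elif len(Xc) < target:
            # Oversample minority classes with synthetic examples.
            Xc = np.vstack([Xc, smote_like(Xc, target - len(Xc))])
        Xs.append(Xc)
        ys.append(np.full(len(Xc), c))
    return np.vstack(Xs), np.concatenate(ys)
```

Applied to a dataset with class sizes 30, 10 and 5, for example, the sketch returns 15 examples per class, the mean class size.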
Cost-sensitive learning is another common
approach for dealing with the class-imbalance
problem. Most of the existing solutions are
applicable to binary-class problems, and cannot be
applied directly to multi-class imbalanced datasets
(Sun et al., 2006). Rescaling, a popular cost-sensitive learning approach for binary-class problems, can be applied directly to multi-class datasets with good performance only when the costs are consistent (Zhou and Liu, 2010). In
addition, rescaling classes based on cost information
may not be suitable for highly imbalanced datasets.
Designing a multi-class cost-sensitive learning
approach for inconsistent costs without transforming
the problem into a binary-class problem will be the
focus of our future work.
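To make the rescaling idea concrete, one common way to derive per-class costs from the data alone is to weight each class inversely to its frequency, w_c = N / (K * n_c). The sketch below is an illustration of this standard heuristic, not the rescaling scheme of Zhou and Liu (2010):

```python
import numpy as np

def balanced_costs(y):
    # Weight each class inversely to its frequency: w_c = N / (K * n_c),
    # where N is the dataset size, K the number of classes and n_c the
    # class count. A cost-weighted loss then treats classes as balanced.
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))
```

For labels with eight examples of class 0 and two of class 1, this yields weights 0.625 and 2.5. Such frequency-derived costs are consistent in the sense that every misclassification of a given class carries the same cost, which is exactly the setting where direct rescaling is known to work well.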
REFERENCES
Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F., 2011. KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. In Journal of Multiple-Valued Logic and Soft Computing, pages 255-287.
Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W., 2002. SMOTE: Synthetic minority over-sampling technique. In Journal of Artificial Intelligence Research, Volume 16, pages 321-357.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J., 2009. Modelling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, Volume 47, Number 4, pages 547-553.
Dempster, A. P., Laird, N.M., Rubin, D.B., 1977.
Maximum likelihood from incomplete data via the EM
algorithm. In Journal of the Royal Statistical Society,
Volume 39, Number 1, pages 1-38.
Fernández, A., Jesus, M., Herrera, F., 2010. Multi-class
imbalanced data-sets with linguistic fuzzy rule based
classification systems based on pairwise learning. In
Computational Intelligence for Knowledge-Based
System Design, Volume 6178, Number 20, pages 89-
98.
Han, H., Wang, W., Mao, B., 2005. Borderline-SMOTE:
A new over-sampling method in imbalanced data sets
learning. In Proceedings of International Conference
on Advances in Intelligent Computing, Springer,
Volume Part I, pages 878-887.
Japkowicz, N., 2001. Concept-learning in the presence of
between-class and within-class imbalances. In AI
2001: Lecture Notes in Artificial Intelligence, Volume
2056, Springer, pages 67-77.
Lichman, M., 2013. UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University
of California, School of Information and Computer
Science.
Rahman, M. M., Davis, D. N., 2013. Addressing the class
imbalance problem in medical datasets, In
International Journal of Machine Learning and
Computing, Volume 3, Number 2, pages 224-228.
Ramanan, A., Suppharangsan, S., Niranjan, M., 2007.
Unbalanced decision trees for multi-class
classification. In ICIIS 2007: IEEE International
Conference on Industrial and Information Systems,
IEEE Press, pages 291-294.
Sobhani, P., Viktor, H., Matwin, S., 2014. Learning from imbalanced data using ensemble methods and cluster-based undersampling. In NFMCP 2013: Lecture Notes in Computer Science, Volume 8983, Springer, pages 38-49.
Sun, Y., Kamel, M., Wang, Y., 2006. Boosting for
learning multiple classes with imbalanced class
distribution. In IEEE ICDM ’06: Proceedings of the
Sixth International Conference on Data Mining, IEEE
Press, pages 592-602.
Viktor, H.L., Paquet, E. and Zhao, J., 2013. Artificial
neural networks for predicting 3D protein structures
from amino acid sequences, In IEEE IJCNN:
International Joint Conference on Neural Networks,
IEEE Press, pages 1790-1797.
Wang, S., Yao, X., 2012. Multi-class imbalance problems:
analysis and potential solutions. In IEEE Transactions
on Systems, Man, and Cybernetics, Part B, Number 4,
pages 1119-1130.
Yen, S. J., Lee, Y. S., 2009. Cluster-based under-sampling
approaches for imbalanced data distributions. In
Expert Systems with Applications, Volume 36,
Number 3, pages 5718-5727.