MULTI-CLASS DATA CLASSIFICATION FOR IMBALANCED DATA SET USING COMBINED SAMPLING APPROACHES
Wanthanee Prachuabsupakij, Nuanwan Soonthornphisaj
2011
Abstract
Two important challenges in machine learning are the imbalanced class problem and multi-class classification, because many real-world applications have imbalanced class distributions and involve classifying data into multiple classes. The primary problem of classification on imbalanced data sets concerns the measurement of performance. Standard learning algorithms tend to be biased towards the majority class and to ignore the minority class. This paper presents a new approach, KSAMPLING, which combines k-means clustering with sampling methods. The k-means algorithm is used to split the dataset into two clusters. We then combine two types of sampling technique, over-sampling and under-sampling, to re-balance the class distribution. We conducted experiments on five highly imbalanced datasets from the UCI repository, using decision trees as the classifier. The experimental results show that KSAMPLING achieves better prediction performance than state-of-the-art methods in terms of AUC, and that the F-measure is also improved.
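The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two sampling techniques are applied, so this sketch assumes plain random under-sampling and random over-sampling with replacement, both towards the mean class size within each cluster. The function names (two_means, rebalance, ksampling_sketch) are hypothetical, and the decision-tree step is left out. It should not be read as the authors' implementation.

```python
import random
from collections import defaultdict


def two_means(points, iters=20, seed=0):
    """Plain k-means with k=2; returns a cluster index (0 or 1) per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)  # two distinct points as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min((0, 1), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def rebalance(rows, seed=0):
    """Resample every class in rows, a list of (features, label), to the mean class size.

    Assumption: random under-sampling for large classes and random
    over-sampling with replacement for small ones.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[1]].append(row)
    target = round(len(rows) / len(by_class))
    balanced = []
    for members in by_class.values():
        if len(members) > target:
            balanced.extend(rng.sample(members, target))  # under-sample
        else:
            balanced.extend(members)  # keep originals, then over-sample
            balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def ksampling_sketch(X, y, seed=0):
    """Split the data into two clusters, then rebalance classes inside each one."""
    clusters = two_means(X, seed=seed)
    parts = {0: [], 1: []}
    for features, label, c in zip(X, y, clusters):
        parts[c].append((features, label))
    return {c: rebalance(rows, seed) for c, rows in parts.items() if rows}


if __name__ == "__main__":
    # Toy data: two well-separated blobs with skewed class counts.
    X = [(0.0, i * 0.01) for i in range(10)] + [(10.0, 10.0 + i * 0.01) for i in range(10)]
    y = ["a"] * 8 + ["b"] * 2 + ["c"] * 7 + ["a"] * 2 + ["b"]
    for cluster, rows in ksampling_sketch(X, y).items():
        counts = defaultdict(int)
        for _, label in rows:
            counts[label] += 1
        print(cluster, dict(counts))
```

Each rebalanced cluster could then be passed to a classifier such as a decision tree. Resampling inside each cluster rather than over the whole dataset is a choice made for this sketch.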
Paper Citation
in Harvard Style
Prachuabsupakij W. and Soonthornphisaj N. (2011). MULTI-CLASS DATA CLASSIFICATION FOR IMBALANCED DATA SET USING COMBINED SAMPLING APPROACHES. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011) ISBN 978-989-8425-79-9, pages 158-163. DOI: 10.5220/0003635201660171
in BibTeX Style
@conference{kdir11,
author={Wanthanee Prachuabsupakij and Nuanwan Soonthornphisaj},
title={MULTI-CLASS DATA CLASSIFICATION FOR IMBALANCED DATA SET USING COMBINED SAMPLING APPROACHES},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},
year={2011},
pages={158-163},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003635201660171},
isbn={978-989-8425-79-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)
TI - MULTI-CLASS DATA CLASSIFICATION FOR IMBALANCED DATA SET USING COMBINED SAMPLING APPROACHES
SN - 978-989-8425-79-9
AU - Prachuabsupakij W.
AU - Soonthornphisaj N.
PY - 2011
SP - 158
EP - 163
DO - 10.5220/0003635201660171