Balanced Sampling Method for Imbalanced Big Data Using AdaBoost

Hong Gu, Tao Song

Abstract

With the arrival of the big data era, processing large volumes of data at much faster rates has become increasingly urgent and has attracted growing attention. Moreover, many real-world data applications exhibit severely skewed class distributions, and the underrepresented classes are usually the ones of interest to researchers. Variants of the boosting algorithm have been developed to cope with the class imbalance problem; however, owing to the inherently sequential nature of boosting, these methods cannot be directly applied to efficiently handle large-scale data. In this paper, we propose a new parallelized version of boosting, AdaBoost.Balance, to deal with imbalanced big data. It adopts a new balanced sampling method that combines undersampling with oversampling and can be computed simultaneously on multiple nodes to construct a final ensemble classifier. Consequently, it is easily implemented on parallel big data processing platforms such as the MapReduce framework.
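The abstract does not give the exact sampling procedure, but the core idea of a balanced sample (undersample the majority class, oversample the minority class, so that independent computing nodes can each train a base classifier on their own balanced subset) can be sketched as follows. The function name `balanced_sample` and the fixed per-class target size are illustrative assumptions, not the authors' implementation:

```python
import random

def balanced_sample(majority, minority, per_class, seed=0):
    """Draw one class-balanced sample: undersample the majority class
    (without replacement) and oversample the minority class (with
    replacement) so each class contributes per_class examples."""
    rng = random.Random(seed)
    under = rng.sample(majority, per_class)                   # undersampling
    over = [rng.choice(minority) for _ in range(per_class)]   # oversampling
    return under + over

# In a MapReduce-style setting, each map task could call balanced_sample
# with a different seed, train a base classifier on its sample, and the
# reduce step would aggregate the base classifiers into the ensemble.
majority = list(range(1000))        # e.g. 1000 majority-class examples
minority = list(range(1000, 1050))  # e.g. 50 minority-class examples
sample = balanced_sample(majority, minority, per_class=100)
print(len(sample))  # 200 examples, 100 per class
```

Because each balanced sample is drawn independently, the base classifiers can be built in parallel with no inter-node communication until the final aggregation step.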



Paper Citation


in Harvard Style

Gu H. and Song T. (2015). Balanced Sampling Method for Imbalanced Big Data Using AdaBoost. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015) ISBN 978-989-758-070-3, pages 189-194. DOI: 10.5220/0005254601890194


in Bibtex Style

@conference{bioinformatics15,
author={Hong Gu and Tao Song},
title={Balanced Sampling Method for Imbalanced Big Data Using AdaBoost},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)},
year={2015},
pages={189-194},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005254601890194},
isbn={978-989-758-070-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)
TI - Balanced Sampling Method for Imbalanced Big Data Using AdaBoost
SN - 978-989-758-070-3
AU - Gu H.
AU - Song T.
PY - 2015
SP - 189
EP - 194
DO - 10.5220/0005254601890194