to the algorithms’ parallel structure, our methods are
easy to implement in parallel. In future work, we
plan to implement the proposed methods on imbalanced
big data sets in a distributed MapReduce framework
to test their efficiency.
BIOINFORMATICS 2015 - International Conference on Bioinformatics Models, Methods and Algorithms