ing up approach. Pattern Recognition, 41(8):2693–
2709.
Garcia, S., Derrac, J., Cano, J., and Herrera, F. (2012).
Prototype selection for nearest neighbor classification:
Taxonomy and empirical study. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 34(3):417–435.
García, S., Luengo, J., and Herrera, F. (2014). Data
Preprocessing in Data Mining. Springer Publishing
Company, Incorporated.
García-Gil, D., Luengo, J., García, S., and Herrera, F.
(2019). Enabling Smart Data: Noise filtering in Big
Data classification. Information Sciences, 479:135–152.
García-Gil, D., Ramírez-Gallego, S., García, S., and
Herrera, F. (2017). A comparison on scalability for batch
big data processing on Apache Spark and Apache Flink.
Big Data Analytics, 2(1):1.
García-Gil, D., Ramírez-Gallego, S., García, S., and
Herrera, F. (2018). Principal Components Analysis
Random Discretization Ensemble for Big Data.
Knowledge-Based Systems, 150:166–174.
Iafrate, F. (2014). A Journey from Big Data to Smart Data,
pages 25–33. Springer International Publishing.
Katakis, I., Tsoumakas, G., and Vlahavas, I. (2005). On the
utility of incremental feature selection for the classifi-
cation of textual data streams. In Panhellenic Confer-
ence on Informatics, pages 338–348. Springer.
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman,
S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen,
S., et al. (2016). MLlib: Machine learning in Apache
Spark. The Journal of Machine Learning Research,
17(1):1235–1241.
Ramírez-Gallego, S., García, S., and Herrera, F. (2018a).
Online entropy-based discretization for data streaming
classification. Future Generation Computer Systems,
86:59–70.
Ramírez-Gallego, S., García, S., Mouriño-Talín, H.,
Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos,
A., Benítez, J. M., and Herrera, F. (2016).
Data discretization: taxonomy and big data challenge.
Wiley Interdisciplinary Reviews: Data Mining and
Knowledge Discovery, 6(1):5–21.
Ramírez-Gallego, S., Mouriño-Talín, H., Martínez-Rego,
D., Bolón-Canedo, V., Benítez, J. M., Alonso-Betanzos,
A., and Herrera, F. (2018b). An information
theory-based feature selection framework for big data
under Apache Spark. IEEE Transactions on Systems,
Man, and Cybernetics: Systems, 48(9):1441–1453.
Ramírez-Gallego, S., García, S., Benítez, J., and Herrera,
F. (2018). A distributed evolutionary multivariate
discretizer for big data processing on Apache Spark.
Swarm and Evolutionary Computation, 38:240–250.
Sánchez, J., Barandela, R., Marqués, A., Alejo, R., and
Badenas, J. (2003). Analysis of new techniques to obtain
quality training sets. Pattern Recognition Letters,
24(7):1015–1022.
Sánchez, J., Pla, F., and Ferri, F. (1997). Prototype
selection for the nearest neighbour rule through proximity
graphs. Pattern Recognition Letters, 18(6):507–513.
Skalak, D. B. (1994). Prototype and feature selection by
sampling and random mutation hill climbing algo-
rithms. In Machine Learning Proceedings 1994, pages
293–301. Elsevier.
Tomek, I. (1976). An experiment with the edited
nearest-neighbor rule. IEEE Transactions on Systems, Man,
and Cybernetics, (6):448–452.
Triguero, I., Derrac, J., Garcia, S., and Herrera, F. (2012). A
taxonomy and experimental study on prototype gener-
ation for nearest neighbor classification. IEEE Trans-
actions on Systems, Man, and Cybernetics, Part C
(Applications and Reviews), 42(1):86–100.
Triguero, I., García, S., and Herrera, F. (2011). Differential
evolution for optimizing the positioning of prototypes
in nearest neighbor classification. Pattern Recognition,
44(4):901–916.
Triguero, I., García-Gil, D., Maillo, J., Luengo, J., García,
S., and Herrera, F. Transforming big data into smart
data: An insight on the use of the k-nearest neighbors
algorithm to obtain quality data. Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery,
0(0):e1289.
Triguero, I., Peralta, D., Bacardit, J., García, S., and
Herrera, F. (2015). MRPR: A MapReduce solution for
prototype reduction in big data classification.
Neurocomputing, 150:331–345.
Wang, J., Zhao, P., Hoi, S. C., and Jin, R. (2014). On-
line feature selection and its applications. IEEE
Transactions on Knowledge and Data Engineering,
26(3):698–710.
Webb, G. I. (2014). Contrary to popular belief incremen-
tal discretization can be sound, computationally ef-
ficient and extremely useful for streaming data. In
2014 IEEE International Conference on Data Mining,
pages 1031–1036.
Wilson, D. L. (1972). Asymptotic properties of nearest
neighbor rules using edited data. IEEE Transactions
on Systems, Man, and Cybernetics, SMC-2(3):408–
421.
Wu, X. and Zhu, X. (2008). Mining with noise knowledge:
error-aware data mining. IEEE Transactions on Sys-
tems, Man, and Cybernetics-Part A: Systems and Hu-
mans, 38(4):917–932.
Yu, L. and Liu, H. (2003). Feature selection for high-
dimensional data: A fast correlation-based filter solu-
tion. In Proceedings of the 20th international confer-
ence on machine learning (ICML-03), pages 856–863.
Big Data Preprocessing as the Bridge between Big Data and Smart Data: BigDaPSpark and BigDaPFlink Libraries