The main goal of this work is to propose two Big Data preprocessing techniques to impute MVs. We will use the K-Means and Fuzzy K-Means algorithms as imputation techniques, estimating the MVs from the information of the clusters (Li et al., 2004). These two approaches will be implemented under Spark (Zaharia et al., 2016), redesigning the original algorithms to take full advantage of the MapReduce paradigm. The proposed techniques will be validated using different well-known data sets for Big Data classification benchmarking. In particular, we will simulate different amounts of MVs to check whether our proposals behave correctly in low- to high-missing-data scenarios, and we will evaluate the effect of the number of selected centroids on performance and running times.
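As a first intuition of this cluster-based imputation idea, the following is a minimal, single-machine sketch (not the Spark redesign proposed in this work): MVs are provisionally filled with column means, K-Means is run on the completed data, and each MV is then replaced with the corresponding value of its cluster centroid. The function name and the naive centroid initialization are illustrative assumptions.

```python
def kmeans_impute(data, k=2, iters=20):
    """Sketch of cluster-based imputation of missing values (None).

    1) Provisionally fill each MV with its column mean.
    2) Run standard (Lloyd's) K-Means on the completed data.
    3) Replace each MV with the corresponding value of the
       centroid of the cluster its instance belongs to.
    """
    d = len(data[0])

    # Column means over observed values only (provisional fill).
    col_means = []
    for j in range(d):
        obs = [row[j] for row in data if row[j] is not None]
        col_means.append(sum(obs) / len(obs))
    filled = [[row[j] if row[j] is not None else col_means[j]
               for j in range(d)] for row in data]

    def nearest(row, centroids):
        # Index of the closest centroid (squared Euclidean distance).
        return min(range(len(centroids)),
                   key=lambda i: sum((row[j] - centroids[i][j]) ** 2
                                     for j in range(d)))

    # Naive initialization: first k instances as initial centroids.
    centroids = [list(filled[i]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for row in filled:
            clusters[nearest(row, centroids)].append(row)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[i] = [sum(m[j] for m in members) / len(members)
                                for j in range(d)]

    # Impute each MV with the value of its cluster's centroid.
    return [[row[j] if row[j] is not None
             else centroids[nearest(frow, centroids)][j]
             for j in range(d)]
            for row, frow in zip(data, filled)]
```

On two well-separated groups of instances, an MV in an instance close to the low-valued group is thus filled with that group's centroid value rather than with a global statistic.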
The rest of this contribution is organized as fol-
lows. Section 2 will introduce the background on MV
imputation. Section 3 describes the two imputation
techniques proposed. Section 4 contains the experi-
mental analysis carried out with the proposals and the
compared techniques. Finally, Section 5 presents the
conclusions of our work.
2 MISSING VALUES TREATMENT
In the last few decades, a great deal of progress has been made in our capacity to generate and store data, largely due to the increasing processing power of machines together with low storage costs. However, these huge volumes of data hide a large amount of information of great strategic importance. The discovery of this hidden information is possible thanks to Big Data techniques, which apply machine learning algorithms to find patterns and relationships within the data, allowing the creation of models and abstract representations of reality.
To ensure that extracted models are accurate, the
quality of the source data must be as high as possi-
ble. Sadly, real-world data sources are often subject
to imperfections that diminish such quality. It is in this scenario where data preprocessing techniques are required, cleaning and transforming the data to increase its quality.
Among the main problems of real-world data, MVs are one of the most challenging, as the majority of Big Data techniques assume that the data is complete. The presence of MVs thus prevents practitioners from applying a large set of techniques. Hence, the treatment of MVs is one of the main paradigms within imperfect data treatment.
Before applying any preprocessing technique, we must acknowledge the type of missingness we are facing. The statistical dependencies between the corrupted and clean data dictate how the imperfect data can be handled. Originally, Little and Rubin (Little and Rubin, 2014) described the three main mechanisms of MV introduction. When the MV distribution is independent of any other variable, we face the Missing Completely at Random (MCAR) mechanism. A more general case is when the MV appearance is influenced by other observed variables, constituting the Missing at Random (MAR) case. These two scenarios enable the practitioner to use imputation methods to deal with MVs. Inspired by this classification,
Frénay and Verleysen (Frénay and Verleysen, 2014)
extended this classification to noise data, analogously
defining Noisy Completely at Random and Noisy at
Random. Thus, noise filters can only be safely ap-
plied with these two scenarios.
Alternatively, the value of the attribute itself can influence the probability of having a MV or a noisy value. These cases were named Missing Not at Random (MNAR) and Noisy Not at Random for MVs and noisy data, respectively. Blindly applying imputation in this case will bias the data. In these scenarios, we need to model the probability distribution of the missingness mechanism by using expert knowledge and introduce it into statistical techniques such as Multiple Imputation (Royston et al., 2004). To avoid the improper application of correcting techniques, some tests have been developed to evaluate the underlying mechanisms (Little, 1988), but careful data exploration must still be carried out first.
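To make the distinction between mechanisms concrete, the following minimal sketch injects MVs into a two-column data set under the MCAR and MAR mechanisms. The function name and the rates are illustrative assumptions; MNAR would additionally condition the deletion probability on the value of y itself.

```python
import random

def inject_missing(rows, mechanism, rate=0.3, seed=0):
    """Inject MVs (None) into the second column of (x, y) rows."""
    rng = random.Random(seed)
    out = []
    for x, y in rows:
        if mechanism == "MCAR":
            # Deletion probability independent of every variable.
            p = rate
        elif mechanism == "MAR":
            # Deletion probability depends only on the *observed* x.
            p = 2 * rate if x > 0 else 0.0
        else:
            raise ValueError(mechanism)
        out.append((x, None if rng.random() < p else y))
    return out
```

Under MAR, instances with x ≤ 0 never lose y here, so the missingness pattern can be explained from observed data; under MNAR it could not.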
Once we acknowledge the kind of MVs we are facing, there are different ways to approach the problem. For the sake of simplicity, we will focus on the MCAR and MAR cases by using imputation techniques, as MNAR implies a particular solution and modeling for each problem. When facing MAR or MCAR scenarios, the simplest strategy is to discard those instances that contain MVs. However, these instances may contain relevant information, or the number of affected instances may be extremely high; therefore, the elimination of these samples may not be practical and may even bias the data.
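This discarding strategy (often called listwise deletion) is trivial to express, which also makes its drawback apparent: every observed value in a partially complete instance is lost as well. The function name is an illustrative assumption.

```python
def drop_incomplete(rows):
    """Listwise deletion: keep only fully observed instances."""
    return [row for row in rows if all(v is not None for v in row)]
```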
Instead of eliminating the corrupted instances, the imputation of MVs is a popular option (Little and Rubin, 2014). The simplest and most popular estimate used to impute is the average value of the variable over the whole dataset, or its mode in the case of categorical variables.
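This mean/mode strategy can be sketched in a few lines; the following is a hypothetical single-machine version, where the function name and the `categorical` argument (marking categorical column indices) are illustrative assumptions.

```python
from collections import Counter

def mean_mode_impute(rows, categorical=()):
    """Fill MVs (None): column mean for numeric columns,
    column mode for categorical ones."""
    d = len(rows[0])
    fills = []
    for j in range(d):
        obs = [r[j] for r in rows if r[j] is not None]
        if j in categorical:
            fills.append(Counter(obs).most_common(1)[0][0])  # mode
        else:
            fills.append(sum(obs) / len(obs))                # mean
    return [[v if v is not None else fills[j]
             for j, v in enumerate(row)] for row in rows]
```

Each fill value is a single aggregate per column, which is what makes the strategy so cheap to compute, even distributedly.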
Mean imputation would constitute a perfect candidate
to be applied in Big Data environments as the mean of
each variable remains unaltered and can be performed
in O(n). However, this procedure presents drawbacks
that discourage its usage: the relationship among the
IoTBDS 2019 - 4th International Conference on Internet of Things, Big Data and Security