et al., 2018). Although recent literature has shown a
significant increase in the accuracy of advanced
imputation methods, the high computational cost of these
methods in various tasks has often raised concerns.
It has therefore become paramount to address the
question of how the computation time of new methods
can be reduced without sacrificing their accuracy
(Tran et al., 2018).
In recent years, various machine learning (ML)
algorithms have been introduced to handle data
incompleteness, which often occurs as a result of
missing values (Angelov, 2017). These algorithms impute
the most plausible values in instances with missing
values. In contrast to popular statistical methods for
filling in missing values, ML algorithms use the
existing data in a dataset to train a model that is
then used to impute the missing values. Various ML
algorithms for imputing missing values have been
identified in the literature, including probabilistic
methods, decision trees and rule-based methods
(Farhangfar et al., 2008).
In this paper, we propose a novel imputation technique
which utilizes the similarity between observed values
to perform imputation. The incomplete dataset is first
partitioned into clusters, and the similar records
within each cluster are then used to estimate the
missing values. A challenging issue with this approach
is how to perform clustering on the incomplete dataset
before imputation. To solve this problem, we initially
assign distinctive values to replace all the missing
values. This reduces the effect of missing values in
the dataset and enhances clustering on the incomplete
dataset.
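
The cluster-then-impute idea described above can be illustrated with a minimal sketch. It is not the BFMVI algorithm itself: the placeholder choice (the column mean), the use of plain k-means, the cluster-mean estimate and the function name cluster_then_impute are all assumptions made here for illustration only.

```python
def cluster_then_impute(rows, k=2, n_iter=20):
    """Placeholder-fill, cluster with plain k-means, then impute per cluster.

    rows is a list of equal-length numeric lists; None marks a missing cell.
    """
    n_cols = len(rows[0])

    # Step 1: temporary placeholder for each missing cell
    # (assumption: the mean of the observed values in that column).
    col_means = []
    for j in range(n_cols):
        obs = [r[j] for r in rows if r[j] is not None]
        col_means.append(sum(obs) / len(obs))
    filled = [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
              for r in rows]

    # Step 2: k-means on the placeholder-filled data
    # (assumption: the first k rows seed the centroids).
    centroids = [list(r) for r in filled[:k]]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        for i, r in enumerate(filled):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centroids[c])),
            )
        for c in range(k):
            members = [filled[i] for i in range(len(rows)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    # Step 3: replace each placeholder with the mean of the values
    # observed for that attribute in the same cluster.
    result = [list(r) for r in filled]
    for i, r in enumerate(rows):
        for j in range(n_cols):
            if r[j] is None:
                peers = [rows[t][j] for t in range(len(rows))
                         if labels[t] == labels[i] and rows[t][j] is not None]
                if peers:
                    result[i][j] = sum(peers) / len(peers)
    return result
```

In this sketch a record whose cluster contains no observed value for the missing attribute simply keeps its placeholder.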
We evaluate the performance of our proposed BFMVI
technique against existing techniques, namely LSI,
FIMUS, FCM, DMI and EMI, on six datasets obtained from
the University of California Irvine (UCI) machine
learning repository.
2 RELATED WORKS
Many research efforts have been channelled towards
addressing the issue of data incompleteness by
attempting to develop more accurate and reliable
imputation techniques. In this section, we review
related research and recent efforts aimed at
addressing this problem.
A framework for the imputation of missing values using
co-appearance, correlation and similarity analysis
(FIMUS) was proposed by Rahman and Islam (2014). The
overall idea behind this method is to make educated
guesses based on the correlation between attributes,
the co-appearance of values and the similarity between
values that belong to an attribute. Unlike various
existing techniques, FIMUS can also be used to impute
missing categorical values. To compute co-appearances
between values that belong to different attributes,
FIMUS first summarizes the values of numerical
attributes into categories. For instance, the algorithm
groups the values of an attribute $A_p$ into
$\sqrt{|A_p|}$ categories, where $|A_p|$ is the domain
size of $A_p$. This grouping strategy is advantageous
due to its simplicity. However, it may not always
detect natural groups because it artificially makes the
range of values for each category equal.
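
The equal-width grouping described above can be sketched as follows. This is an illustrative sketch rather than the FIMUS implementation: the function name equal_width_categories, the rounding of the square root and the treatment of the domain size as the number of distinct observed values are assumptions.

```python
import math


def equal_width_categories(values):
    """Map each numeric value to the index of an equal-width category.

    The number of categories is the rounded square root of the domain size,
    taken here as the number of distinct observed values (assumption).
    None marks a missing value and is passed through unchanged.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return [None] * len(values)
    domain_size = len(set(observed))
    n_categories = max(1, round(math.sqrt(domain_size)))
    lo, hi = min(observed), max(observed)
    # Every category spans the same range; fall back to 1.0 for constant columns.
    width = (hi - lo) / n_categories or 1.0
    categories = []
    for v in values:
        if v is None:
            categories.append(None)
        else:
            # Clamp so that the maximum value falls into the last category.
            categories.append(min(int((v - lo) / width), n_categories - 1))
    return categories
```

For the values 1 to 8 plus a single outlier of 100, the domain size is 9, so three categories of width 33 are formed. The eight small values share the first category, the middle category stays empty and the outlier takes the last one, which illustrates how equal ranges can miss natural groups.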
Various missing value imputation techniques have
approached imputation using clustering schemes such as
k-means and fuzzy c-means (FCM). One such technique,
proposed by Zhang et al. (2018), first partitions a
dataset into k clusters. This produces membership
values for the items with respect to each cluster
centroid. All the missing values are then estimated
using the membership degrees of the objects that belong
to the same cluster. The simplicity of this method
constitutes a major advantage. However, the accuracy of
FCM imputation may be significantly affected by the
clustering results, particularly in the common
situation where selecting a suitable number of clusters
k is challenging for data miners.
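
A minimal sketch of membership-based imputation is given below, assuming the cluster centroids have already been computed. It uses the standard fuzzy c-means membership formula with fuzzifier m and fills a missing attribute with the membership-weighted average of the centroid values. The function names and the exact weighting are assumptions and may differ from the method of Zhang et al. (2018).

```python
def fcm_memberships(record, centroids, m=2.0):
    """Fuzzy membership of one record in each cluster.

    Distances are computed over the observed attributes only (None = missing).
    """
    dists = []
    for c in centroids:
        d2 = sum((x - cx) ** 2 for x, cx in zip(record, c) if x is not None)
        dists.append(d2 ** 0.5)
    # A record lying exactly on one or more centroids belongs fully to them.
    if any(d == 0 for d in dists):
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** power for dj in dists) for di in dists]


def fcm_impute(record, centroids, m=2.0):
    """Fill missing attributes with the membership-weighted centroid values."""
    u = fcm_memberships(record, centroids, m)
    filled = list(record)
    for j, x in enumerate(record):
        if x is None:
            filled[j] = sum(ui * c[j] for ui, c in zip(u, centroids))
    return filled
```

Because the memberships sum to one, each imputed value lies between the smallest and largest centroid values for that attribute, so a poor choice of k directly limits what can be imputed.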
The expectation maximization imputation (EMI) technique
(Schneider, 2001; Dempster et al., 1977) is one of the
most popular missing value imputation techniques
identified in the literature. To impute missing
numerical values, this technique estimates the mean and
covariance matrix from the observed values in a dataset
and iterates until no considerable change is noticed in
the imputed values, the mean and the covariance matrix
from one iteration to the next. According to research,
the EMI algorithm works best only in datasets with
values that are missing at random. The main
disadvantage of this method, however, is that it relies
on the information from other values in the dataset.
Therefore, it is only suitable for datasets with high
correlation among attributes (Deb and Liew, 2016).
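
The iteration described above can be sketched for the simplest case of two attributes, where x is fully observed and y has missing entries. This is a simplified illustration and not the regularized algorithm of Schneider (2001). The function name em_impute_bivariate is an assumption, and the variance of y is not tracked because it does not affect the imputed values in this setting.

```python
def em_impute_bivariate(xs, ys, tol=1e-9, max_iter=1000):
    """EM-style imputation of missing ys (None) given fully observed xs."""
    n = len(xs)
    missing = [i for i, y in enumerate(ys) if y is None]
    observed_mean = sum(y for y in ys if y is not None) / (n - len(missing))
    # Initial guess: the mean of the observed y values.
    filled = [observed_mean if y is None else y for y in ys]
    mean_x = sum(xs) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    for _ in range(max_iter):
        # Re-estimate the mean of y and cov(x, y) from the completed data.
        mean_y = sum(filled) / n
        cov_xy = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(xs, filled)) / n
        slope = cov_xy / var_x
        # Replace each missing y by its conditional expectation given x.
        change = 0.0
        for i in missing:
            new_y = mean_y + slope * (xs[i] - mean_x)
            change = max(change, abs(new_y - filled[i]))
            filled[i] = new_y
        # Stop once the imputed values no longer change noticeably.
        if change < tol:
            break
    return filled
```

The imputed values are driven entirely by the estimated covariance between the two attributes. When that covariance is close to zero, every missing entry collapses towards the mean of y, which is why the approach suits datasets with high correlation among attributes.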
Another technique used to handle missing data is the
decision tree based missing value imputation (DMI)
algorithm proposed by Rahman and Islam (2013). This
technique combines a decision tree with the EMI
algorithm to impute missing values. The authors argue
that attributes within a horizontal partition of a
dataset can have higher correlation than the
correlation of attributes over the
Best Fit Missing Value Imputation (BFMVI) Algorithm for Incomplete Data in the Internet of Things