machine learning algorithms such as k-nearest
neighbours or deep learning (van Buuren &
Groothuis-Oudshoorn, 2011). These methods have
been found to improve the accuracy and reliability of
machine learning models. As a result, there is an
important drive to develop novel and accessible
software solutions that enable machine learning users
to easily fill in their datasets.
This study introduces AutoImpute (Autonomous
Imputation), a web-based solution for addressing the
missing data problem under different missing ratios.
To efficiently predict missing data, the proposed web
tool AutoImpute embeds an ensemble supervised
learning technique named Extra Trees, introduced by
Geurts et al. (2006).
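As a rough illustration of how an Extra Trees regressor could fill in one incomplete column, the following pure-Python sketch grows extremely randomized trees in the spirit of Geurts et al. (2006): each node draws k random features, picks one uniformly random cut point per feature, and keeps the cut that most reduces the variance of the target. This is not AutoImpute's implementation. The function names, the defaults (k, min_samples, n_trees), the toy data, and the assumption that only the target column contains gaps are all chosen here for illustration.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (node impurity)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, rng, k=2, min_samples=2):
    """Grow one extremely randomized regression tree.

    Internal nodes are tuples (feature, threshold, left, right);
    leaves are the mean target of the samples that reached them.
    """
    if len(y) < min_samples or len(set(y)) == 1:
        return mean(y)
    # Only features that still vary inside this node can be split on.
    usable = [j for j in range(len(X[0])) if len({row[j] for row in X}) > 1]
    candidates = []
    for j in rng.sample(usable, min(k, len(usable))):
        column = [row[j] for row in X]
        t = rng.uniform(min(column), max(column))  # random cut point
        left = [i for i, v in enumerate(column) if v < t]
        right = [i for i, v in enumerate(column) if v >= t]
        if left and right:
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            candidates.append((score, j, t, left, right))
    if not candidates:
        return mean(y)
    # Lowest total child impurity is the largest variance reduction.
    _, j, t, left, right = min(candidates, key=lambda c: c[0])
    return (
        j,
        t,
        build_tree([X[i] for i in left], [y[i] for i in left], rng, k, min_samples),
        build_tree([X[i] for i in right], [y[i] for i in right], rng, k, min_samples),
    )


def predict_tree(node, row):
    """Follow the cuts down to a leaf and return its stored mean."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if row[j] < t else right
    return node


def impute_column(rows, target, n_trees=25, seed=0):
    """Fill None entries of column `target` with the forest's average prediction.

    Assumption: every other column is fully observed.
    """
    rng = random.Random(seed)
    feats = [j for j in range(len(rows[0])) if j != target]
    train = [r for r in rows if r[target] is not None]
    X = [[r[j] for j in feats] for r in train]
    y = [r[target] for r in train]
    # Each tree sees the whole training sample (no bootstrap resampling).
    forest = [build_tree(X, y, rng) for _ in range(n_trees)]
    filled = []
    for r in rows:
        r = list(r)  # copy so the input table is left untouched
        if r[target] is None:
            x = [r[j] for j in feats]
            r[target] = mean([predict_tree(tree, x) for tree in forest])
        filled.append(r)
    return filled


# Hypothetical toy table: the third column is roughly 2 * first + second.
rows = [
    [1.0, 2.0, 4.0], [2.0, 1.0, 5.0], [3.0, 3.0, 9.0],
    [4.0, 0.0, 8.0], [5.0, 2.0, 12.0],
    [2.5, 2.0, None], [4.5, 1.0, None],
]
filled = impute_column(rows, target=2)
```

Because every leaf stores a mean of observed targets, each imputed value stays within the range of the observed values in that column. A table with gaps in several columns would need this step repeated per column, which the sketch does not attempt.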
Thanks to its user-friendly online interface,
AutoImpute is accessible to everyone, regardless of
technical expertise. As a consequence, the end user
may start a missing data imputation remotely and
receive the results once the procedure is done. The
outcomes of the imputation performed by
AutoImpute are presented on the web page and may be
exported. Few software
tools exist in the literature for implementing missing
data imputation processes. These include R packages
as well as generalised machine learning tools like
KEEL (Triguero et al., 2017).
However, unlike other software solutions in the
literature, AutoImpute opens missing data
imputation to a diverse scientific
community by requiring no programming expertise or
software installation. The effectiveness of the
imputation technique is
demonstrated in an experimental session in which
AutoImpute outperforms four software tools in
handling missing data on a healthcare dataset.
This paper is organised as follows. The problem
of missing values imputation is discussed in Section
2. The main part of the study is Section 3, which
describes the architecture of AutoImpute. Section 4
reports on the experimental setup and results, and
Section 5 concludes the paper.
2 MISSING VALUES PROBLEM
Missing data is a common challenge faced by
machine learning practitioners when analysing
real-world data (Bertsimas et al., 2018). Missing data can
occur for a variety of reasons, including incomplete
replies, equipment failure, and attrition (Dhindsa et
al., 2018). These problems can arise at any time and
are often difficult to control. Missing values are
largely unavoidable, even when specific precautions are
taken throughout the data collection procedure. Moreover,
failure to manage missing data correctly can result in
biased estimates, reduced statistical power, and
inaccurate conclusions, making it critical to treat the
issue correctly (Groenwold & Dekkers, 2020).
The handling of missing data during data pre-
processing has a substantial impact on the quality and
reliability of data analysis. Imputation is a common
data pre-processing approach that includes replacing
missing or incorrect information with predicted
values using various logical and statistical
methodologies (Azur et al., 2011). In principle,
imputation allows researchers to make informed
guesses to fill in gaps in the data, hence improving the
dataset's accuracy and completeness (van Buuren &
Groothuis-Oudshoorn, 2011). The aim of this study is
to present a new machine learning-based technique
that automatically replaces missing or inaccurate
values with an accurate approximation.
Rubin (1976) states that there are three basic
mechanisms for missing values, each with a unique
pattern of missing values. The first form is missing
completely at random (MCAR); as the name implies,
the probability that a value is missing depends on
neither the observed nor the unobserved data.
Because the missingness is unrelated to any data
values, analysing the remaining data almost
never produces bias. The second form is missing at
random (MAR), in which the probability of
missingness depends only on the observed data and
not on the missing values themselves.
Under both MCAR and MAR, a variety of
approaches, including multiple imputation and
maximum likelihood, are applicable (Gelman & Hill,
2010). The third and most difficult form is missing
not at random (MNAR), which covers the cases where
neither MCAR nor MAR holds; explicit assumptions
about the missingness process must then be made in
order to model it. This mechanism is
divided into two parts: (1) missingness linked to
unobserved predictors (MRUP), and (2) missingness
related to the missing value itself (MRMVI) (Ford,
1983).
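The difference between the mechanisms can be made concrete by generating missingness masks for a single variable. The sketch below is a minimal illustration under assumptions chosen here, not taken from the study: the variables (age, lifestyle, pressure), the probabilities, and the median-based rule are all hypothetical. The only thing that changes between mechanisms is which variable drives the probability of a value being missing.

```python
import random
from statistics import median


def mcar_mask(n, p, rng):
    """MCAR: every value is missing with the same probability p."""
    return [rng.random() < p for _ in range(n)]


def dependent_mask(driver, p_low, p_high, rng):
    """Missing with probability p_high where `driver` exceeds its median, else p_low."""
    cut = median(driver)
    return [rng.random() < (p_high if d > cut else p_low) for d in driver]


rng = random.Random(42)
n = 1000
age = [rng.uniform(20, 80) for _ in range(n)]    # recorded for everyone
lifestyle = [rng.gauss(0, 1) for _ in range(n)]  # never recorded in the dataset
pressure = [100 + 0.5 * a + 5 * z + rng.gauss(0, 5)
            for a, z in zip(age, lifestyle)]     # the variable that loses values

masks = {
    "MCAR": mcar_mask(n, 0.3, rng),                              # no driver at all
    "MAR": dependent_mask(age, 0.1, 0.5, rng),                   # driven by an observed variable
    "MNAR (MRUP)": dependent_mask(lifestyle, 0.1, 0.5, rng),     # driven by an unobserved predictor
    "MNAR (MRMVI)": dependent_mask(pressure, 0.1, 0.5, rng),     # driven by the value itself
}

for name, mask in masks.items():
    kept = [p for p, m in zip(pressure, mask) if not m]
    print(f"{name}: {sum(mask)} missing, mean of observed values = {sum(kept) / len(kept):.1f}")
```

Under MAR the dependence runs through a recorded variable (age), so an imputation model can exploit it. Under the two MNAR variants the driver is not available in the data, which is why explicit assumptions are needed.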
Building on this analysis, AutoImpute aims
to address missing values in all scenarios,
achieving its highest accuracy under the MAR mechanism, where the
missingness is related to observed values.
However, in the experiment section, the missing
values are artificially generated following the MCAR
mechanism with different missing ratios.
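One way such an experiment could be set up is sketched below: a fixed fraction of cells is hidden uniformly at random (MCAR), an imputer fills them in, and the filled values are compared with the hidden ground truth. The ratios (0.1, 0.3, 0.5), the RMSE score, the synthetic table, and the column-mean imputer are assumptions made for this illustration. The mean imputer is only a stand-in baseline, not the Extra Trees model used by AutoImpute.

```python
import random
from math import sqrt


def mask_mcar(table, ratio, seed=0):
    """Hide a fixed fraction of cells chosen uniformly at random (MCAR)."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(table) for j in range(len(row))]
    hidden = set(rng.sample(cells, round(ratio * len(cells))))
    masked = [[None if (i, j) in hidden else v for j, v in enumerate(row)]
              for i, row in enumerate(table)]
    return masked, hidden


def mean_impute(masked):
    """Stand-in imputer: fill each gap with the mean of its column's observed values."""
    means = []
    for column in zip(*masked):
        observed = [v for v in column if v is not None]
        # Fall back to 0.0 only if a whole column happens to be hidden.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in masked]


def rmse(truth, filled, hidden):
    """Root mean squared error, computed over the hidden cells only."""
    if not hidden:
        return 0.0
    return sqrt(sum((truth[i][j] - filled[i][j]) ** 2 for i, j in hidden) / len(hidden))


# Hypothetical complete table with two numeric columns.
rng = random.Random(7)
table = [[rng.gauss(50, 10), rng.gauss(5, 2)] for _ in range(200)]

for ratio in (0.1, 0.3, 0.5):
    masked, hidden = mask_mcar(table, ratio, seed=1)
    error = rmse(table, mean_impute(masked), hidden)
    print(f"ratio={ratio:.1f}  hidden cells={len(hidden)}  RMSE={error:.3f}")
```

Scoring only the hidden cells keeps the observed entries, which every imputer reproduces exactly, from diluting the error. Swapping `mean_impute` for another imputer with the same input and output shape leaves the rest of the protocol unchanged.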