complexity and local optima issues. Conversely,
statistical methods may introduce bias and complexity,
relying on initial guesses and eigenvector
representations (Troyanskaya et al., 2001).
The use of rough set theory, introduced by Pawlak
(2012), for missing data imputation is motivated by its
strength in handling vagueness and incompleteness in
data without requiring additional information. It
provides robust approximations and decision rules
directly from the dataset, ensuring both effectiveness
and interpretability. The use of intuitionistic rough
sets, rather than classical rough sets, further enhances
this capability by addressing both uncertainty and
vagueness through the inclusion of membership and
non-membership functions. This dual aspect offers a
more nuanced approximation, particularly useful in
scenarios with incomplete or imprecise data, where
classical rough sets may not fully capture the inherent
uncertainty.
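In standard notation, an intuitionistic fuzzy set A over a universe X assigns to every x in X a membership degree μ_A(x) and a non-membership degree ν_A(x) with 0 ≤ μ_A(x) + ν_A(x) ≤ 1; the residual π_A(x) = 1 − μ_A(x) − ν_A(x), the hesitation margin, quantifies exactly the kind of remaining uncertainty that a single membership degree cannot express.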
In this paper, we introduce a novel approach to
missing data imputation, leveraging the combination
of Intuitionistic Fuzzy (IF) rough sets and the nearest
neighbour (NN) algorithm. By integrating IF rough sets
with NN estimation, we aim to capitalize on the
accuracy of NN methods while enhancing noise
tolerance and robustness. Specifically, we propose IF
rough-nearest neighbour imputation methods. The
subsequent sections of this paper are organized as
follows: Section 2 reviews relevant literature. Section
3 provides essential preliminaries to understand the
theoretical background. Section 4 introduces the
proposed methodologies. Section 5 presents the
implementation of these methods on benchmark
datasets and evaluates their performance using non-
parametric statistical analysis. Finally, Section 6
concludes our work and outlines future research
directions.
2 LITERATURE REVIEW
Researchers in various domains, such as meteorology
and transportation, have addressed the treatment of
data with missing values. Although several algorithms
based on different approaches have been proposed, no
single one is commonly employed across domains or
datasets.
Notable imputation techniques frequently used across
fields include those based on Nearest Neighbours
(NN), which predict missing values based on
neighboring instances. While NN methods offer
accuracy and simplicity, they come with drawbacks
such as the need for specifying the number of
neighbors, high time complexity, and local optima
issues.
Troyanskaya et al. (2001) proposed two methods,
KNN and SVD, for imputation in DNA microarrays.
KNN computes a weighted average of values based
on Euclidean distance from the K closest genes, while
SVD employs an expectation maximization (EM)
algorithm to approximate missing values. Comparing
the two, KNN showed greater robustness, particularly
with increasing percentages of missing values. Batista
and Monard (2003) introduced the k-nearest neighbor
imputation (KNNI) method, which replaces each missing
value with the mean of the corresponding attribute
among the nearest neighbors. Grzymala-Busse (2005)
introduced the global most common (GMC) and global
most common average (GMCA) methods for nominal and numeric
attributes, respectively, where missing values are
replaced by the most common or average attribute
values. Kim et al. (2005) proposed the local least
squares imputation (LLSI) method, which estimates
missing attribute values as a linear combination of
similar genes selected through k-nearest neighbors.
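To make the nearest-neighbour imputation idea underlying these schemes concrete, the following sketch (our own illustration in Python with NumPy, not code from the cited works; the names knn_impute and k are ours) fills each missing entry with a distance-weighted average of the corresponding attribute over the k most similar rows, in the spirit of the KNN and KNNI methods described above.

import numpy as np

def knn_impute(X, k=3):
    """Illustrative KNN-style imputation: replace each NaN by a
    distance-weighted average of that attribute over the k nearest
    rows, using the root-mean-square difference computed on the
    attributes observed in both rows."""
    X = np.asarray(X, dtype=float)
    X_filled = X.copy()
    n = len(X)
    for i in range(n):
        row = X[i]
        for j in np.where(np.isnan(row))[0]:
            # Candidate donors: other rows that observe attribute j.
            donors, dists = [], []
            for r in range(n):
                if r == i or np.isnan(X[r, j]):
                    continue
                shared = ~np.isnan(row) & ~np.isnan(X[r])
                if not shared.any():
                    continue
                donors.append(r)
                dists.append(np.sqrt(np.mean((row[shared] - X[r, shared]) ** 2)))
            if not donors:
                # No comparable donor: fall back to the column mean.
                X_filled[i, j] = np.nanmean(X[:, j])
                continue
            order = np.argsort(dists)[:k]
            weights = 1.0 / (np.array(dists)[order] + 1e-8)  # closer rows weigh more
            X_filled[i, j] = np.average(X[np.array(donors)[order], j], weights=weights)
    return X_filled

For instance, on the toy matrix [[1.0, 2.0], [nan, 2.1], [3.0, 4.0]] with k = 1, the missing entry is filled with 1.0, the value taken from the most similar complete row.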
Schneider (2001) introduced an algorithm based
on regularized Expectation-Maximization (EM) for
missing value prediction, utilizing Gaussian
distribution to parameterize data and iteratively
maximizing likelihoods. Oba et al. (2003) proposed
Bayesian PCA imputation (BPCAI), incorporating
Bayesian estimation into the approximation stage.
Honghai et al. (2005) presented SVM-based
imputation methods, utilizing Support Vector
Machines and Support Vector Regressors.
Clustering-based methods, such as those by Li et al.
(2004) and Liao et al. (2009), use techniques like
K-means and fuzzy K-means for imputation, often
incorporating sliding window mechanisms for data
stream handling. Neural network-based methods,
including Multi-Layer Perceptrons (MLP) (Sharpe
and Solly, 1995), Recurrent Neural Networks
(RNN) (Bengio and Gingras, 1995), and Auto
Associative Neural Networks (AANN) (Pyle, 1999),
have been employed for imputation, each with its own
approach and advantages. Amiri and Jensen (2016)
introduced fuzzy rough set-based nearest neighbor
algorithms for imputation, showing superior
performance compared to traditional methods. Pereira
et al. (2020) discuss the adaptability of autoencoders
in handling various types of missing data.
While clustering-based algorithms often exhibit
high computational complexity, those based on
nearest neighbors are preferred for their
computational efficiency. Intuitionistic Fuzzy (IF) set
theory, known for effectively handling vagueness and