sample. For a given test sample i, its prediction is the
weighted average of the target value over all training
samples. These weights are computed as
$\frac{1}{(1+d[i,j])^{\kappa_2}}$,
where d[i, j] is the Euclidean distance between the test
sample i and the training sample j.
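This weighted-average prediction rule can be sketched as follows (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def predict(X_train, y_train, x_test, kappa2):
    """Predict the target for one test sample as the weighted average of
    all training targets, with weights 1 / (1 + d[i, j])**kappa2."""
    # Euclidean distance from the test sample to every training sample
    d = np.linalg.norm(X_train - x_test, axis=1)
    w = 1.0 / (1.0 + d) ** kappa2
    return np.sum(w * y_train) / np.sum(w)

# Toy example: two features, four training samples
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([10.0, 12.0, 30.0, 32.0])
print(predict(X_train, y_train, np.array([1.5, 1.5]), kappa2=2.0))
```

Because the test point lies near the first two training samples, their targets (10 and 12) dominate the weighted average, while the distant samples contribute only slightly.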
Numerical features pose an interesting challenge
because they can have a wider (potentially infinite)
range of values. For example, consider the feature
of Age. Compared to Gender, this feature can easily
span 40 values (e.g., 18 to 58) instead of two. Thus,
for the same training set, the number of samples per
age will be low, which implies the means (i.e., $\mu_{\text{Age},20}$, $\mu_{\text{Age},21}$, $\mu_{\text{Age},30}$, etc.) used to compute distances (and hence the weights) may not be robust. Additionally,
we may encounter some unique values in the testing data set that are not in the training data set, or vice versa. Categorical variables generally do not encounter these issues because the number of samples per category value is sufficient. To solve this problem for numerical features, we impute a mean for each unique value in both the training and testing data sets.
These imputed means are calculated based on the
distances between the attribute value and all training
samples in the feature space. This distance assists the
algorithm in determining which training samples are
most relevant to the test point. Then the target val-
ues for these training samples are combined, using a
weighted average similar to before. The weights are
computed as $\frac{1}{(1+d_f[i,j])^{\kappa_1}}$, where $d_f[i,j]$ is the distance between the numerical feature values in test sample i and training sample j. For example, if f = Age, test sample i's age is 30, and training sample j's age is 40, then $d_{\text{Age}}[i,j] = |30 - 40| = 10$.
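The imputed mean for a given feature value can be sketched as follows (the helper name and toy data are illustrative assumptions; only the inverse-distance weighting with $\kappa_1$ mirrors the text):

```python
import numpy as np

def imputed_mean(feature_values, targets, v, kappa1):
    """Imputed mean target for feature value v: a weighted average of all
    training targets, weighted by 1 / (1 + |v - f_j|)**kappa1."""
    d = np.abs(feature_values - v)   # e.g., d_Age[i, j] = |30 - 40| = 10
    w = 1.0 / (1.0 + d) ** kappa1
    return np.sum(w * targets) / np.sum(w)

# Toy example: ages and targets of five training samples
ages = np.array([20.0, 25.0, 30.0, 40.0, 55.0])
y = np.array([100.0, 110.0, 120.0, 140.0, 170.0])
print(imputed_mean(ages, y, v=30.0, kappa1=2.0))
```

With an exact match at age 30 receiving weight 1, the imputed mean sits close to that sample's target (120), nudged slightly by the nearby samples.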
This modification means our proposed approach
uses two kappa values: one for the pre-processing (i.e., $\kappa_1$) and one for predicting (i.e., $\kappa_2$). We determine the optimal combination of kappa values $(\kappa_1, \kappa_2)$ that minimizes the error. According to (Hosein, 2022), as $\kappa_2$ increases, the error decreases up to a certain point and then increases beyond it; therefore, an optimal $\kappa_2$ can be found that minimizes the error.
We define a range of values for both the pre-
processing and predicting parts. The initial range
is determined through a trial and error process. We
observe the MAE and adjust these values if needed.
However, while the initial range of kappa values involves some trial and error, finding the optimal combination within that range is a systematic, data-driven grid search, which ensures that the model is robust. Since the
algorithm uses all the training data points in its pre-
diction, it will be robust for small data sets or where
there are not enough samples per category of a feature. The pseudocode in Figure 1 summarizes the steps of the proposed algorithm.
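The grid search over the two kappa values can be sketched as follows (the candidate ranges and the mock error surface are illustrative assumptions, not the paper's actual values):

```python
import numpy as np

def grid_search_kappas(evaluate_mae, kappa1_range, kappa2_range):
    """Exhaustively evaluate every (kappa1, kappa2) pair and return the
    combination with the lowest mean absolute error."""
    best = (None, None, np.inf)
    for k1 in kappa1_range:
        for k2 in kappa2_range:
            mae = evaluate_mae(k1, k2)
            if mae < best[2]:
                best = (k1, k2, mae)
    return best

# Toy error surface with a minimum at kappa1 = 2, kappa2 = 3
mock_mae = lambda k1, k2: (k1 - 2) ** 2 + (k2 - 3) ** 2
k1_best, k2_best, mae_best = grid_search_kappas(
    mock_mae,
    kappa1_range=np.arange(0.5, 5.5, 0.5),
    kappa2_range=np.arange(0.5, 5.5, 0.5))
print(k1_best, k2_best, mae_best)  # → 2.0 3.0 0.0
```

In practice, `evaluate_mae` would train and validate the model with the given kappa pair; the grid search itself is the simple exhaustive loop shown.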
4 NUMERICAL RESULTS
In this section, we describe the datasets that were used
and apply the various techniques in order to compare
their performances. A GitHub repository (Gooljar, 2023) containing the code used in this assessment has been created to facilitate replication and validation of the results by readers.
4.1 Data Set Description
The data sets used were sourced from the Univer-
sity of California at Irvine (UCI) Machine Learning
Repository. We used a wide variety of data sets to il-
lustrate the robustness of our approach. We removed
samples with any missing values and encoded the
categorical variables. No further pre-processing was
done so that the results can be easily replicated. Ta-
ble 1 shows a summary of the data sets used.
4.2 Feature Selection
There are various ways to perform feature selection
(Banerjee, 2020) but the best subset of features can
only be found by exhaustive search. However, this
method is computationally expensive so we select
the optimal subset of features for the Random For-
est model using Recursive Feature Elimination with
Cross-Validation (RFECV) (Brownlee, 2021) and use
these features for all other models. Note that for each
model, there may be a different optimal subset of fea-
tures and, in particular, this subset of features may not
be optimal for the proposed approach so it is not pro-
vided with any advantage. Table 2 shows the selected
attributes for each dataset. The optimal subset of fea-
tures was the full set of features for Auto, Energy Y2
and Iris datasets. The columns are indexed just as they
appear in the datasets from UCI.
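This selection step might look like the following sketch (the synthetic data stands in for a UCI dataset; only the RFECV-on-Random-Forest setup mirrors the text, and the hyperparameter choices are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Illustrative regression data standing in for a UCI dataset
X, y = make_regression(n_samples=200, n_features=8,
                       n_informative=4, random_state=0)

# Recursive Feature Elimination with Cross-Validation on a Random Forest
selector = RFECV(RandomForestRegressor(n_estimators=50, random_state=0), cv=3)
selector.fit(X, y)
print(selector.support_)  # boolean mask over the feature columns
```

The resulting boolean mask (`selector.support_`) indicates which feature columns survive elimination; those columns would then be used for all models, as described above.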
4.3 Performance Results
We show the performances of the different algorithms
Random Forest, Decision Tree, k-Nearest Neighbors,
XGBoost, and the proposed method. The models
are evaluated on seven datasets (Auto, Student Per-
formance, Energy Y2, Energy Y1, Iris, Concrete, and
Wine Quality). We used Mean Absolute Error (MAE)
to measure the performance since it is robust and easy
DATA 2023 - 12th International Conference on Data Science, Technology and Applications