Huber Regression approach, whereas we consider a completely different regression approach, which we describe later.
In (Stuke et al., 2020), the authors assess three hyper-parameter selection methods: grid search, random search, and an efficient automated optimization technique based on Bayesian optimization. They found that, for a large number of hyper-parameters, Bayesian optimization and random search are more efficient than grid search. We also illustrate that our proposed hyper-parameter tuning approach is more efficient than a grid search.
Our contribution is an improvement on the parameter optimization method used for the regression algorithm presented in (Hosein, 2023). We demonstrate, through examples, the potential speedup in computation, thus increasing the usability of that regression algorithm.
3 PROBLEM DESCRIPTION
Let us first consider the case of a single feature with ordinal independent values (e.g., age). Denote the dependent value of sample $i$ by $y_i$ and the independent value by $x_i$. Suppose we need to predict the dependent value for a test sample with independent value $\hat{x}$. One predictor is the average of the dependent variable over all samples with independent value $\hat{x}$. However, there may be no such samples, or too few to obtain a robust prediction. We can instead include nearby samples in the prediction (i.e., aggregation), but how “big” should the neighbourhood be, and how much weight should we assign to our neighbours? We use the following approach: we take a weighted average of the dependent values over all samples, weighting each sample based on the distance between its independent value and that of the test sample. In particular, we use the following predictor.
\[
\hat{y}(\kappa) \equiv \frac{\sum_{s \in S} \frac{y_s}{(1+d_s)^{\kappa}}}{\sum_{s \in S} \frac{1}{(1+d_s)^{\kappa}}} \qquad (1)
\]
where $d_s = |\hat{x} - x_s|$, $S$ is the set of training samples, and $\kappa$ is a hyper-parameter. Note that if $\kappa = 0$ then the predictor is simply the average taken over the dependent values of all samples. If there are one or more samples such that $x_s = \hat{x}$, then as $\kappa$ goes to infinity the predictor tends to the average of these samples. The optimal $\kappa$ typically lies somewhere between these extremes. One can find the optimal value of $\kappa$ with a linear search, but that requires a significant amount of computation. We introduce an efficient approach to finding the optimal value of $\kappa$.
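To make the predictor concrete, below is a minimal sketch of Equation (1) for a single ordinal feature. It is not the implementation from (Hosein, 2023); the function and array names and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

def predict(x_hat, x_train, y_train, kappa):
    """Sketch of the weighted predictor in Equation (1).

    x_train, y_train: 1-D arrays of independent/dependent training values.
    kappa: non-negative hyper-parameter controlling how local the average is.
    """
    d = np.abs(x_hat - x_train)           # d_s = |x_hat - x_s|
    w = 1.0 / (1.0 + d) ** kappa          # weight assigned to each sample
    return np.sum(w * y_train) / np.sum(w)
```

With kappa = 0 every sample receives the same weight, so the prediction is the global mean of the dependent values; as kappa grows, samples whose independent value equals (or is near) x_hat dominate the average.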
Note that one can perform a similar computation in the case of categorical data. In this case, the distance between two samples is defined as the absolute difference between the average dependent values of the two samples' categories. In addition, the approach can be extended to multiple features; the distance between two samples is then the Euclidean distance computed from the single-feature distances (with some normalization). However, we again need to optimize over $\kappa$. We provide examples for this case as well. More details on the regression algorithm can be found in (Hosein, 2023).
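As a rough illustration of how the single-feature distances might be combined (the exact normalization used in (Hosein, 2023) is not specified here, so the range-based scaling and the helper names below are assumptions):

```python
import numpy as np

def categorical_distance(cat_a, cat_b, means):
    # Distance between two categories: absolute difference of the
    # average dependent values of the two categories.
    return abs(means[cat_a] - means[cat_b])

def combined_distance(sample_a, sample_b, feature_ranges, category_means):
    """Euclidean distance over normalized single-feature distances.

    category_means: dict mapping a categorical feature's index to a
                    {category: mean dependent value} dict.
    feature_ranges: per-feature scale used for a simple normalization.
    """
    per_feature = []
    for f, (a, b) in enumerate(zip(sample_a, sample_b)):
        if f in category_means:                      # categorical feature
            d = categorical_distance(a, b, category_means[f])
        else:                                        # ordinal feature
            d = abs(a - b)
        per_feature.append(d / feature_ranges[f])    # assumed normalization
    return float(np.linalg.norm(per_feature))
```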
4 AN ILLUSTRATIVE EXAMPLE
We determine the optimal $\kappa$ as follows. For a given $\kappa$, we use K-fold cross-validation to determine the resulting Normalized Mean Square Error (NMSE). This is the MSE divided by the MSE obtained if the predictor were simply the average of the dependent values over all samples in the training set. Hence we obtain a value of 1 at $\kappa = 0$. We then find the value of $\kappa$ that minimizes this NMSE.
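A minimal sketch of how $E(\kappa)$ could be evaluated, assuming the predict function sketched above and scikit-learn's KFold; the paper does not prescribe a particular library or number of folds.

```python
import numpy as np
from sklearn.model_selection import KFold

def nmse(kappa, x, y, n_splits=5):
    """K-fold estimate of E(kappa): MSE of the weighted predictor
    divided by the MSE of the training-set mean predictor."""
    mse_model, mse_baseline = 0.0, 0.0
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True).split(x):
        x_tr, y_tr = x[train_idx], y[train_idx]
        x_te, y_te = x[test_idx], y[test_idx]
        preds = np.array([predict(xi, x_tr, y_tr, kappa) for xi in x_te])
        mse_model += np.mean((y_te - preds) ** 2)
        mse_baseline += np.mean((y_te - np.mean(y_tr)) ** 2)
    return mse_model / mse_baseline
```

At kappa = 0 the weighted predictor reduces to the training-set mean, so this ratio is 1, matching the normalization described above.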
Let $E(\kappa)$ represent the Normalized Mean Square Error for a given parameter value $\kappa$. In practice, we have found that this function has the shape illustrated in Figure 1: it is convex to the left of the dashed line and concave and increasing to the right of it, and the minimum lies within the convex region. The proposed optimization approach can be summarised as follows. Starting with any three points on the curve, we determine the quadratic function passing through them. If this quadratic is convex, we find its minimum point and use it to replace the maximum of the previous three points. If, however, the quadratic is concave, then the minimum must lie to the left of the point with the smallest $\kappa$ value. In this case we replace the three points with (1) the point with the lowest $\kappa$ value, (2) the point at $\kappa = 0$, and (3) the point midway between these two. The quadratic function through these three points is guaranteed to be convex, so we can continue the process. This ensures that we gradually move into the convex region and, once there, converge to the minimum. Pseudo-code for this algorithm is provided in Algorithm 1.
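Algorithm 1 is not reproduced here, but the following sketch implements the search as described above, using the nmse function sketched earlier as $E(\kappa)$. The starting points, the tolerance, and the reading of "replace the maximum" as "replace the point with the largest error" are assumptions.

```python
import numpy as np

def find_optimal_kappa(E, k_init=(1.0, 2.0, 4.0), tol=1e-3, max_iter=50):
    """Quadratic-interpolation search for the kappa minimizing E(kappa).

    E: callable returning the normalized MSE for a given kappa,
       e.g. lambda k: nmse(k, x, y).
    """
    pts = [(k, E(k)) for k in k_init]
    for _ in range(max_iter):
        ks = np.array([p[0] for p in pts])
        es = np.array([p[1] for p in pts])
        a, b, _ = np.polyfit(ks, es, 2)        # fit E ~ a*k^2 + b*k + c
        if a > 0:                              # convex fit: jump to its vertex
            k_new = max(-b / (2.0 * a), 0.0)   # kappa cannot be negative
            if abs(k_new - ks[np.argmin(es)]) < tol:
                return k_new
            # replace the worst (largest-error) of the three points
            pts[int(np.argmax(es))] = (k_new, E(k_new))
        else:                                  # concave fit: minimum lies further left
            k_lo = float(np.min(ks))
            pts = [(k, E(k)) for k in (k_lo, 0.0, k_lo / 2.0)]
    # fall back to the best point seen if the tolerance was never met
    return min(pts, key=lambda p: p[1])[0]
```

For a single-feature data set one might call find_optimal_kappa(lambda k: nmse(k, x, y)); each iteration then costs only one or two NMSE evaluations rather than a full grid of them.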