lation before being inputted into the models. The
datasets were split into training, testing and validation data using k-fold cross-validation, a technique commonly used in machine learning for model selection (Mahjoobi and Adeli Mosabbeb, 2009). The number of folds chosen for experimentation was 10, as done by (Mahjoobi and Adeli Mosabbeb, 2009).
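As an illustration of this split, a minimal sketch using scikit-learn's KFold is shown below; the library, random seed and placeholder arrays are assumptions, since the implementation used is not stated here.

```python
# Hypothetical 10-fold split; scikit-learn's KFold and the placeholder arrays
# are assumptions for illustration only.
import numpy as np
from sklearn.model_selection import KFold

samples = np.random.rand(1000, 6)   # placeholder feature matrix
targets = np.random.rand(1000)      # placeholder target values

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(samples):
    X_train, X_test = samples[train_idx], samples[test_idx]
    y_train, y_test = targets[train_idx], targets[test_idx]
    # fit the model on (X_train, y_train) and evaluate on (X_test, y_test)
```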
3.3 Implementation
3.3.1 Baseline Interpolation Techniques
When dealing with gap-filling, one of the most common simple techniques used throughout the literature is interpolation, more specifically linear interpolation (Gauci et al., 2016; Ren
et al., 2018). The three interpolation techniques
considered in this research were: Bi-linear, Nearest
Neighbour and Inverse Distance Weighted (IDW). In
order to test how efficiently these techniques could fill
gaps, the notion of ‘bounding boxes’ was adopted.
When analysing the HFR data, it was discovered
that more data is available in the centre of the do-
main. Therefore, a central coordinate at Longitude: 14.679° East (30) and Latitude: 36.421° North (25) was chosen, and bounding boxes of sizes 3, 5, 7, 9, 11, 13, 15 and 21 were constructed around this central coordinate.
For each available time-step in the data, the U and
V matrices were processed separately. For each time-
step, the values within the bounding box were set to NaN values to create artificial gaps in the raster grid. The different interpolation methods were then applied to try to fill the values within the bounding box. Increasing bounding box sizes were used to test how accurately these statistical techniques could fill in missing data as the amount of neighbouring data is reduced.
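The procedure can be sketched as follows; scipy.interpolate.griddata is used here only as a stand-in for the bilinear and nearest-neighbour fills (the IDW step is omitted), and the grid, centre cell and box size are illustrative assumptions.

```python
# Sketch of the artificial-gap procedure: mask a k x k bounding box with NaN,
# then fill it from the surrounding valid cells. griddata stands in for the
# bilinear ('linear') and nearest-neighbour ('nearest') methods.
import numpy as np
from scipy.interpolate import griddata

def fill_bounding_box(grid, centre, k=5, method='linear'):
    gapped = grid.copy()
    half = k // 2
    r, c = centre
    gapped[r - half:r + half + 1, c - half:c + half + 1] = np.nan  # artificial gap
    rows, cols = np.indices(gapped.shape)
    known = ~np.isnan(gapped)
    return griddata((rows[known], cols[known]), gapped[known],
                    (rows, cols), method=method)

u_grid = np.random.rand(60, 60)                 # placeholder U-component raster
filled = fill_bounding_box(u_grid, centre=(25, 30), k=5)
```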
3.3.2 Machine-learning Architecture Overview
More complex, but efficient, approaches commonly used for gap-filling are machine learning models. As seen in Section 2.1, one of the most commonly used machine learning approaches for gap-filling is the FFNN model (Gauci et al., 2016;
Karimi et al., 2013; Pashova et al., 2013; Ren et al.,
2018; Vieira et al., 2020). Other commonly used ap-
proaches are LSTM models (Song et al., 2020; Wolff
et al., 2020) and RF models (Kim et al., 2020; Ren
et al., 2018; Wolff et al., 2020). Therefore, these three
different machine learning architectures were consid-
ered for gap-filling, in order to find the best perform-
ing model through hyper-parameter optimisation, as
stated in Objective 1.
Feed-Forward Neural Network Model - FFNN
models are a commonly used machine learning tech-
nique because of their simple structure. The models
trained for gap-filling were 3-layer FFNN models because these were commonly used in the literature (Gauci
et al., 2016; Pashova et al., 2013; Ren et al., 2018;
Vieira et al., 2020), and have been known to produce
satisfactory results. In this model, the sum carried out on a hidden neuron $a_j$ is calculated as follows: $a_j = \sum_i x_i w_{i,j} + b_{1,j}$, where $x_i$ represents the input values, $w_{i,j}$ represents the weight between the input and hidden layers, and $b_{1,j}$ represents the bias for the hidden layer (Vieira et al., 2020).
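For concreteness, the sum for a single hidden neuron can be traced with made-up numbers; the values below are purely illustrative.

```python
# Worked illustration of a_j = sum_i(x_i * w_{i,j}) + b_{1,j} with made-up values.
import numpy as np

x = np.array([0.2, 0.5, 0.1])      # input values x_i (e.g. look-back samples)
w_j = np.array([0.4, -0.3, 0.8])   # weights w_{i,j} into hidden neuron j
b_1j = 0.05                        # bias b_{1,j} of the hidden layer
a_j = np.dot(x, w_j) + b_1j        # 0.08 - 0.15 + 0.08 + 0.05 = 0.06
print(a_j)
```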
The size of the input layer was set according to
the number of look-back time-steps considered. The
look-back variable was initially set to 6, as done by
(Gauci et al., 2016; Pashova et al., 2013) who applied
FFNN models on oceanographic data for gap-filling.
The number of neurons $h_n$ used in the hidden layer was set to 15, and the number of epochs was set to 50, both determined through experimentation. The size of the output layer was set to 1 neuron. The ReLU activation function was applied to the input and hidden layers as done by (Wolff et al., 2020). The
model was compiled using the Adam optimiser with
a learning rate α = 0.001 as done by (Sahoo et al.,
2019), also determined through experimentation. Fi-
nally, the loss function chosen was the Mean Squared
Error loss as done by (Sahoo et al., 2019).
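A minimal sketch of this configuration is given below, assuming a Keras/TensorFlow implementation; the framework itself is an assumption, as it is not specified here.

```python
# Hypothetical Keras sketch of the 3-layer FFNN: 6 look-back inputs, 15 hidden
# neurons with ReLU, 1 output neuron, Adam (lr = 0.001) and MSE loss.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential([
    Dense(15, activation='relu', input_shape=(6,)),  # hidden layer, h_n = 15
    Dense(1)                                         # single output neuron
])
model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
# model.fit(X_train, y_train, epochs=50)             # 50 epochs, as stated above
```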
Long Short-Term Memory Network Model - Al-
though less commonly used for gap-filling, the LSTM
model has been found in the literature to perform better than the classic FFNN model. The LSTM network block is composed of different gates: the input $i_t$, output $o_t$ and forget $f_t$ gates, where $t$ represents the prediction period (Sahoo et al., 2019). One or more hidden layers were used with $h_{in}$ neurons in each layer $i$, as done by (Wolff et al., 2020). Similar to the FFNN
implementation, the look-back variable was used to
define the input layer size, initially set to 6, and the output layer size was set to 1 neuron. The size of the hidden
layer was set to 30 neurons and the number of epochs
used was set to 50, both determined through hyper-
parameter tuning. Finally, the model was compiled
using the same optimiser, loss function and activation
function applied to the FFNN model.
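Under the same assumptions (Keras/TensorFlow, one value per look-back step), the LSTM variant can be sketched as follows.

```python
# Hypothetical Keras sketch of the LSTM model: 6 look-back steps of one feature,
# a hidden layer of 30 units with ReLU, a single output neuron, Adam (lr = 0.001)
# and MSE loss; 50 training epochs as stated above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential([
    LSTM(30, activation='relu', input_shape=(6, 1)),  # hidden layer of 30 units
    Dense(1)                                          # single output neuron
])
model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
# model.fit(X_train, y_train, epochs=50)
```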
Random Forest Model - Although RF is not as commonly used in the literature for gap-filling as the FFNN and LSTM models, it has been used due to its simplicity and relatively good performance, as advised by (Kim et al., 2020; Wolff et al., 2020).
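A minimal sketch of the RF set-up, assuming scikit-learn's RandomForestRegressor and the same look-back feature construction, is shown below; the parameter values and data are illustrative placeholders, not the tuned settings.

```python
# Hypothetical scikit-learn sketch of the RF model; parameter values and the
# placeholder data are assumptions for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.random.rand(500, 6)   # placeholder look-back features
y_train = np.random.rand(500)      # placeholder target values

rf = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
# rf.predict(X_new) would fill a missing value from its look-back window
```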
Among the different parameters the RF model accepts, the ‘n_estimators’ and Boolean ‘bootstrap’