1 FPS is enough for the algorithm to produce a stable output, so it is easily capable of running in real time on low-powered embedded devices.
4.5 Validation
The main objective is to validate performance by calculating the inference error of the model, that is, the distance in meters between the predicted position and the ground truth position, and relating it to the distance in meters of the vessel from the video source. A correlation between these two factors would be valuable for several reasons, ranging from the most obvious one, understanding how the model is performing, to the possibility of adding a prediction of the error to the pipeline. Being able to make such a prediction would be a significant improvement to the whole pipeline, from both a technical and a user-end perspective. In practice, alongside the predicted position of the target vessel, the output would be enriched with an approximation of the error.
This section explains how the distance of the vessel from the camera is estimated. The distance of the predicted point from the ground truth is also presented, ending with a brief explanation of possible error prediction methods. As mentioned in the Object Detection section, the validation has been performed on five different videos (Videos 13, 14, 15, 16 and 17), for a total of 713 data points.
4.5.1 Distance Estimation
In order to calculate the distance from the camera and the actual error of the predicted point, the Haversine formula (M, 2010) is used. The formula calculates the shortest distance between two points on a sphere from their latitudes and longitudes, and is expressed as follows:
d = 2r \arcsin\left( \sqrt{ \sin^2\left( \frac{\varphi_2 - \varphi_1}{2} \right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\left( \frac{\lambda_2 - \lambda_1}{2} \right) } \right)    (23)
where r is the radius of the Earth in meters, φ1 and φ2 are the latitudes of the two points, and λ1 and λ2 are their longitudes.
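A minimal sketch of this computation in Python is given below; the function name, the argument order and the mean Earth radius value are our own choices for illustration, not taken from the original implementation.

import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed value)

def haversine_distance(lat1, lon1, lat2, lon2):
    """Shortest distance in meters between two (latitude, longitude) points
    on a sphere, following Equation (23). Inputs are in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Example: two nearby points, roughly a few hundred meters apart.
print(haversine_distance(59.9100, 10.7400, 59.9120, 10.7450))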
4.5.2 Considerations about the Validation Set
The distance of the target vessel from the camera is approximately between 86 and 200 meters for 75% of the data points, with peaks of 2500 meters. One of the videos, however, represents a particular issue for the validation process, as its frames show the vessel from the back and at a close distance, a point of view which is unique in this dataset. A possible solution could consist of shrinking the testing set by including this video in the training process. Although this work aims at getting the best possible performance in terms of inference and generalization with a limited amount of data points, we found that the model is not able to generalize to that level in this constrained scenario.
4.5.3 Validation Results
As shown in Figures 8 and 9, the error made by the model is below 20 meters for 80% of the validation data points; the only outlying points are the ones belonging to Video 17, which reach large errors at high distances. Some of these points correspond to the model detecting another vessel, which misled the predictions. However, it can be noted that the model is able to predict positions around 500 and 1000 meters away with a small amount of error. Predictions on the testing dataset have their highest density well below 20 meters, even reaching a precision below a single meter at the highest distances (Figures 8 and 9). It is now possible to use all the extrapolated information to perform error prediction at the end of the pipeline. The first stage of the pipeline predicts the position of the vessel on the validation set. After that, the Haversine distance between the predicted position and the ground truth, as well as between the ground truth and the camera position, is computed. These two quantities feed the error prediction models and are respectively called prediction error and distance from camera. The experimental results presented include the usage of classic Machine Learning methods, namely Linear Regression and Support Vector Machines applied to regression (i.e., SVR), as well as a Deep Learning approach.
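As an illustration of how these two quantities could be derived (the coordinate arrays and variable names are placeholders, and haversine_distance refers to the sketch in Section 4.5.1, not to the original code):

import numpy as np

# Placeholder coordinate arrays, one entry per validation data point:
# predicted vessel position, ground truth position, and camera position.
pred_lat, pred_lon = np.array([59.9120]), np.array([10.7450])
gt_lat, gt_lon = np.array([59.9121]), np.array([10.7452])
cam_lat, cam_lon = np.array([59.9100]), np.array([10.7400])

# "prediction error": Haversine distance between prediction and ground truth.
prediction_error = np.array([
    haversine_distance(p_la, p_lo, g_la, g_lo)
    for p_la, p_lo, g_la, g_lo in zip(pred_lat, pred_lon, gt_lat, gt_lon)
])

# "distance from camera": Haversine distance between ground truth and camera.
distance_from_camera = np.array([
    haversine_distance(g_la, g_lo, c_la, c_lo)
    for g_la, g_lo, c_la, c_lo in zip(gt_lat, gt_lon, cam_lat, cam_lon)
])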
The first concern is, once again, the dataset. A plausible solution is to split the previous validation set by video, excluding Video 17. Including Video 17 would only add uncontrolled noise from the model's predicted distances to our dataset, without adding any value to the scope. The training set is now composed of 429 data points belonging to Videos 14, 15 and 16, while the testing set has 143 points from Video 13.
SVR adopts an approach similar to that of Support Vector Machines as Large Margin Classifiers (Cortes and Vapnik, 1995), using the kernel trick (Aizerman et al., 1964) to create nonlinear classifiers. The quality of the hyperplane is determined by the points falling inside the decision boundary.
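A minimal sketch of these two classic baselines, assuming the task regresses the prediction error on the distance from the camera and using scikit-learn with synthetic stand-ins for the 429 training and 143 testing points (hyperparameters are library defaults, not the original settings):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic stand-ins: 429 training points (Videos 14, 15, 16)
# and 143 testing points (Video 13), all in meters.
X_train = rng.uniform(86, 2500, size=(429, 1))            # distance from camera
y_train = 0.01 * X_train[:, 0] + rng.normal(0, 2, 429)    # prediction error
X_test = rng.uniform(86, 2500, size=(143, 1))
y_test = 0.01 * X_test[:, 0] + rng.normal(0, 2, 143)

for name, model in [("Linear Regression", LinearRegression()),
                    ("SVR (RBF kernel)", SVR(kernel="rbf"))]:
    model.fit(X_train, y_train)
    rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    print(f"{name}: RMSE = {rmse:.2f} m")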
The results presented in Table 4 show that a simple network, with only two hidden layers and trained for 5 epochs, is able to reach a Root Mean Squared Error between 4 and 5 meters, using a batch size of 32 data points. The Neural Network performance is also shown in Figure 9.
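For illustration, a comparably simple network could look as follows, keeping two hidden layers, 5 epochs and a batch size of 32 as reported; the layer widths, the optimizer and the use of Keras are our own assumptions, since the original architecture is not detailed here.

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
# Synthetic stand-ins for the real features and targets (meters).
X_train = rng.uniform(86, 2500, size=(429, 1)).astype("float32")
y_train = (0.01 * X_train[:, 0] + rng.normal(0, 2, 429)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),                    # single input: distance from camera
    tf.keras.layers.Dense(32, activation="relu"),  # hidden layer 1 (width assumed)
    tf.keras.layers.Dense(32, activation="relu"),  # hidden layer 2 (width assumed)
    tf.keras.layers.Dense(1),                      # output: predicted error in meters
])
model.compile(optimizer="adam", loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])
model.fit(X_train, y_train, epochs=5, batch_size=32)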
The SVR, using the Radial Basis Function kernel to approximate the nonlinear be-