1 FPS is enough for the algorithm to produce a stable output, so it is easily capable of running in real time on low-powered embedded devices.
4.5 Validation
The main objective is to validate performance by calculating the inference error of the model, that is, the distance in meters between the predicted position and the ground truth position, and relating it to the distance in meters of the vessel from the video source. A correlation between these two factors would be valuable for several reasons, ranging from the most obvious one, understanding how the model is performing, to the possibility of adding a prediction of the error to the pipeline. Being able to make such a prediction would be a significant improvement to the whole pipeline, from both a technical and a user-end perspective. In practice, alongside the predicted position of the target vessel, the output would be enriched with an approximation of the error.
This section explains how the distance of the vessel from the camera is estimated. The distance of the predicted point from the ground truth is also presented, ending with a brief explanation of possible error prediction methods. As mentioned in the Object Detection section, the validation has been performed on five different videos (Videos 13, 14, 15, 16 and 17), for a total of 713 data points.
4.5.1 Distance Estimation
In order to calculate the distance from the camera and the actual error of the predicted point, the Haversine formula (M, 2010) is used. The formula calculates the shortest distance between two points on a sphere from their latitudes and longitudes, and is expressed as follows:
d = 2r \arcsin\left( \sqrt{ \sin^2\left( \frac{\varphi_2 - \varphi_1}{2} \right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\left( \frac{\lambda_2 - \lambda_1}{2} \right) } \right)    (23)
where r is the radius of the Earth in meters, φ1 and φ2 are the latitudes of the two points, and λ1 and λ2 are their longitudes.
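A minimal sketch of this computation in Python is given below; the function name, the argument order and the mean Earth radius value are our own choices for illustration, not taken from the original implementation.

import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed value)

def haversine_distance(lat1, lon1, lat2, lon2):
    """Shortest distance in meters between two (latitude, longitude) points
    on a sphere, following Equation (23). Inputs are in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Example: two nearby points, roughly a few hundred meters apart.
print(haversine_distance(59.9100, 10.7400, 59.9120, 10.7450))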
4.5.2 Considerations about the Validation Set
The distance of the target vessel from the camera is approximately between 86 and 200 meters for 75% of the data points, with peaks of 2500 meters. One of the videos, however, represents a particular issue for the validation process, as its frames show the vessel from the back and at a close distance, a point of view which is unique in this dataset. A possible solution could consist of shrinking the testing set by including this video in the training process. Although this work aims at getting the best possible performance in terms of inference and generalization with a limited amount of data points, we found that the model is not able to generalize to that level in this constrained scenario.
4.5.3 Validation Results
As shown in Figures 8 and 9, the error made by the model is below 20 meters for 80% of the validation data points; the only outlying points are the ones belonging to Video 17, which reach large errors at high distances. Some of these points correspond to the model detecting another vessel, which misled the predictions. However, it can be noted that the model is able to predict positions around 500 and 1000 meters away with a small amount of error. Predictions on the testing dataset have their highest density well below 20 meters, even reaching a precision below a single meter at the highest distances (Figures 8 and 9). It is now possible to use all the extrapolated information to perform error prediction at the end of the pipeline. The first stage of the pipeline predicts the position of the vessel on the validation set. After that, the Haversine distance between the predicted position and the ground truth, as well as between the ground truth and the camera position, is computed. These two quantities feed the error prediction models and are respectively called prediction error and distance from camera. The experimental results presented include the usage of classic Machine Learning methods, namely Linear Regression and Support Vector Machines applied to regression (i.e., SVR), as well as a Deep Learning approach.
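As an illustration of how these two quantities could be derived (the coordinate arrays and variable names are placeholders, and haversine_distance refers to the sketch in Section 4.5.1, not to the original code):

import numpy as np

# Placeholder coordinate arrays, one entry per validation data point:
# predicted vessel position, ground truth position, and camera position.
pred_lat, pred_lon = np.array([59.9120]), np.array([10.7450])
gt_lat, gt_lon = np.array([59.9121]), np.array([10.7452])
cam_lat, cam_lon = np.array([59.9100]), np.array([10.7400])

# "prediction error": Haversine distance between prediction and ground truth.
prediction_error = np.array([
    haversine_distance(p_la, p_lo, g_la, g_lo)
    for p_la, p_lo, g_la, g_lo in zip(pred_lat, pred_lon, gt_lat, gt_lon)
])

# "distance from camera": Haversine distance between ground truth and camera.
distance_from_camera = np.array([
    haversine_distance(g_la, g_lo, c_la, c_lo)
    for g_la, g_lo, c_la, c_lo in zip(gt_lat, gt_lon, cam_lat, cam_lon)
])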
The first concern is, once again, the dataset. A plausible solution is to split the previous validation set by video, excluding Video 17. Including Video 17 would only add uncontrolled noise from the model's predicted distances to our dataset, without adding any value to the scope. The training set is now composed of 429 data points belonging to Videos 14, 15 and 16, while the testing set has 143 points from Video 13.
SVR adopts an approach similar to that of Support Vector Machines as Large Margin Classifiers (Cortes and Vapnik, 1995), using the kernel trick (Aizerman et al., 1964) to create nonlinear classifiers. The quality of the hyperplane is determined by the points falling inside the decision boundary.
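A minimal sketch of these two classic baselines, assuming the task regresses the prediction error on the distance from the camera and using scikit-learn with synthetic stand-ins for the 429 training and 143 testing points (hyperparameters are library defaults, not the original settings):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic stand-ins: 429 training points (Videos 14, 15, 16)
# and 143 testing points (Video 13), all in meters.
X_train = rng.uniform(86, 2500, size=(429, 1))            # distance from camera
y_train = 0.01 * X_train[:, 0] + rng.normal(0, 2, 429)    # prediction error
X_test = rng.uniform(86, 2500, size=(143, 1))
y_test = 0.01 * X_test[:, 0] + rng.normal(0, 2, 143)

for name, model in [("Linear Regression", LinearRegression()),
                    ("SVR (RBF kernel)", SVR(kernel="rbf"))]:
    model.fit(X_train, y_train)
    rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    print(f"{name}: RMSE = {rmse:.2f} m")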
The results presented in Table 4 show that a simple network, with only two hidden layers and trained for 5 epochs, is able to reach a Root Mean Squared Error between 4 and 5 meters, using a batch size of 32 data points. The Neural Network performance is also shown in Figure 9.
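For illustration, a comparably simple network could look as follows, keeping two hidden layers, 5 epochs and a batch size of 32 as reported; the layer widths, the optimizer and the use of Keras are our own assumptions, since the original architecture is not detailed here.

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
# Synthetic stand-ins for the real features and targets (meters).
X_train = rng.uniform(86, 2500, size=(429, 1)).astype("float32")
y_train = (0.01 * X_train[:, 0] + rng.normal(0, 2, 429)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),                    # single input: distance from camera
    tf.keras.layers.Dense(32, activation="relu"),  # hidden layer 1 (width assumed)
    tf.keras.layers.Dense(32, activation="relu"),  # hidden layer 2 (width assumed)
    tf.keras.layers.Dense(1),                      # output: predicted error in meters
])
model.compile(optimizer="adam", loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])
model.fit(X_train, y_train, epochs=5, batch_size=32)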
The SVR, using the Radial Basis Function kernel to approximate the nonlinear be-