On-Board Estimation of Vehicle Speed and the Need of Braking
Using Convolutional Neural Networks
Razvan Itu (https://orcid.org/0000-0001-8156-7313) and Radu Danescu (https://orcid.org/0000-0002-4515-8114)
Computer Science Department, Technical University of Cluj-Napoca, Memorandumului 28, Cluj-Napoca, Romania
Keywords: Autonomous Vehicles, Self-Driving Vehicles, Convolutional Neural Network, Velocity Prediction,
Vehicle Speed, Recurrent Neural Network, Long Short-Term Memory, Vehicle State Estimation, Visual
Odometry.
Abstract: Detecting the ego-vehicle state is a challenging problem in the context of autonomous vehicles. Perception-
based methods leverage information from on-board cameras and sensors to determine the surrounding traffic
scene and vehicle state. Monocular based approaches are becoming more popular for driver assistance, and
accurate vehicle speed prediction plays an important role for improving road safety. This research paper
presents an implementation of a Convolutional Neural Network (CNN) model for vehicle velocity prediction
using sequential image input, as well as an extended model that also features sensorial data as input. The CNN
model is trained on a dataset featuring sets of 20 sequential images, captured from a moving car in a road
traffic scene. The aim of the model is to predict the current vehicle speed based on the information encoded
in the previous 20 frames. The model architecture consists of convolutional layers followed by fully connected
layers, with a linear output layer for the ego-vehicle velocity prediction. We evaluate our proposed models
and compare them against existing published work that features Recurrent Neural Networks (RNNs). We also
examine the prediction of the brake pedal pressure required while driving.
1 INTRODUCTION
The field of autonomous driving has become a
thriving area of research for scientists and leading
manufacturers such as Tesla, Waymo, and Baidu.
This increased interest is driven by the remarkable
progress made in computer vision, particularly in
deep learning. These recent advancements have paved the way for significant breakthroughs in autonomous driving technologies.
Developing perception-based policies to facilitate
complex autonomous behaviours, such as driving,
remains an ongoing challenge in the fields of
computer vision and machine learning. The recent
advancements in deep learning techniques for visual
perception tasks have sparked considerable interest in
exploring their effectiveness for learning driver
intentions and actions to provide better road traffic
safety.
Accurate vehicle speed prediction plays a crucial role in various applications in autonomous driving,
traffic monitoring, and driver assistance systems. According to the National Highway Traffic Safety Administration (NHTSA), human error remains the main factor behind road accidents, and speeding is one of the main causes of fatal accidents (NHTSA, 2023). Determining the vehicle speed from on-board images can also be used for video forensics, especially when working with uncalibrated video data. Accurate prediction models enable us to gain a deeper understanding of the underlying factors and motivations driving the decision-making process of drivers.
This research focuses on developing a CNN-
based model that leverages sequential image input to
predict the current vehicle velocity (speed) and the
need for braking during driving. By analysing a
sequence of images, captured from a moving car in a
road traffic scene, the model aims to estimate the
velocity based on the temporal information encoded
in the previous frames. We use only the front image
input from the ego-vehicle, as it provides the most
accurate representation of the visual input that the
driver also has of the scene. We then extend our
model to feature sensorial input (velocity data
expressed in km/h) as well. We compare our approach
with a method based on recurrent neural networks (RNNs), more specifically Long Short-Term Memory (LSTM) networks, which uses multiple inputs to produce a prediction. We have also tested a model that estimates the brake pedal pressure needed in road traffic scenarios, and the results are promising.
2 RELATED WORK
We analyse related work that features visual input (images captured from the scene) and sensorial input or measurements for tasks related to self-driving vehicles.
The authors of (Xue, 2018) propose an approach that combines a Convolutional Neural Network (CNN) and Long Short-Term Memory (Hochreiter, 1997) to predict the future positions of pedestrians in
video sequences captured by static cameras. The
method utilizes an occupancy grid, along with image
inputs and observed trajectory (x, y pedestrian
coordinates), to predict the forthcoming pedestrian
position data in the image frame or scene.
The work of (Xu, 2017) proposes an approach that leverages a Fully Convolutional Network (FCN) to extract road scene image data and combines it with an LSTM to encode the current sensorial data of the ego-vehicle, including speed and angular velocity. The
objective is to predict the next action that the ego-
vehicle should undertake. The network's output
represents the recommended current action, which
can be one of the following: go straight, stop, turn left,
or turn right.
The work of (Codevilla, 2018) also presents an
approach to predict the appropriate action for the
driver using both scene image and sensorial data. The
network takes as input the image from the road traffic
scene, along with measurement data such as ego-
vehicle speed, and command data including ego-
vehicle steering angle and acceleration. The neural
network's objective is to predict the recommended
action for the ego-vehicle, which can be one of the
following: continue (follow the road), left (turn left at
intersection), straight (go straight at intersection), or
right (turn right at intersection). This paper makes use
of CARLA (Dosovitskiy, 2017), an open-source
urban driving simulator that can provide a realistic
environment for training and evaluating autonomous
driving related algorithms.
In (Gu, 2020), the authors present an LSTM-based
solution for predicting future driver behaviour. The
approach incorporates a CNN to represent the image
frames and a recurrent neural network (LSTM) to
encode twelve features extracted from the frames,
including ego-vehicle velocity on three axes, distance
from the ego-vehicle to the front car (dx, dy), front
car velocity (three axes), front car acceleration (three
axes), and the number of existing front vehicles. The
input video sequence consists of ten frames (ten
images), and the model predicts the future values of
ego-vehicle acceleration on three axes. The approach
employs a pretrained ResNet (He, 2016) for feature
extraction from the input frames.
In (Ding, 2022), the authors focus on speed-
control forecasting for autonomous or self-driving
vehicles. The approach utilizes positional data of
objects in the scene, which are processed through a
CNN and LSTM to extract the ego-vehicle speed. The
network incorporates Mask R-CNN (He, 2017) with
a pretrained ResNet-50 backbone to extract object
information, including bounding box coordinates.
The proposed method constructs object-related
graphs based on object proximity, where graph edges
connect pairs of objects in the scene. Subsequently,
graph convolution is applied to extract local spatial
relations. The results are then fed into an LSTM and
a multi-layer perceptron to obtain the ego-vehicle
speed control data.
Most related work combines imagery data from the scene with sensorial data in order to produce current or future predictions, usually with recurrent neural networks (mostly LSTMs). These methods typically rely on complex deep learning architectures that require large amounts of training data, extensive pre-processing, and additional sensorial inputs. Such neural networks are difficult to deploy on hardware systems with limited resources, which is required if they are to be integrated in the vehicle.
In this paper, we propose making use of only
imagery data in order to predict the current state of
the vehicle (in this case the ego-vehicle velocity),
based on previous “N” frames (image sequence). We
choose N = 20 to represent the image input sequence.
We then extend our initial solution to include the
velocity as input along with the image data. Our work
is similar to (Bojarski, 2016), where the authors make
use of an end-to-end learning approach for a self-
driving car using artificial neural networks. That previously published paper explores using CNNs to learn driving behaviours directly from visual input and predicts the ego-vehicle steering angle from a single image of the road traffic scene. We extend this idea to
also make use of sensorial data as input, using CNN-only models.
Our approach uses a publicly available large-scale video dataset and can be extended to also use different input data sources, as presented in the results section.
3 SOLUTION OVERVIEW
We implement two CNN models to predict the ego-vehicle speed. The first model takes a sequence of images as input, whereas the second also receives the associated vehicle speed for each image. To compare our results, we have also implemented a Recurrent Neural Network based solution; this third model takes images and the associated vehicle speeds as input.
3.1 Proposed CNN Model: Image Input
(V1)
The initial CNN model architecture proposed in this
paper features several convolutional layers, followed
by fully connected layers. This model is able to predict the current velocity using only the relevant features extracted from the pixel data. An overview of the solution is presented in figure 1.
Figure 1: The proposed CNN model (V1) to predict the
current vehicle speed based on an input sequence of 20 past
frames (front-view images of the road traffic scene).
The model takes as input a sequence of 20 sequential
images, each with a dimension of 300x300 pixels.
The convolutional layers use different filter sizes and
strides to extract spatial features from the images,
progressively increasing the number of feature maps.
Rectified Linear Unit (ReLU) activation functions are
used to introduce non-linearity and enhance the
model's ability to learn complex patterns. A Dropout
layer is incorporated to mitigate overfitting. The
output of the convolutional layers is flattened and fed
into a stack of fully connected layers, which further
capture higher-level representations of the input data.
The final layer uses a linear activation to predict the current vehicle speed. The model has a total of ~5.9 million trainable parameters, which makes it easy to train and also portable to other hardware (such as the Nvidia Jetson platform).
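A minimal Keras sketch of this architecture is given below. The concrete layer sizes, strides, and the stacking of the 20 frames along the channel axis are assumptions for illustration only; the text above fixes only the overall structure (convolutions with ReLU, dropout, flatten, fully connected layers, and a linear output).

from tensorflow.keras import layers, models

N_FRAMES, IMG_H, IMG_W = 20, 300, 300     # 20 past frames of 300x300 pixels

def build_v1():
    # 20 RGB frames stacked along the channel axis (an assumption, not stated above)
    inputs = layers.Input(shape=(IMG_H, IMG_W, N_FRAMES * 3))
    x = layers.Conv2D(24, 5, strides=2, activation="relu")(inputs)
    x = layers.Conv2D(36, 5, strides=2, activation="relu")(x)
    x = layers.Conv2D(48, 5, strides=2, activation="relu")(x)
    x = layers.Conv2D(64, 3, strides=2, activation="relu")(x)
    x = layers.Dropout(0.5)(x)                     # mitigates overfitting
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Dense(16, activation="relu")(x)
    out = layers.Dense(1, activation="linear")(x)  # current speed in km/h
    return models.Model(inputs, out, name="v1_speed_cnn")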
3.2 Extended CNN Model: Image and
Velocity Input (V2)
The initially proposed model (V1) was trained with the dataset presented in the next section. We then freeze the weights of the convolutional layers and the flatten layer, and extend the CNN to also feature a measurement input, in this case the past 20 velocity values along with the 20 images.
Therefore, the extended network (V2) has the
following structure (figure 2).
Figure 2: The extended CNN model (V2) that uses pairs of
20 images and 20 velocity measurements as input to predict
the current velocity.
The flatten layer output is then concatenated with the sensorial data input (reshaped to the same data ordering/dimensionality), followed by the same 4 Dense layers as in the initially proposed model. Finally, we retrain the extended CNN (V2) and evaluate it.
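The following is a hedged sketch of how such an extension can be built on top of the trained V1 model with the Keras functional API: the convolutional trunk (up to and including the flatten layer) is frozen, and its output is concatenated with the 20 past velocity values before the dense layers. The layer sizes repeat the assumptions of the V1 sketch above.

from tensorflow.keras import layers, models

def build_v2(v1_model):
    # Reuse the trained V1 trunk up to and including its Flatten layer, frozen.
    flatten = next(l for l in v1_model.layers if isinstance(l, layers.Flatten))
    trunk = models.Model(v1_model.input, flatten.output)
    trunk.trainable = False

    img_in = layers.Input(shape=v1_model.input_shape[1:])   # 20-frame image stack
    vel_in = layers.Input(shape=(20,))                       # past 20 velocities (km/h)

    features = trunk(img_in)
    x = layers.Concatenate()([features, vel_in])             # image features + sensorial input
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Dense(16, activation="relu")(x)
    out = layers.Dense(1, activation="linear")(x)
    return models.Model([img_in, vel_in], out, name="v2_speed_cnn")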
3.3 Recurrent Neural Network Based
Model (V3)
Based on previously published work, we have implemented our own version of a CNN+LSTM model, following (Xue, 2018). Our implementation takes as input a sequence of 20 images of the scene, along with the 20 corresponding ego-vehicle speed values. An overview of this approach is depicted in figure 3.
The layer architecture of this model is similar to the one published by the authors of (Xue, 2018). We change the input to 20 images and predict a single measurement.
Figure 3: An implementation of a CNN+LSTM network to
predict the velocity data using pairs of images and
velocities as input.
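For reference, a rough sketch of a CNN+LSTM model of this kind is shown below. It only mirrors the overall structure described here (per-frame CNN features and past velocities, each encoded by an LSTM and then fused into a single speed prediction); the actual layer configuration of (Xue, 2018) differs and is not reproduced.

from tensorflow.keras import layers, models

def build_v3(n_frames=20, img_h=300, img_w=300):
    img_seq = layers.Input(shape=(n_frames, img_h, img_w, 3))  # 20 RGB frames
    vel_seq = layers.Input(shape=(n_frames, 1))                # 20 past speeds (km/h)

    # Small per-frame convolutional encoder, applied to every frame.
    frame_encoder = models.Sequential([
        layers.Input(shape=(img_h, img_w, 3)),
        layers.Conv2D(16, 5, strides=2, activation="relu"),
        layers.Conv2D(32, 5, strides=2, activation="relu"),
        layers.GlobalAveragePooling2D(),
    ])
    img_feat = layers.TimeDistributed(frame_encoder)(img_seq)

    img_enc = layers.LSTM(128)(img_feat)   # temporal encoding of image features
    vel_enc = layers.LSTM(32)(vel_seq)     # temporal encoding of past speeds

    x = layers.Concatenate()([img_enc, vel_enc])
    out = layers.Dense(1, activation="linear")(x)  # predicted current speed (km/h)
    return models.Model([img_seq, vel_seq], out, name="v3_cnn_lstm")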
3.4 Dataset
We trained on the Honda Research Institute Driving Dataset (HDD) (Ramanishka, 2018), which consists of 104 hours of driving data in road traffic scenes from the San Francisco area. The dataset introduces supplementary annotations covering various driver behaviours observed in driving scenes, surpassing the limited focus on turning, going straight, and lane changes present in existing datasets.
Additionally, the dataset captures Controller Area
Network (CAN) signals to depict driver behaviours in
diverse scenarios, particularly interactions with other
participants in traffic.
The dataset features the following sensor types: camera, Lidar, GPS, IMU and CAN bus signals, in various scenarios such as suburban, urban and highway driving. The imagery data is encoded as colour images with a size of 1280x720 pixels, captured at 30 Hz. In the data we received, we found that some frames are missing, and therefore not all sequences run at 30 fps. The dataset contains 137
trips, out of which we selected 7 random trips to be
excluded from the training process (to be used for
evaluation). The remaining 130 trips are split into
training and testing/validation sets (80/20% split).
The 137 individual trips from the dataset feature
various road traffic scenes, captured at different
moments of the day and in various weather
conditions. From these trips we have extracted
sequences of N = 20 images, along with the provided
ego-vehicle speed for each image, using a sliding
window with an overlap of 5. We were able to obtain a total of over 14,000 continuous sequences of 20 images paired with 20 velocity values ([T-21 : T-1]), as well as the corresponding current velocity data (T0).
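A minimal sketch of this sliding-window extraction is given below. Interpreting the overlap of 5 as five shared frames between consecutive windows (a step of 15 frames) is an assumption; the step can be adjusted if a different overlap is intended.

def make_sequences(frames, speeds, window=20, step=15):
    # frames: list of images for one trip; speeds: list of km/h values (same length).
    # Returns (past_frames, past_speeds, target_speed) samples.
    samples = []
    for start in range(0, len(frames) - window, step):
        end = start + window
        samples.append((frames[start:end],   # 20 past images
                        speeds[start:end],   # 20 past velocity values
                        speeds[end]))        # current velocity to be predicted (T0)
    return samples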
3.5 Implementation
The experimental setup features a desktop computer,
equipped with an Intel i7 CPU and two Nvidia 1080 Ti GPUs with a total of 22 GB of VRAM, which are used during the training process of the neural network. The software development is based on
TensorFlow and Keras (Chollet, 2015) for the neural
network, and OpenCV and Matplotlib for
visualisation and video generation of the results.
4 EXPERIMENTAL RESULTS
4.1 Training the CNNs
All of the proposed CNN models for predicting
vehicle velocity are trained using the Mean Squared
Error (MSE) loss function and optimized using the
Adam optimizer with an initial learning rate of 0.001, which is decreased if the loss function does not improve. The training process aims to minimize the difference between the predicted and ground truth vehicle speeds, which are expressed in km/h.
Figure 4: The loss function values over 50 epochs for V1.
We monitor the loss and stop training early if it does not improve for 10 epochs; with this setup we obtained a fully trained model after 53 epochs. Each epoch takes around 18 minutes to train, resulting in a total training time of ~16 hours on our hardware setup. The extended model (V2) features a similar number of trainable parameters (5.909 million versus 5.907 million for the initial model), and the training times are almost identical. The CNN+LSTM based model (V3) features a total of ~26 million parameters, and since it uses the same dataset, its total training time is broadly similar.
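The training setup can be sketched as follows. The MSE loss, the Adam optimizer with an initial learning rate of 0.001, and the early-stopping patience of 10 epochs are taken from the description above; the learning-rate reduction parameters and the dataset objects (train_ds, val_ds) are placeholders.

from tensorflow.keras import callbacks, optimizers

model = build_v1()
model.compile(optimizer=optimizers.Adam(learning_rate=0.001), loss="mse")

model.fit(
    train_ds,                      # placeholder: (image sequence, speed) training data
    validation_data=val_ds,        # placeholder: validation split
    epochs=100,
    callbacks=[
        callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
        callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                restore_best_weights=True),
    ],
)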
4.2 Initial Evaluation and Optimization
The accuracy of all of the proposed models is evaluated using standard regression performance metrics: R-squared, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).
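These metrics can be computed, for example, with scikit-learn; the short sketch below assumes the ground truth and predictions are given as arrays of speeds in km/h.

import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def evaluate(y_true, y_pred):
    return {
        "R-squared": r2_score(y_true, y_pred),
        "MAE (km/h)": mean_absolute_error(y_true, y_pred),
        "RMSE (km/h)": float(np.sqrt(mean_squared_error(y_true, y_pred))),
    }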
We tried to determine the impact of selecting
different regions of interest within an image on the
performance of a CNN-based regression model. The
selected regions include the centre, left, and right
portions of the input image, with the normal-sized
input image serving as the baseline. Figure 5
illustrates the selected regions of interest (ROIs).
Figure 5: Selecting the regions of interest from the original
input image.
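The cropping itself can be illustrated as follows; the exact crop fractions for the centre, left, and right regions are assumptions, as they are not specified above.

import cv2

def crop_roi(image, roi):
    # roi: "normal", "centre", "left" or "right"; crop fractions are assumed.
    h, w = image.shape[:2]
    if roi == "centre":
        crop = image[:, w // 4 : 3 * w // 4]
    elif roi == "left":
        crop = image[:, : w // 2]
    elif roi == "right":
        crop = image[:, w // 2 :]
    else:                         # "normal": the full image
        crop = image
    return cv2.resize(crop, (300, 300))   # network input size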
Table 1 presents the evaluation results of the initially proposed model (V1), compared with the results obtained from the region-of-interest models. We first trained on a small subset of 25 trips out of the 137 trips available in the HDD dataset, to determine the best performing input image type (normal or ROI). We then retrained the network (V1) on the entire training set (130 trips) using the best input image type.
Table 1: Evaluation of the initial proposed CNN model (V1) for predicting the velocity (km/h).

Model image input       | R-squared | MAE (km/h) | RMSE (km/h)
Normal (fully trained)  | 0.825     | 4.334      | 6.085
Normal                  | 0.448     | 8.335      | 10.831
ROI Centre              | 0.333     | 8.672      | 11.91
ROI Left                | 0.335     | 9.011      | 11.888
ROI Right               | 0.243     | 9.302      | 12.689
The analysis revealed that the normal-sized input
image (no ROI applied) provided the best overall
performance in terms of R-squared, MAE, and
RMSE. Moreover, the centre region demonstrated
relatively better performance compared to the left and
right regions, indicating that the model relied heavily
on the central information for accurate predictions.
Figure 6 contains the prediction versus the ground truth of the ego-vehicle speed on a subset of frames from the trip with id "201703081617" from the HDD dataset.
Figure 6: Ground truth and prediction visual representation
on a trip from the validation set for model V1.
Figure 7 presents some of our results on the evaluation test set, where the ground truth is displayed in blue, whereas the prediction is coloured in red.
Figure 7: Ground truth and prediction visual representation
on a trip from the validation set for model V1.
A video of the velocity prediction results, from the
initial model (V1) using only images as input, is
available here: https://vimeo.com/832355875.
The results from the extended model (V2) can be
seen here: https://vimeo.com/832353512, whereas
the V3 model (CNN+LSTM) predictions video is
here: https://vimeo.com/832438597.
4.3 Evaluation of the Proposed Models
We have compared our proposed models (V1 - initial
and V2 - extended) with the CNN+LSTM based
method (V3). The results are presented in table 2.
Having the velocity as input to a Recurrent Neural Network, combined with the extracted road scene data, allows the model to predict the current ego-vehicle velocity (speed) more accurately. The downside is that the existing published model (V3) features more trainable parameters, which directly increases the prediction time (as can be seen in table 2).
Table 2: Comparison between CNN+LSTM and CNN (initial and extended model) for velocity prediction (km/h).

CNN Model                                      | R-squared | MAE (km/h) | RMSE (km/h) | Prediction time (ms) | Model parameters
SS-LSTM based implementation (Xue, 2018) (V3)  | 0.99      | 0.52       | 0.76        | 18.64                | ~26 million
Proposed model (V1)                            | 0.82      | 4.33       | 6.08        | 3.1                  | ~5.9 million
Proposed extended model (V2)                   | 0.96      | 1.93       | 2.66        | 3.1                  | ~5.9 million
From our experiments, we have found that the LSTM-based implementation (V3), although it produces more accurate evaluation results, is actually heavily dependent on the past velocity input data. This can be shown simply by feeding random vehicle speed inputs to the network, as can be seen in figure 8:
Figure 8: Wrong ego-vehicle speed predictions (from the
CNN+LSTM model) when the ego-vehicle data is
corrupt/erroneous.
If we provide a constant past ego-vehicle speed as
input along with the images, the predicted output of
the CNN+LSTM model (V3) is also constant:
Figure 9: Wrong ego-vehicle speed predictions (from the
CNN+LSTM model) when the ego-vehicle data is constant
throughout the entire sequence/trip.
Our initial model (V1), even though less accurate on the standard evaluation metrics, proves to be more robust, because it is capable of correctly extracting the relevant features from the visual image sequence to predict the current speed:
Figure 10: Prediction of ego-vehicle current speed from our
proposed model (V1 - using only image sequence input).
Furthermore, in order to compare our extended model (V2) under the same test scenario, we have simulated a sensor failure, meaning that we feed the CNN "-1" values for the velocity input of the model. Figure 11 illustrates the result of this test scenario.
Figure 11: Prediction of the ego-vehicle speed from the
extended model (V2 - using image and past velocity input)
with sensor failure simulated (0km/h past velocity fed as
input into the model).
In order to validate the robustness of our solution,
we have performed multiple tests with simulated
sensor failure. The results are presented in table 3.
Table 3: Velocity (km/h) prediction evaluation of the V2 model using different simulated sensor failure rates.

Sensor failure rate | R-squared | MAE (km/h) | RMSE (km/h)
0%                  | 0.966     | 1.93       | 2.66
10%                 | 0.843     | 2.87       | 5.77
20%                 | 0.770     | 3.51       | 6.98
30%                 | 0.764     | 3.79       | 7.07
50%                 | 0.765     | 4.24       | 7.06
80%                 | 0.694     | 4.90       | 8.06
100%                | 0.796     | 4.67       | 6.57
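The failure simulation can be sketched as follows: a given fraction of the past velocity inputs is replaced with -1 km/h before being fed to the V2 model. The random masking strategy is an assumption about how the failed readings are distributed.

import numpy as np

def corrupt_velocities(velocities, failure_rate, rng=None):
    # velocities: array of shape (num_samples, 20); failure_rate in [0, 1].
    if rng is None:
        rng = np.random.default_rng(0)
    corrupted = np.array(velocities, dtype=float, copy=True)
    mask = rng.random(corrupted.shape) < failure_rate
    corrupted[mask] = -1.0        # failed readings reported as -1 km/h
    return corrupted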
The extended model (V2) is able to rely on the pixel data to extract the relevant features and accurately predict the ego-vehicle velocity even under simulated velocity input failure (80% and 100% of the input
data is -1km/h). A video with the prediction results on
the 80% fail rate can be seen here:
https://vimeo.com/832352987.
We have also tested our models on a different
dataset (Nedevschi, 2004), with images and data
acquired using a different hardware setup. The data is
acquired using a stereo-camera setup from which we
use only the images from the left camera. The
sensorial data is acquired using the CAN Bus from
the vehicle, from which we use the velocity.
We have found that the image-based model (V1)
is able to generalize and predict the ego-vehicle
velocity accurately in most situations, especially
when braking in intersections or decelerating during
driving. Some of the results are presented in figure 12.
Figure 12: Predicting velocity on a different dataset using
only images as input (V1).
We tested the extended model (V2) on the same
dataset, and figure 13 presents examples of the
prediction results.
Figure 13: Prediction of the ego-vehicle speed from the
extended model (V2 - using image and past velocity input)
with sensor failure simulated (0km/h past velocity fed as
input into the model).
Given the fact that the new dataset uses a different hardware setup for the image acquisition process, we can reason that the image-input-only model (V1) is able to properly detect acceleration and deceleration when needed. This is effective in the situations where accidents can occur (intersections, pedestrian crossings, forward collision warning, etc.). When moving at a constant speed, the V1 model predicts an inaccurate speed, due to the difference in frame rates between the training set and the new evaluation set (Nedevschi, 2004). The V2 model, which also features velocity as input, proves to be more robust; the evaluation of both versions on the new dataset is presented in table 4.
Table 4: Velocity (km/h) prediction evaluation of the proposed CNN models on the new dataset (Nedevschi, 2004).

CNN Model | R-squared | MAE (km/h) | RMSE (km/h)
V1        | -0.507    | 15.25      | 19.17
V2        | 0.871     | 4.04       | 5.59
4.4 Brake Pedal Sensor Evaluation
We have also trained the image-only CNN (V1) to predict the brake pedal pressure that needs to be applied, as this signal is available in the additional sensorial data of the dataset. Figure 14 shows some results we have obtained on a small evaluation set.
Figure 14: Evaluation of the brake predictions versus
ground truth from the dataset.
The evaluation is based on the brake pedal pressure sensor from the dataset, which is expressed in kPa: when the brake pedal is not pressed the values are low (close to 0), whereas a fully pressed brake pedal gives a larger value (usually 2-3000 kPa). Evaluating the brake pedal pressure prediction for the V1 model, which uses only images as input into the neural network, gives the following results: an MAE of 245.95, an R-squared of 0.52, and an RMSE of 422.15. An example is presented in figure 15.
We found that the brake pedal pressure prediction is robust even on datasets different from the one used during training, as can be seen in figure 15. The figure shows a case where the model correctly predicts that the brake pedal needs to be pressed while the vehicle is stationary.
Figure 15: Visualisation of the brake predictions and the
ground truth velocity from a different dataset.
5 CONCLUSIONS
The implementation of a CNN model for vehicle speed prediction using sequential image input demonstrates the potential of leveraging the temporal information captured in sequential frames. The model architecture effectively extracts relevant spatial features and captures high-level representations of the input data. The model is evaluated using several evaluation metrics, providing insights into its accuracy and reliability. We extend our model to accept sensorial data as input, and we compare both models with existing published work that uses additional sensorial input. We found that, when the additional sensorial input data is corrupt or erroneous, our approaches are more robust than other methods that leverage such data with recurrent neural networks. We have also tested and evaluated our image-based model for predicting brake pedal pressure given the same sequence of input images, and the results are promising even on unseen data from different datasets. Further experiments can enhance the understanding of the model's capabilities and potentially lead to improvements in vehicle speed prediction for real-world applications, such as video forensics.
ACKNOWLEDGEMENTS
The research was supported by grants from the
Ministry of Research and Innovation, CNCS—
UEFISCDI, project number PN-III-P1-1.1-PD-2021-
0247 and PN-III-P4-ID-PCE2020-1700.
REFERENCES
Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B.,
Flepp, B., Goyal, P., ... & Zhang, X. (2016). End to end
learning for self-driving cars. In arXiv preprint.
arXiv:1604.07316.
Chollet, F., & others. (2015). Keras. GitHub. Retrieved
from https://github.com/fchollet/keras.
Codevilla, F., Müller, M., López, A., Koltun, V., and
Dosovitskiy, A. (2018). End-to-End Driving Via
Conditional Imitation Learning. In 2018 IEEE
International Conference on Robotics and Automation
(ICRA). IEEE Press, pp. 1–9.
Codevilla, F., López, A., and Gaidon, A. (2019). Exploring
the limitations of behavior cloning for autonomous
driving. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), pp. 2485–2494.
Ding, Y., Zhang, Z., Li, Y., and Zhou, X. (2022).
EgoSpeed-net: forecasting speed-control in driver
behavior from egocentric video data. In Proceedings of
the 30th International Conference on Advances in
Geographic Information Systems (SIGSPATIAL '22).
Association for Computing Machinery (12), pp. 1–10.
Dosovitskiy, A., Ros, G., Codevilla, F., López, A., Koltun,
V. (2017). CARLA: An open urban driving simulator.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops (CVPRW),
pp. 953–961.
Gu, Z., Li, Z., Di, X., Shi, R. (2020). An LSTM-Based
Autonomous Driving Model Using a Waymo Open
Dataset. In Appl. Sci. 2020, 10, 2046.
He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep Residual
Learning for Image Recognition. In Computer Vision
and Pattern Recognition, pp. 770–778.
He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017). Mask
R-CNN. In 2017 IEEE International Conference on
Computer Vision (ICCV), pp. 2980-2988.
Hochreiter, S., Schmidhuber, J. (1997). Long short-term
memory. In Neural Computation, 9(8), 1735–1780.
National Highway Traffic Safety Administration, (2023).
United States Department of Transportation. Available
online: https://www.nhtsa.gov/risky-driving/speeding.
Nedevschi, S., Danescu, R., Frentiu, D., Marita, T., Oniga,
F., Pocol, C., Schmidt, R., Graf, T. (2004). High
accuracy stereo vision system for far distance obstacle
detection. In Proceedings of the IEEE Intelligent
Vehicles Symposium, Parma, Italy, 14–17 June 2004,
pp. 292–297.
Ramanishka, V., Chen, Y., Misu, T., Saenko, K. (2018).
Toward Driving Scene Understanding: A Dataset for
Learning Driver Behavior and Causal Reasoning. In
2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), Salt Lake City, UT, USA,
2018, pp. 7699–7707.
Xu, H., Gao, Y., Yu, F., and Darrell, T. (2017). End-to-End
Learning of Driving Models from Large-Scale Video
Datasets. In 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Honolulu, HI,
USA, 2017, pp. 3530–3538.
Xue, H., Huynh, D. Q., Reynolds, M. (2018). SS-LSTM: A
Hierarchical LSTM Model for Pedestrian Trajectory
Prediction. In 2018 IEEE Winter Conference on
Applications of Computer Vision (WACV), Lake Tahoe,
NV, USA, 2018, pp. 1186–1194.