CAN-bus, an inertial measurement unit and GPS. One
upside to the method used was that it did not require
expensive equipment. A downside, on the other hand,
was an adaption phase at the start of each processed
sequence, where the performance was lowered due to
a lack of information. The case of long-time predic-
tions of driver behavior was considered in (Wijnands
et al., 2018). Among the behaviors studied was accel-
eration and whether the speed of the car was within
the legal limits. The study used GPS coordinates of
a test car, sampled at an interval of 30 seconds over a
large period of time, in combination with other infor-
mation.
The use of CNNs to classify the posture of a driver
was tested in (Yan et al., 2015). In that study, im-
ages of the driver's full-body posture were used
to classify four classes: drive normal, re-
sponding to a cell phone, eating & smoking, and oper-
ating the shift gear. With this method, the authors
achieved an accuracy of 99.78% on their
dataset. The same dataset was used in (Yan et al.,
2016), but divided instead into six classes: Call,
Eat, Break, Wheel, Phone play, and Smoke. The clas-
sifications were made using an R-CNN and achieved a
mean average precision of 97.76%.
Optical flow fields were used in our study in
order to improve the accuracy of the CNN. This type
of method was also used in (Simonyan and Zisser-
man, 2014). In that study, both the original images
and the optical flow fields were processed in parallel
and the intermediate results were concatenated at the
end of the network. The study’s results were better
than or close to those of the compared methods, which
included improved dense trajectories and spatio-temporal
HMAX. The results suggested that optical
flow yields better performance than raw
frames.
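As a minimal illustration of the principle behind such flow-based inputs, the sketch below estimates the motion at a single pixel with the classic Lucas-Kanade least-squares fit on a synthetic frame pair. This is not the dense flow method used in (Simonyan and Zisserman, 2014) or in our study; the function name, window size, and test data are illustrative assumptions.

```python
import numpy as np

def lucas_kanade_flow(prev, curr, x, y, win=7):
    """Estimate the (dx, dy) motion at pixel (x, y) between two
    grayscale frames via the Lucas-Kanade least-squares fit."""
    half = win // 2
    # Spatial gradients of the previous frame; temporal gradient between frames.
    Iy, Ix = np.gradient(prev.astype(float))
    It = curr.astype(float) - prev.astype(float)
    # Stack the gradients from the local window into a linear system A v = b,
    # where v = (dx, dy) is the flow vector at the query pixel.
    sl = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
    b = -It[sl].ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v  # (dx, dy)

# Synthetic test: a smooth Gaussian blob shifted one pixel to the right,
# so the recovered flow at the blob center should be roughly (1, 0).
xs, ys = np.meshgrid(np.arange(64), np.arange(64))
blob = np.exp(-((xs - 32) ** 2 + (ys - 32) ** 2) / 50.0)
shifted = np.exp(-((xs - 33) ** 2 + (ys - 32) ** 2) / 50.0)
dx, dy = lucas_kanade_flow(blob, shifted, 32, 32)
```

Dense flow methods apply this idea (or variational refinements of it) at every pixel, producing the two-channel flow fields that a network can consume alongside the raw frames.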
To the best of our knowledge, there are no previous
studies investigating predictions of a driver’s actions
inside a car with the use of neural networks. Under-
standing the driver’s intent and thus the future state of
the driver provides a possibility to further anticipate
the driver’s readiness to react to new conditions in a
traffic situation. Hence, valuable information about
how well a driver could handle a driving task or react
to changes in the environment can be gained. There-
fore, warning systems could potentially make use of
such information to provide earlier and more accurate
warnings.
In (Jain et al., 2016) the method used to process
facial images was based on a landmark representation
of the face. It might be possible to generalize this
method from facial landmarks to landmarks of the
whole-body posture. More information about the
driver could then be exploited, since the full-body
posture conveys more than the face alone.
The architecture proposed in the presented study
revolved around two types of neural networks, one
used for classification of images and the other to pre-
dict future events based on sequences of outputs from
the first network. A CNN was used for image clas-
sification, while an RNN with LSTM units performed
the action and intention recognition. The training
and testing images for the networks depicted the
whole-body posture of the driver.
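To make this two-stage pipeline concrete, the sketch below feeds a sequence of per-frame class-score vectors (standing in for the CNN outputs, with eight scores matching the eight behavior classes) through a single minimal LSTM cell written in NumPy. The hidden size, random weights, and sequence length are illustrative assumptions, not the trained networks from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step; all four gates are computed from [x, h] at once."""
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b           # shape (4H,)
    i, f, o = (sigmoid(z[k * H:(k + 1) * H]) for k in range(3))
    g = np.tanh(z[3 * H:])                        # candidate cell update
    c = f * c + i * g                             # new cell state
    h = o * np.tanh(c)                            # new hidden state
    return h, c

# Hypothetical sizes: 8 per-frame class scores in, 16 hidden units, 30 frames.
C, H, T = 8, 16, 30
W = rng.normal(scale=0.1, size=(4 * H, C + H))
b = np.zeros(4 * H)

# Stand-in for the CNN's softmax outputs over a 30-frame sequence.
cnn_seq = rng.random((T, C))
cnn_seq /= cnn_seq.sum(axis=1, keepdims=True)

h, c = np.zeros(H), np.zeros(H)
for x in cnn_seq:
    h, c = lstm_step(x, h, c, W, b)

# A final linear readout on the last hidden state would score the
# predicted upcoming action (one score per class).
W_out = rng.normal(scale=0.1, size=(C, H))
pred = W_out @ h
```

The key design point is that the CNN handles per-frame appearance while the LSTM accumulates temporal context across frames, so the prediction at the end of a sequence can anticipate the driver's next action rather than merely label the current one.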
2 DATA
There are currently few public datasets containing
image sequences of drivers. One of the datasets
was created in (Abouelnaga et al., 2017). The fo-
cus of that dataset is distracted driving, and it con-
tains the classes: Drive Safe, Talk Passenger, Text
Right, Drink, Talk Left, Text Left, Talk Right, Adjust
Radio, Hair & Makeup, and Reach Behind. Unfortu-
nately, this dataset is not in a sequential form where
one class leads to another to create a chain of actions,
and it could therefore not be used in this study. An-
other dataset, created by (Jain et al., 2015), consists
of 1180 miles of driver data which is annotated with
turns, lane changes and drive straight. This dataset,
even though it contains images of a driver, does not
focus on driver activity. Therefore, the dataset was
not suitable for this study.
The dataset used in this study consists of se-
quences of images, collected using two cameras
mounted inside of a car. During data collection, one
camera was placed in the left A-pillar facing the driver
and the other camera on the side window of the front
passenger seat. For safety reasons, the car remained
parked during data collection. There were
eight participants performing 13 tasks in a Volvo V40
Cross Country. In total eight classes of driver behav-
ior were used: drive safe, glance, lean, remove a hand
from the steering wheel, reach, grab, retract and hold
object. Examples of these classes can be found in Fig-
ure 1. The drivers were instructed to perform tasks in
a specific order, for example, start by driving safely
then glance, remove a hand from the steering wheel,
lean towards the center then proceed by picking up
an object. Each task was performed five times by
each participant and lasted between 1.5 and 13 sec-
onds. Around 90,000 images were collected. Due to
technical difficulties, the frame rate was only ap-
proximately 30 frames per second. The dataset was
then divided into two separate sets where the training
and testing sets consisted of 55 and 10 sequences, re-
spectively.
Using Recurrent Neural Networks for Action and Intention Recognition of Car Drivers