CNNs. For example, (Mohan, 2014) presents a deep deconvolutional network architecture that incorporates spatial information around each pixel in the labeling process. (Brust et al., 2015) and (Mendes et al., 2016) extract small image patches from each frame and feed them into a trained CNN that labels the center pixel of each patch as either road or non-road. These algorithms exploit the spatial information associated with each pixel in order to label that pixel with high confidence. (Oliveira et al., 2016) present a road segmentation method aimed at a better trade-off between accuracy and speed. They introduce a CNN with deep deconvolutional layers that improves network performance. A further advantage of this method is that it takes the entire image as input rather than separate patches of each frame, which makes the algorithm faster and more efficient.
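The patch-based scheme of (Brust et al., 2015) and (Mendes et al., 2016) can be sketched as follows. This is a minimal illustration, not their implementation: `classify_patch` is a hypothetical stand-in for the trained CNN (here a simple intensity threshold), and the patch size and padding scheme are assumptions.

```python
# Sketch of patch-based road labeling: each pixel is labeled road / non-road
# by classifying a small patch centred on it.

def extract_patch(image, row, col, half=2):
    """Return the (2*half+1) x (2*half+1) patch centred at (row, col).

    `image` is a list of lists of grey values; out-of-range pixels are
    zero-padded so that border pixels can be labeled too.
    """
    h, w = len(image), len(image[0])
    patch = []
    for r in range(row - half, row + half + 1):
        prow = []
        for c in range(col - half, col + half + 1):
            prow.append(image[r][c] if 0 <= r < h and 0 <= c < w else 0)
        patch.append(prow)
    return patch

def classify_patch(patch):
    # Hypothetical classifier: a trained CNN would go here. For the sketch
    # we call a patch "road" when its mean intensity is below a threshold.
    flat = [v for row in patch for v in row]
    return "road" if sum(flat) / len(flat) < 128 else "non-road"

def label_image(image):
    """Label every pixel by classifying the patch around it."""
    return [[classify_patch(extract_patch(image, r, c))
             for c in range(len(image[0]))]
            for r in range(len(image))]
```

Note the design trade-off the text describes: because every pixel requires its own forward pass over a patch, this approach is much slower than whole-image methods such as (Oliveira et al., 2016).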
Steering the Robot Directly. Work here can be traced back to the late 1980s and systems such as ALVINN (Pomerleau, 1989). This approach uses a set of driver's actions, captured on different road types such as single-lane and multi-lane paved roads as well as unpaved off-road trails, to train a neural network that returns a suitable steering command to the robot. Work following this basic strategy continues today. To take but one recent example, NVIDIA (Bojarski et al., 2016) designed a system that trains a CNN on camera frames paired with the steering angle of a human driver for each frame. An instrumented car is outfitted with a camera that simulates the human driver's view of the road, and a human drives the vehicle while the camera input and steering commands are recorded. A CNN is then trained on this dataset using the human steering angle as ground truth.
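The training loop behind this behaviour-cloning idea can be sketched as below. This is an illustrative sketch only: a linear model trained by stochastic gradient descent stands in for the CNN of (Bojarski et al., 2016), and frames are reduced to small feature vectors so the example stays self-contained.

```python
# Behaviour-cloning sketch: (frame features, human steering angle) pairs are
# fit by regression, with the human angle serving as ground truth.

def train_steering_model(frames, angles, lr=0.01, epochs=2000):
    """Fit weights w so that dot(w, frame) approximates the human angle."""
    n = len(frames[0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in zip(frames, angles):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            # gradient step on the squared error (pred - y)**2
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def predict_angle(w, frame):
    """Predicted steering angle for a new frame."""
    return sum(wi * xi for wi, xi in zip(w, frame))
```

In the real system the linear model is replaced by a deep CNN operating on raw pixels, but the supervision signal is exactly this: the human's steering angle per frame.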
CNNs have also been used to drive a robot on off-road trails; one such approach is presented in (Giusti et al., 2016). This work uses a machine learning approach to follow a forest trail. Rather than mapping the image directly to a steering angle, it categorizes the input image as 'straight', 'left' or 'right' and then uses the distribution of likelihoods over these three categories to compute both a steering angle and an appropriate vehicle speed. To collect training data, three cameras are mounted on the head of a hiker, yawed 30 degrees from each other: one pointing 'forward', one yawed to the right, and one yawed to the left. Data is collected while the 'forward' camera is aligned with the trail.
Table 1: TrailNet dataset training hyperparameters.

    Training hyperparameter             Value
    epochs                              5
    learning rate                       0.02
    train batch size                    100
    validation batch size               100
    random brightness                   ±15%
    # images per label for training     5000
    # images per label for validation   625
    # images per label for testing      625
Their dataset consists of 8 hours of video on forest-like trails, captured so that the hiker always looks in the direction of the trail. Therefore, for the classification task, the central camera's images are labeled "go straight", the left camera's are labeled "turn right", and the right camera's are labeled "turn left". These labels are then used to train a CNN that performs the classification and outputs a probability for each class via a softmax function. They used a nine-layer neural network for this task: four alternating convolutional and max-pooling layers, followed by a 2,000-neuron fully connected layer and finally a classification (output) layer with three neurons that returns the probability of each label. This DNN is based on the architecture used in (Ciregan et al., 2012). For evaluation purposes, the accuracy of the classifier is computed from the maximum softmax probability at the output of the DNN. The reported accuracy is 85.2% for the three-way classification between "go straight", "turn right" and "turn left".
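The final step, turning the three softmax probabilities into a steering command, can be sketched as below. The mapping used here (signed probability difference for the angle, "straight" confidence for the speed) is an illustrative assumption, not the exact control rule of (Giusti et al., 2016); the maximum angle and speed parameters are likewise hypothetical.

```python
import math

# Sketch: map the three class probabilities (turn left / go straight /
# turn right) produced by the network's softmax output to a steering command.

def softmax(logits):
    """Numerically stable softmax over a list of raw network outputs."""
    exps = [math.exp(v - max(logits)) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steering_command(logits, max_angle_deg=30.0, max_speed=1.0):
    """Return (steering angle in degrees, speed) from class logits.

    Assumed mapping: angle proportional to P(right) - P(left), with
    positive meaning steer right; speed scaled by P(straight), so the
    vehicle slows down when the network is unsure of the trail direction.
    """
    p_left, p_straight, p_right = softmax(logits)
    angle = max_angle_deg * (p_right - p_left)
    speed = max_speed * p_straight
    return angle, speed
```

Using the whole probability distribution rather than only the arg-max class is what lets this scheme produce graded steering angles and speeds from a three-way classifier.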
3 TrailNet DATASET
Key to a CNN-based approach is an appropriate dataset with which to train the network. For this work we collected the TrailNet dataset of different trails under various trail and imaging conditions (its capture is inspired by the approach presented in (Giusti et al., 2016)). TrailNet consists of images of different trail types captured with wide field of view cameras, where each class of images has a certain deviation angle from the heading direction of the trail. In order to study the effect of trail surface type, the TrailNet dataset is further divided by trail type: (1) asphalt, (2) concrete, (3) dirt, and (4) gravel. TrailNet was captured with three omnidirectional cameras.
TrailNet is available for public use at http://vgr.lab.yorku.ca/tools/trailnet
ICINCO 2018 - 15th International Conference on Informatics in Control, Automation and Robotics