CNNs. For example, (Mohan, 2014) presents a deep deconvolutional network architecture that incorporates spatial information around each pixel in the labeling process. (Brust et al., 2015) and (Mendes et al., 2016) extract small image patches from each frame and feed them into a trained CNN that labels the center pixel of each patch as either road or non-road. These algorithms exploit the spatial information associated with each pixel in order to label that pixel with high confidence. (Oliveira et al., 2016) present a road segmentation method aimed at a better trade-off between accuracy and speed. They introduce a CNN with deep deconvolutional layers that improves network performance. A further advantage of this method is that it takes the entire image as input rather than separate patches of each frame, which makes the algorithm faster and more efficient.
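The patch-based scheme of (Brust et al., 2015) and (Mendes et al., 2016) can be sketched as follows. This is a minimal illustration, not their implementation: `classify_patch` is a hypothetical stand-in for the trained CNN (here a simple intensity threshold), and the patch size and padding scheme are assumptions.

```python
# Sketch of patch-based road labeling: each pixel is labeled road / non-road
# by classifying a small patch centred on it.

def extract_patch(image, row, col, half=2):
    """Return the (2*half+1) x (2*half+1) patch centred at (row, col).

    `image` is a list of lists of grey values; out-of-range pixels are
    zero-padded so that border pixels can be labeled too.
    """
    h, w = len(image), len(image[0])
    patch = []
    for r in range(row - half, row + half + 1):
        prow = []
        for c in range(col - half, col + half + 1):
            prow.append(image[r][c] if 0 <= r < h and 0 <= c < w else 0)
        patch.append(prow)
    return patch

def classify_patch(patch):
    # Hypothetical classifier: a trained CNN would go here. For the sketch
    # we call a patch "road" when its mean intensity is below a threshold.
    flat = [v for row in patch for v in row]
    return "road" if sum(flat) / len(flat) < 128 else "non-road"

def label_image(image):
    """Label every pixel by classifying the patch around it."""
    return [[classify_patch(extract_patch(image, r, c))
             for c in range(len(image[0]))]
            for r in range(len(image))]
```

Note the design trade-off the text describes: because every pixel requires its own forward pass over a patch, this approach is much slower than whole-image methods such as (Oliveira et al., 2016).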
Steering the Robot Directly. Work here can be traced back to the late 1980s and systems such as ALVINN (Pomerleau, 1989). This approach uses a set of driver's actions, captured on different road types such as single-lane and multi-lane paved roads as well as unpaved off-road trails, to train a neural network that returns a suitable steering command to the robot. Work following this basic strategy continues today. To take but one recent example, NVIDIA (Bojarski et al., 2016) designed a system that trains a CNN on camera frames paired with the steering angle of a human driver for each frame. An instrumented car is outfitted with a camera that simulates the human driver's view of the road, and a human drives the vehicle while the camera input and steering commands are recorded. A CNN is then trained on this dataset using the human steering angle as ground truth.
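The training loop behind this behaviour-cloning idea can be sketched as below. This is an illustrative sketch only: a linear model trained by stochastic gradient descent stands in for the CNN of (Bojarski et al., 2016), and frames are reduced to small feature vectors so the example stays self-contained.

```python
# Behaviour-cloning sketch: (frame features, human steering angle) pairs are
# fit by regression, with the human angle serving as ground truth.

def train_steering_model(frames, angles, lr=0.01, epochs=2000):
    """Fit weights w so that dot(w, frame) approximates the human angle."""
    n = len(frames[0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in zip(frames, angles):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            # gradient step on the squared error (pred - y)**2
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def predict_angle(w, frame):
    """Predicted steering angle for a new frame."""
    return sum(wi * xi for wi, xi in zip(w, frame))
```

In the real system the linear model is replaced by a deep CNN operating on raw pixels, but the supervision signal is exactly this: the human's steering angle per frame.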
CNNs have also been used to drive a robot on off-road trails; one such approach is presented in (Giusti et al., 2016). This work uses a machine learning approach to follow a forest trail. Rather than mapping the image directly to a steering angle, it categorizes the input image as 'straight', 'left' or 'right' and then uses the distribution of likelihoods over these three categories to compute both a steering angle and an appropriate vehicle speed. To collect training data, three cameras are mounted on the head of a hiker, yawed 30 degrees from each other: one pointing 'forward', one yawed to the right, and one yawed to the left. Data is collected while the 'forward' camera is aligned with the trail.
Table 1: TrailNet dataset training hyperparameters.

    Training hyperparameter             Value
    epochs                              5
    learning rate                       0.02
    train batch size                    100
    validation batch size               100
    random brightness                   ±15%
    # images per label for training     5000
    # images per label for validation   625
    # images per label for testing      625
Their dataset consists of 8 hours of video on forest-like trails, captured so that the hiker always looks in the direction of the trail. Therefore, for the classification task, the central camera's images are labeled "go straight", the left camera's are labeled "turn right", and the right camera's are labeled "turn left". These labels are then used to train a CNN that performs the classification and outputs a probability for each class via a softmax function. They used a nine-layer neural network for this task: four alternating convolutional and max-pooling layers, followed by a 2,000-neuron fully connected layer and finally a classification (output) layer with three neurons that returns the probability of each label. This DNN is based on the architecture used in (Ciregan et al., 2012). For evaluation purposes, the accuracy of the classifier is computed from the maximum softmax probability at the output of the DNN. The reported accuracy is 85.2% for the three-way classification between "go straight", "turn right" and "turn left".
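The final step, turning the three softmax probabilities into a steering command, can be sketched as below. The mapping used here (signed probability difference for the angle, "straight" confidence for the speed) is an illustrative assumption, not the exact control rule of (Giusti et al., 2016); the maximum angle and speed parameters are likewise hypothetical.

```python
import math

# Sketch: map the three class probabilities (turn left / go straight /
# turn right) produced by the network's softmax output to a steering command.

def softmax(logits):
    """Numerically stable softmax over a list of raw network outputs."""
    exps = [math.exp(v - max(logits)) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steering_command(logits, max_angle_deg=30.0, max_speed=1.0):
    """Return (steering angle in degrees, speed) from class logits.

    Assumed mapping: angle proportional to P(right) - P(left), with
    positive meaning steer right; speed scaled by P(straight), so the
    vehicle slows down when the network is unsure of the trail direction.
    """
    p_left, p_straight, p_right = softmax(logits)
    angle = max_angle_deg * (p_right - p_left)
    speed = max_speed * p_straight
    return angle, speed
```

Using the whole probability distribution rather than only the arg-max class is what lets this scheme produce graded steering angles and speeds from a three-way classifier.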
3 TrailNet DATASET
Key to a CNN-based approach is an appropriate dataset with which to train the network. For this work we collected the TrailNet dataset of different trails under various trail and imaging conditions (its capture is inspired by the approach presented in (Giusti et al., 2016)). TrailNet consists of images of different trail types captured with wide field of view cameras, where each class of images has a certain deviation angle from the heading direction of the trail. In order to study the effect of trail surface type, the TrailNet dataset is further divided by trail type: (1) asphalt, (2) concrete, (3) dirt, and (4) gravel. TrailNet was captured with three omnidirectional cameras.
TrailNet is available for public use at http://vgr.lab.yorku.ca/tools/trailnet
ICINCO 2018 - 15th International Conference on Informatics in Control, Automation and Robotics