accuracy using this dataset as his best model
classified objects correctly with a success rate of
78.9% (Krizhevsky, 2010). Since then, Mishkin and
Matas (2016) have obtained 94.16% accuracy on the
CIFAR10 dataset, Springenberg et al. (2015) have
obtained 95.59% accuracy, and the current best
performance is by Graham (2014) with an accuracy of
96.53% using fractional max-pooling.
There has been strong interest in using the
TurtleBot platform for obstacle detection and
avoidance. Boucher (2012) used the Point Cloud
Library and depth information along with plane
detection algorithms to develop obstacle avoidance
methods. High-curvature edge detection was used to
locate boundaries between the ground and objects that
rest on the ground. Other researchers have considered
the use of Deep Learning for the purpose of obstacle
avoidance using the TurtleBot platform.
Tai, Li, and Liu (2016) used depth images as the
only input into the deep network for training
purposes. They discretized control commands with
outputs such as: “go-straightforward”, “turning-half-
right”, “turning-full-right”, etc. The depth image was
from a Kinect camera with dimensions of 640 x 480.
This image was downsampled to 160 x 120. Three
processing stages were used, each ordered as
convolution, activation, then pooling.
The first convolution layer used 32 convolution
kernels, each of size 5 x 5. The final layer included a
fully-connected layer with outputs for each
discretized movement decision. In all trials, the robot
never collided with obstacles, and the trained network
achieved 80.2% accuracy on the testing set. Their
network was trained on only 1104 depth images. The
environment used in this dataset appears fairly simple,
in that the only “obstacles” seem to be walls or pillars,
and the environment was not dynamic.
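For illustration only, a network of the general shape described above could be sketched as follows (in PyTorch); the filter counts after the first layer, the choice of ReLU and max pooling, and the number of output commands are assumptions rather than details reported by Tai, Li, and Liu (2016).

    import torch
    import torch.nn as nn

    class DepthCommandNet(nn.Module):
        # Three convolution-activation-pooling stages on a single-channel
        # 160 x 120 depth image, followed by one fully-connected output layer.
        # Filter counts after the first layer, the ReLU/max-pooling choices,
        # and the number of output commands are illustrative assumptions.
        def __init__(self, num_commands=5):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=5),   # first stage: 32 kernels of 5 x 5
                nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=5),  # second stage (filter count assumed)
                nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 64, kernel_size=5),  # third stage (filter count assumed)
                nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.LazyLinear(num_commands)  # one output per discretized command

        def forward(self, x):                      # x: (batch, 1, 120, 160) depth images
            x = self.features(x)
            return self.classifier(torch.flatten(x, 1))

    logits = DepthCommandNet()(torch.randn(1, 1, 120, 160))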
Tai and Liu (2016) published a follow-up to this study.
Instead of a real-world environment, it was tested in
Gazebo, a simulated environment commonly used
with the TurtleBot platform. Different types of corridor
environments were tested and learned. The
reinforcement learning technique Q-learning was
combined with Deep Learning. The
robot, once again, used depth images and the training
was done using Caffe. Other deep reinforcement
learning research included real-world evaluation on a
TurtleBot (Tai et al., 2017), using dueling deep
double Q networks trained to learn obstacle
avoidance (Xie et al., 2017), and using a fully
connected NN to map to Q-values for obstacle
avoidance (Wu et al., 2019).
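The update rule shared by these deep reinforcement learning approaches is the standard Q-learning update, illustrated below in a minimal tabular form; the action names echo the discretized commands mentioned above, while the state representation, reward, and hyperparameter values are placeholders, since the cited works replace the table with a deep network that maps depth images to Q-values.

    import random
    from collections import defaultdict

    ACTIONS = ["go-straight", "turn-half-right", "turn-full-right"]
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration (assumed values)

    q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

    def choose_action(state):
        # Epsilon-greedy selection over the discrete motion commands.
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(q_table[state], key=q_table[state].get)

    def q_update(state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(q_table[next_state].values())
        q_table[state][action] += ALPHA * (reward + GAMMA * best_next
                                           - q_table[state][action])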
Tai, Li, and Liu (2017) applied Deep Learning
using several convolutional neural network layers to
process depth images in order to learn obstacle
avoidance for a TurtleBot in the real world. This is
very similar to our work, except that they used depth
images, their obstacles were limited to a corridor, and
they trained from scratch instead of using transfer
learning as we did.
Our research provides a distinctive approach in
comparison to these works. Research like Boucher’s
does not consider higher level learning, but instead
builds upon advanced expert systems, which can
detect differentials in the ground plane. By focusing
on Deep Learning, our research allows a pattern-based
learning approach that is more general and does not
need to be explicitly programmed. While Tai et al.
used Deep Learning, their dataset was limited to just
over 1100 images. We built our own dataset of over
30,000 images, increasing the effective dataset size by
about 28 times. The environment
for our research is more complex than just the flat
surfaces of walls and columns. As in Xie’s work, the
learning in our research was done on a dataset of raw
monocular RGB images. This
opens the door to further research with cameras that
do not have depth. Moreover, the images used in our
research were dramatically smaller, which also allows
faster training and faster forward propagation. Lastly,
similar to a few of
these works, the results of our work were tested in the
real world as opposed to a simulated environment.
2 DEEP LEARNING
Consider a standard feed-forward artificial neural
network, fully connected between layers, being used
to process a 100 x 100 pixel image. With
3 color channels, we would have 100 x 100 x 3 or
30,000 inputs to our neural network. This is a large
number of inputs for a standard neural network to
process. Deep Learning directly addresses this
limitation.
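To make the scale concrete, the short sketch below counts the weights of a single fully-connected hidden layer for such an image; the hidden-layer width of 1,000 units is an arbitrary illustrative choice, not a value from our network.

    # Weight count for one fully-connected layer on a 100 x 100 RGB image.
    # The hidden-layer width (1,000 units) is an illustrative assumption.
    height, width, channels = 100, 100, 3
    inputs = height * width * channels     # 30,000 input values
    hidden_units = 1000
    weights = inputs * hidden_units        # 30,000,000 weights in a single layer
    print(inputs, weights)                 # 30000 30000000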
The convolution layer passes convolution
windows over the image to produce new, smaller
images. The number of images produced can be
specified by the programmer. Each new image is
produced by its own convolution kernel, whose entries
are the weights. Instead of sending all input values from layer
to layer, deep networks are designed to take regions
or subsamples of inputs. For images this means that
instead of sending all pixels in the entire image as
inputs, different neurons will only take regions of the
image as inputs – full connectivity is reduced to local