
accuracy using this dataset: his best model classified objects correctly with a success rate of 78.9% (Krizhevsky, 2010). Since then, Mishkin and Matas (2016) have obtained 94.16% accuracy on the CIFAR-10 dataset, Springenberg et al. (2015) have obtained 95.59% accuracy, and the current best performance is by Graham (2014), with an accuracy of 96.53% using max pooling.
There has been strong interest in using the TurtleBot platform for obstacle detection and avoidance. Boucher (2012) used the Point Cloud Library, depth information, and plane-detection algorithms to build obstacle-avoidance methods. High-curvature edge detection was used to locate boundaries between the ground and objects resting on it. Other researchers have applied Deep Learning to obstacle avoidance on the TurtleBot platform.
Tai, Li, and Liu (2016) used depth images as the only input to the deep network for training. They discretized the control commands into outputs such as "go-straightforward", "turning-half-right", and "turning-full-right". The depth image came from a Kinect camera with dimensions of 640 x 480 and was downsampled to 160 x 120. Three processing stages were used, each layered in the order convolution, activation, pooling. The first convolution layer used 32 convolution kernels, each of size 5 x 5, and the network ended with a fully connected layer with one output per discretized movement decision. In all trials the robot never collided with obstacles, and the trained network achieved 80.2% accuracy on the testing set. However, the network was trained on only 1,104 depth images, and the environment in this dataset appears fairly simple: the only "obstacles" seem to be walls or pillars, and the environment was not dynamic. Tai and Liu (2016)
produced a follow-up paper. Instead of a real-world environment, this work was tested in the Gazebo simulator used with the TurtleBot platform. Different types of corridor environments were tested and learned, and a reinforcement learning technique, Q-learning, was paired with Deep Learning. The robot again used depth images, and training was done using Caffe. Other deep reinforcement
learning research included real-world evaluation on a TurtleBot (Tai et al., 2017), dueling double deep Q-networks trained for obstacle avoidance (Xie et al., 2017), and a fully connected neural network mapping inputs to Q-values for obstacle avoidance (Wu et al., 2019).
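One way to make the Tai, Li, and Liu (2016) layering concrete is to track the feature-map shapes stage by stage. The Python sketch below does only that; the kernel counts after stage one, the 2 x 2 pooling, the absence of padding, and the number of output commands are assumptions, since the paper as summarized here specifies only the 160 x 120 input, the three convolution-activation-pooling stages, the 32 kernels of 5 x 5 in stage one, and the fully connected output layer.

```python
# Shape sketch of the depth-image CNN described above.
# Known from the text: 160 x 120 input, three conv-activation-pooling stages,
# 32 kernels of 5 x 5 in stage one, fully connected output layer.
# Assumed: later kernel counts, 2 x 2 pooling, no padding, 5 output commands.

def conv2d_shape(h, w, k, pad=0, stride=1):
    """Output height/width of a square k x k convolution."""
    return ((h + 2 * pad - k) // stride + 1,
            (w + 2 * pad - k) // stride + 1)

def pool_shape(h, w, p=2):
    """Output of non-overlapping p x p pooling."""
    return (h // p, w // p)

h, w = 120, 160                  # Kinect 640 x 480 depth image, downsampled
channels = [32, 32, 64]          # 32 kernels in stage one (paper); rest assumed
for i, c in enumerate(channels, 1):
    h, w = conv2d_shape(h, w, k=5)   # 5 x 5 kernels
    h, w = pool_shape(h, w)          # assumed 2 x 2 pooling
    print(f"stage {i}: {c} feature maps of {h} x {w}")

n_commands = 5                   # e.g. go-straightforward, turning-half-right
fc_inputs = channels[-1] * h * w # flattened input to the final layer
print(f"fully connected: {fc_inputs} inputs -> {n_commands} outputs")
```

Under these assumptions the spatial size falls from 160 x 120 to 16 x 11 over the three stages, leaving a modest flattened input for the fully connected decision layer.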
Tai, Li, and Liu (2017) applied Deep Learning, using several convolutional neural network layers to process depth images, to learn obstacle avoidance for a TurtleBot in the real world. This is very similar to our work, except that they used depth images, the obstacles were just corridor walls, and they trained from scratch instead of using transfer learning as we did.
Our research takes a distinctive approach in comparison to these works. Research like Boucher's does not consider higher-level learning but instead builds on expert systems that detect differentials in the ground plane. By focusing on Deep Learning, our research allows a pattern-based learning approach that is more general and does not need to be explicitly programmed. While Tai et al. used Deep Learning, their dataset was limited to just over 1,100 images. We built our own dataset of over 30,000 images, increasing the effective dataset size by about 28 times. The environment for our research is more complex than the flat surfaces of walls and columns. As in Xie's work, our learning was done on a dataset of raw monocular RGB images, which opens the door to further research with cameras that do not provide depth. Moreover, the images used in our research were dramatically smaller, which allows faster training and faster forward propagation. Lastly, as in a few of these works, our results were tested in the real world rather than in a simulated environment.
2  DEEP LEARNING 
Consider a standard feed-forward artificial neural network, fully connected between layers, being used to process a 100 x 100 pixel image. With 3 color channels, we would have 100 x 100 x 3 = 30,000 inputs to our neural network. This is a large number of inputs for a standard neural network to process, and Deep Learning directly addresses this limitation.
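The input-size arithmetic above, and the parameter savings that convolution brings, can be sketched in a few lines of Python. The hidden-layer width (1,000) and the convolution setup (32 kernels of 5 x 5 x 3) are illustrative assumptions, not values from the text:

```python
# Input count for a fully connected network on a 100 x 100 RGB image.
inputs = 100 * 100 * 3           # 30,000 pixel values feed the first layer

# A fully connected hidden layer needs one weight per input per neuron.
hidden = 1000                    # assumed hidden-layer width
dense_weights = inputs * hidden  # 30,000,000 weights in one layer alone

# A convolution layer shares its kernel weights across the whole image.
kernels, k = 32, 5               # assumed: 32 kernels of size 5 x 5 x 3
conv_weights = kernels * k * k * 3

print(inputs, dense_weights, conv_weights)  # 30000 30000000 2400
```

Even with generous assumptions, the convolution layer carries four orders of magnitude fewer weights than the fully connected one, which is the limitation the next paragraphs address.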
The convolution layer passes convolution windows over the image to produce new, smaller images. The number of images produced can be specified by the programmer, and each new image is generated by a convolution kernel whose entries are the weights. Instead of sending all input values from layer to layer, deep networks are designed to take regions or subsamples of the inputs. For images, this means that instead of receiving every pixel in the entire image as input, each neuron takes only a region of the image as input; full connectivity is reduced to local
NCTA 2019 - 11th International Conference on Neural Computation Theory and Applications