accuracy using this dataset as his best model
classified objects correctly with a success rate of
78.9% (Krizhevsky, 2010). Since then, Mishkin and
Matas (2016) have obtained 94.16% accuracy on the
CIFAR10 dataset, Springenberg et al. (2015) have
obtained 95.59% accuracy, and the current best
performance is by Graham (2014) with an accuracy of
96.53% using fractional max-pooling.
There has been strong interest in using the
TurtleBot platform for obstacle detection and
avoidance. Boucher (2012) used the Point Cloud
Library and depth information along with plane
detection algorithms to develop obstacle avoidance
methods. High-curvature edge detection was used to
locate boundaries between the ground and objects that
rest on the ground. Other researchers have considered
the use of Deep Learning for the purpose of obstacle
avoidance using the TurtleBot platform.
Tai, Li, and Liu (2016) used depth images as the
only input into the deep network for training
purposes. They discretized control commands with
outputs such as: “go-straightforward”, “turning-half-
right”, “turning-full-right”, etc. The depth image was
from a Kinect camera with dimensions of 640 x 480.
This image was downsampled to 160 x 120. Three
processing stages were used, each ordered as
convolution, activation, then pooling.
The first convolution layer used 32 convolution
kernels, each of size 5 x 5. The final layer included a
fully-connected layer with outputs for each
discretized movement decision. In all trials, the robot
never collided with obstacles, and the trained network
achieved 80.2% accuracy on the testing set. Their
network was trained on only 1104 depth images. The
environment used in this dataset appears fairly simple,
in that the only “obstacles” seem to be walls or pillars,
and the environment was not dynamic.
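For illustration only, a network of the general shape described above could be sketched as follows (in PyTorch); the filter counts after the first layer, the choice of ReLU and max pooling, and the number of output commands are assumptions rather than details reported by Tai, Li, and Liu (2016).

    import torch
    import torch.nn as nn

    class DepthCommandNet(nn.Module):
        # Three convolution-activation-pooling stages on a single-channel
        # 160 x 120 depth image, followed by one fully-connected output layer.
        # Filter counts after the first layer, the ReLU/max-pooling choices,
        # and the number of output commands are illustrative assumptions.
        def __init__(self, num_commands=5):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=5),   # first stage: 32 kernels of 5 x 5
                nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=5),  # second stage (filter count assumed)
                nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 64, kernel_size=5),  # third stage (filter count assumed)
                nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.LazyLinear(num_commands)  # one output per discretized command

        def forward(self, x):                      # x: (batch, 1, 120, 160) depth images
            x = self.features(x)
            return self.classifier(torch.flatten(x, 1))

    logits = DepthCommandNet()(torch.randn(1, 1, 120, 160))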
Tai and Liu (2016) published a follow-up to this study.
Instead of a real-world environment, it was tested in
Gazebo, a simulated environment commonly used
with the TurtleBot platform. Different types of corridor
environments were tested and learned. The
reinforcement learning technique Q-learning was
combined with Deep Learning. The
robot, once again, used depth images and the training
was done using Caffe. Other deep reinforcement
learning research included real-world evaluation on a
TurtleBot (Tai et al., 2017), using dueling deep
double Q networks trained to learn obstacle
avoidance (Xie et al., 2017), and using a fully
connected NN to map to Q-values for obstacle
avoidance (Wu et al., 2019).
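The update rule shared by these deep reinforcement learning approaches is the standard Q-learning update, illustrated below in a minimal tabular form; the action names echo the discretized commands mentioned above, while the state representation, reward, and hyperparameter values are placeholders, since the cited works replace the table with a deep network that maps depth images to Q-values.

    import random
    from collections import defaultdict

    ACTIONS = ["go-straight", "turn-half-right", "turn-full-right"]
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration (assumed values)

    q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

    def choose_action(state):
        # Epsilon-greedy selection over the discrete motion commands.
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(q_table[state], key=q_table[state].get)

    def q_update(state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(q_table[next_state].values())
        q_table[state][action] += ALPHA * (reward + GAMMA * best_next
                                           - q_table[state][action])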
Tai, Li, and Liu (2017) applied Deep Learning
using several convolutional neural network layers to
process depth images in order to learn obstacle
avoidance for a TurtleBot in the real world. This is
very similar to our work, except that they used depth
images, their obstacles were limited to a corridor, and
they trained from scratch instead of using transfer
learning as we did.
Our research provides a distinctive approach in
comparison to these works. Research like Boucher’s
does not consider higher level learning, but instead
builds upon advanced expert systems, which can
detect differentials in the ground plane. By focusing
on Deep Learning, our research allows a pattern-based
learning approach that is more general and does not
need to be explicitly programmed. While Tai et al.
used Deep Learning, their dataset was limited to just
over 1100 images. We built our own dataset of over
30,000 images, increasing the effective dataset size by
about 28 times. The environment
for our research is more complex than just the flat
surfaces of walls and columns. As in Xie’s work, the
learning in our research was done on a dataset of raw
monocular RGB images. This
opens the door to further research with cameras that
do not have depth. Moreover, the images used in our
research were dramatically smaller, which also allows
faster training and faster forward propagation. Lastly,
similar to a few of
these works, the results of our work were tested in the
real world as opposed to a simulated environment.
2 DEEP LEARNING
Consider a standard feed-forward artificial neural
network, fully connected between layers, being used
to process a 100 x 100 pixel image. With
3 color channels, we would have 100 x 100 x 3 or
30,000 inputs to our neural network. This is a large
number of inputs for a standard neural network to
process. Deep Learning directly addresses this
limitation.
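To make the scale concrete, the short sketch below counts the weights of a single fully-connected hidden layer for such an image; the hidden-layer width of 1,000 units is an arbitrary illustrative choice, not a value from our network.

    # Weight count for one fully-connected layer on a 100 x 100 RGB image.
    # The hidden-layer width (1,000 units) is an illustrative assumption.
    height, width, channels = 100, 100, 3
    inputs = height * width * channels     # 30,000 input values
    hidden_units = 1000
    weights = inputs * hidden_units        # 30,000,000 weights in a single layer
    print(inputs, weights)                 # 30000 30000000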
The convolution layer passes convolution
windows over the image to produce new, smaller
images. The number of images produced can be
specified by the programmer. Each new image is
produced by its own convolution kernel, whose entries
are the weights. Instead of sending all input values from layer
to layer, deep networks are designed to take regions
or subsamples of inputs. For images this means that
instead of sending all pixels in the entire image as
inputs, different neurons will only take regions of the
image as inputs – full connectivity is reduced to local