mans for interaction) using the DQL algorithm, our proposed method improved the performance of the DQL algorithm with a memory buffer of fewer than 10,000 stored experiences by almost 20% in terms of the robot's average reward.
2 RELATED WORK
To make a robot move toward a given goal without hitting obstacles, path-planning and rule-based algorithms have been investigated (Fahimi, 2008; LaValle, 2006). Path-planning algorithms are negatively affected by dynamic, constantly changing environments because they require re-planning whenever the environment changes, while rule-based algorithms usually cannot cover all situations. Moreover, neither type of algorithm adapts well to different kinds of environments. With reinforcement learning, however, a robot can learn to execute better actions through a trial-and-error process without any prior history or knowledge (Kaelbling et al., 1996; Sutton and Barto, 1998). Thus, neither rule definitions nor map information is needed in advance to train a moving model.
Reinforcement learning has recently received a lot of attention in many research fields (Mirowski et al., 2016; Sadeghi and Levine, 2016; Jaderberg et al., 2016; Wang et al., 2016). Mnih et al. (Mnih et al., 2013; Mnih et al., 2015) reported a method that combines an artificial neural network with a Q-learning algorithm to estimate the action value. The method was tested by teaching a machine to play Atari games from raw image input. In games like Montezuma's Revenge, however, the machine failed to learn from the replay memory because the small number of positive experiences carried too little weight in training. In another paper, Schaul et al. attempted to make better use of informative experiences with a prioritized sampling method (Schaul et al., 2015). Using the TD-error as the metric to prioritize the experience sampling process, this method learns more efficiently from the rare experiences in the replay memory. However, it does not increase the diversity of experiences in the memory, and rare, useful experiences can be pushed out of a small memory quickly.
To keep the replay memory well diversified and to retain rare, useful experiences longer, our proposed method applies a filtering mechanism in the experience-storing process. The filtering mechanism rejects new experiences that are similar to many experiences already in the memory and stores only those that differ from the existing ones. Even with a limited replay memory size, the neural network can then learn more effectively from a diversified and balanced replay memory. Our proposed method is compatible with the conventional PER method, and the two can be combined to achieve superior performance in terms of the reward obtained after the training process is completed.
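As an illustration only, the following Python sketch shows one way such a similarity filter could be implemented; the Euclidean distance on states, the distance threshold, and the similarity count are hypothetical choices made for the example, not parameters of the method described above.

```python
import numpy as np

def store_with_filter(memory, experience, max_size=10_000,
                      dist_threshold=0.5, max_similar=20):
    """Diversity-filtered storing (illustrative sketch only).

    `experience` is a (state, action, reward, next_state) tuple with
    states given as flat numpy arrays. The new experience is skipped
    when the memory already holds many experiences with a similar state."""
    state = experience[0]
    # Count stored experiences whose state is close to the new state.
    n_similar = sum(
        np.linalg.norm(s - state) < dist_threshold
        for (s, _, _, _) in memory
    )
    if n_similar >= max_similar:
        return False               # region already well represented: drop it
    if len(memory) >= max_size:
        memory.pop(0)              # evict the oldest experience
    memory.append(experience)
    return True
```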
Other research on robotic systems that traverse indoor environments using deep reinforcement learning for collision avoidance focuses on transferring knowledge from simulation to the real world, as in (Sadeghi and Levine, 2016), or on using a recurrent neural network to encode the history of experience, as in (Jaderberg et al., 2016). None of the above-mentioned research has tackled the problem of limited replay memory size.
3 STANDARD DQL METHOD
AND PER METHOD
Deep reinforcement learning represents the Q-
function with a neural network, which takes a state
as input and outputs the corresponding Q-values of
actions in that state. Q-values can be any real values,
which makes it a regression task that can be optimized
with a simple squared error loss (Mnih et al., 2015):
L = \frac{1}{2}\Big[\,\underbrace{r + \gamma \cdot \max_{a'} Q(s', a')}_{\text{target}} - \underbrace{Q(s, a)}_{\text{prediction}}\,\Big]^{2} \qquad (1)
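For concreteness, the following Python sketch computes the loss of Eq. (1) for a single transition; here q_net is assumed to be a callable that maps a state to a vector of Q-values, one per action, and terminal-state handling is omitted for brevity.

```python
import numpy as np

def dql_loss(q_net, s, a, r, s_next, gamma=0.99):
    """Squared-error loss of Eq. (1) for one transition (illustrative sketch).

    q_net(state) is assumed to return a 1-D array of Q-values, one per action."""
    target = r + gamma * np.max(q_net(s_next))   # r + gamma * max_a' Q(s', a')
    prediction = q_net(s)[a]                     # Q(s, a)
    return 0.5 * (target - prediction) ** 2
```

In practice, the target term is treated as a constant when the gradient is taken, for example by computing it with a separate, periodically updated copy of the network (Mnih et al., 2015).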
An experience of a DQL agent is a tuple of <state, action, reward, next state>, hereafter written as <s, a, r, s'>. The DQL agent at state s takes the action a and moves on to the next state s', in which it receives a reward r. In each iteration, a transition experience <s, a, r, s'> is stored in a replay memory. To train the artificial neural network, uniform random samples drawn from the replay memory are used instead of the most recent transition. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum. The PER method changes the standard DQL method from random sampling to prioritized sampling (Schaul et al., 2015).
Experiences with a high TD-error, i.e., those for which the artificial neural network does not yet estimate the Q-values accurately, are more likely to be sampled. There are two variations of the PER method: proportional PER and rank-based PER. In the proportional PER method, the probability of an experience being sampled is proportional to its TD-error. In the rank-based PER method, on the other hand, the probability of an experience being sampled is inversely proportional to the rank of its TD-error in the replay memory.
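As a minimal sketch of the two schemes, assuming the absolute TD-errors of all stored experiences are available as a numpy array, the sampling probabilities could be computed as follows; the priority exponent alpha and the small constant eps are common hyperparameters from the PER literature, not values reported in this paper.

```python
import numpy as np

def sample_proportional(td_errors, batch_size, alpha=0.6, eps=1e-6):
    """Proportional PER (sketch): priority of experience i is |TD-error_i|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

def sample_rank_based(td_errors, batch_size, alpha=0.6):
    """Rank-based PER (sketch): priority of experience i is 1 / rank(i),
    where rank 1 belongs to the largest |TD-error|."""
    ranks = np.empty(len(td_errors), dtype=int)
    ranks[np.argsort(-np.abs(td_errors))] = np.arange(1, len(td_errors) + 1)
    priorities = (1.0 / ranks) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```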