mans for interaction) using the DQL algorithm, our proposed method improved the performance of the DQL algorithm with a memory buffer of fewer than 10,000 stored experiences by almost 20% in terms of the robot's average reward.
2 RELATED WORK
To make a robot move toward a given goal without hitting obstacles, path-planning and rule-based algorithms have been investigated (Fahimi, 2008; LaValle, 2006). Path-planning algorithms are negatively affected by dynamic, constantly changing environments because they require re-planning whenever the environment changes, while rule-based algorithms usually cannot cover all situations. Moreover, neither type of algorithm adapts well to different kinds of environments. With reinforcement learning, however, a robot can learn to execute better actions through a trial-and-error process without any prior history or knowledge (Kaelbling et al., 1996; Sutton and Barto, 1998). Thus, neither rule definitions nor map information is needed in advance to train a moving model.
Reinforcement learning has recently received a lot of attention in many research fields (Mirowski et al., 2016; Sadeghi and Levine, 2016; Jaderberg et al., 2016; Wang et al., 2016). Mnih et al. (Mnih et al., 2013; Mnih et al., 2015) reported a method that combines an artificial neural network with a Q-learning algorithm to estimate the action value. The method was tested by teaching a machine to play Atari games from raw image input. In games like Montezuma's Revenge, however, the machine failed to learn from the replay memory because the small number of positive experiences carried too little weight in training. In another paper, Schaul et al. attempted to make better use of informative experiences with a prioritized sampling method (Schaul et al., 2015). Using the TD-error as the metric to prioritize the experience sampling process, this method learns more efficiently from the rare experiences in the replay memory. However, it does not increase the diversity of experiences in the memory, and rare, useful experiences can be pushed out of a small memory quickly.
To keep the replay memory well diversified and to retain rare, useful experiences longer, our proposed method applies a filtering mechanism in the experience-storing process. The filtering mechanism rejects new experiences that are similar to many experiences already in the memory and stores only those that differ from the existing ones. Even with a limited replay memory size, the neural network can then learn more effectively from a diversified and balanced replay memory. Our proposed method is compatible with the conventional PER method, and the two can be combined to achieve superior performance in terms of the reward obtained after the training process is completed.
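As an illustration only, the following Python sketch shows one way such a similarity filter could be implemented; the Euclidean distance on states, the distance threshold, and the similarity count are hypothetical choices made for the example, not parameters of the method described above.

```python
import numpy as np

def store_with_filter(memory, experience, max_size=10_000,
                      dist_threshold=0.5, max_similar=20):
    """Diversity-filtered storing (illustrative sketch only).

    `experience` is a (state, action, reward, next_state) tuple with
    states given as flat numpy arrays. The new experience is skipped
    when the memory already holds many experiences with a similar state."""
    state = experience[0]
    # Count stored experiences whose state is close to the new state.
    n_similar = sum(
        np.linalg.norm(s - state) < dist_threshold
        for (s, _, _, _) in memory
    )
    if n_similar >= max_similar:
        return False               # region already well represented: drop it
    if len(memory) >= max_size:
        memory.pop(0)              # evict the oldest experience
    memory.append(experience)
    return True
```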
Other research on robotic systems that traverse indoor environments using deep reinforcement learning for collision avoidance focuses on transferring knowledge from simulation to the real world, as in (Sadeghi and Levine, 2016), or on using a recurrent neural network to encode the history of experience, as in (Jaderberg et al., 2016). None of the above-mentioned research has tackled the problem of limited replay memory size.
3 STANDARD DQL METHOD
AND PER METHOD
Deep reinforcement learning represents the Q-
function with a neural network, which takes a state
as input and outputs the corresponding Q-values of
actions in that state. Q-values can be any real values,
which makes it a regression task that can be optimized
with a simple squared error loss (Mnih et al., 2015):
L = \frac{1}{2}\Big[\,\underbrace{r + \gamma \cdot \max_{a'} Q(s', a')}_{\text{target}} - \underbrace{Q(s, a)}_{\text{prediction}}\,\Big]^{2} \qquad (1)
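For concreteness, the following Python sketch computes the loss of Eq. (1) for a single transition; here q_net is assumed to be a callable that maps a state to a vector of Q-values, one per action, and terminal-state handling is omitted for brevity.

```python
import numpy as np

def dql_loss(q_net, s, a, r, s_next, gamma=0.99):
    """Squared-error loss of Eq. (1) for one transition (illustrative sketch).

    q_net(state) is assumed to return a 1-D array of Q-values, one per action."""
    target = r + gamma * np.max(q_net(s_next))   # r + gamma * max_a' Q(s', a')
    prediction = q_net(s)[a]                     # Q(s, a)
    return 0.5 * (target - prediction) ** 2
```

In practice, the target term is treated as a constant when the gradient is taken, for example by computing it with a separate, periodically updated copy of the network (Mnih et al., 2015).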
An experience of a DQL agent is a tuple of <state, action, reward, next state>, hereafter written as <s, a, r, s'>. The DQL agent at state s takes the action a and moves on to the next state s', in which it receives a reward r. In each iteration, a transition experience <s, a, r, s'> is stored in a replay memory. To train the artificial neural network, uniform random samples drawn from the replay memory are used instead of the most recent transition. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum. The PER method changes the standard DQL method from random sampling to prioritized sampling (Schaul et al., 2015).
Experiences with a high TD-error, i.e., those for which the artificial neural network does not yet estimate the Q-values accurately, are more likely to be sampled. There are two variations of the PER method: proportional PER and rank-based PER. In the proportional PER method, the probability of an experience being sampled is proportional to its TD-error. In the rank-based PER method, on the other hand, the probability of an experience being sampled is inversely proportional to the rank of its TD-error in the replay memory.
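As a minimal sketch of the two schemes, assuming the absolute TD-errors of all stored experiences are available as a numpy array, the sampling probabilities could be computed as follows; the priority exponent alpha and the small constant eps are common hyperparameters from the PER literature, not values reported in this paper.

```python
import numpy as np

def sample_proportional(td_errors, batch_size, alpha=0.6, eps=1e-6):
    """Proportional PER (sketch): priority of experience i is |TD-error_i|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

def sample_rank_based(td_errors, batch_size, alpha=0.6):
    """Rank-based PER (sketch): priority of experience i is 1 / rank(i),
    where rank 1 belongs to the largest |TD-error|."""
    ranks = np.empty(len(td_errors), dtype=int)
    ranks[np.argsort(-np.abs(td_errors))] = np.arange(1, len(td_errors) + 1)
    priorities = (1.0 / ranks) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```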