is a reinforcement function which tells the robot how
good or bad it has performed, but nothing about the
set of actions it should have carried out. Through
a stochastic exploration of the environment, the robot must find a control policy – the action to be executed in each state – which maximises the expected total reinforcement it will receive:
E\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right] \qquad (1)

where r_t is the reinforcement received at time t, and γ ∈ [0,1] is a discount factor which adjusts the relative significance of long-term versus short-term rewards.
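As a simple numerical illustration of Eq. 1 (a hypothetical Python sketch with made-up rewards, not part of the original system), the discounted return of a finite reward sequence can be computed as follows:

# Minimal sketch: discounted return of Eq. 1 for a finite reward
# sequence (hypothetical values, for illustration only).
def discounted_return(rewards, gamma=0.95):
    """Sum of gamma**t * r_t over the recorded rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps without failure, then a negative reinforcement.
print(discounted_return([0.0, 0.0, 0.0, -1.0], gamma=0.95))  # ~ -0.857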
Q-learning (Watkins, 1989) is one of the most popular reinforcement learning algorithms, although it might be slow when rewards occur infrequently. Eligibility traces (Watkins, 1989) expedite learning by adding more memory to the system. One problem with these algorithms is their dependence on the parameters used, which usually have to be set through a trial-and-error process.
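For reference, the standard tabular Q-learning update mentioned above (a generic textbook sketch, not the algorithm proposed in this work) can be written as follows, with the usual learning rate alpha and discount factor gamma:

# Standard tabular Q-learning update (Watkins, 1989), shown only for
# comparison with the learning rule proposed below. Q is a dict mapping
# (state, action) -> value; alpha and gamma are the usual parameters.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])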
In this work we present a new learning algorithm based on reinforcement. Our algorithm provides a prediction of how long the robot will be able to move before it makes a mistake. This results in clear and readable systems where it is easy to detect, for example, when the learning is not evolving properly: basically, a high discrepancy between the predicted time before failure and the one actually observed on the real robot. Another advantage of our learning proposal is that it is almost parameterless, so it minimises the adjustments needed when the robot operates in a different environment or performs a different task. The only parameter needed is a learning rate, which is not only easy to set, but often takes the same value regardless of the task to be learnt.
Since we wish to use the experience of each state
transition to improve the robot control policy in real
time, we shall apply Q-learning, but redefining the
utility function of states and actions. Q(s,a) will be
the expected time interval before a robot failure when
the robot starts moving in s, performs action a, and
follows the best possible control policy thereafter:
Q(s,a) = E\left[ -e^{-Tbf(s_0 = s,\; a_0 = a)/50T} \right], \qquad (2)
where Tbf(s_0, a_0) represents the expected time interval (in seconds) before the robot does something wrong, when it executes a in s and then follows the best possible control policy. T is the control period of the robot (expressed in seconds). The term -e^{-Tbf/50T} in Eq. 2 is a continuous function that takes values in the interval [-1, 0] and varies smoothly as the expected time before failure increases.
Since Q(s,a) and Tbf(s,a) are not known, we can only refer to their current estimations Q_t(s,a) and Tbf_t(s,a):

Tbf_t(s,a) = -50 \, T \, \ln\left( -Q_t(s,a) \right). \qquad (3)
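To make the mapping of Eqs. 2 and 3 concrete, the following Python sketch converts between Q-values and estimated times before failure (the function names and the value of the control period T are assumptions chosen only for illustration):

import math

# Mapping between Q-values and estimated time before failure (Tbf),
# following Eqs. 2 and 3. T is the robot control period in seconds
# (the value below is an assumption for illustration only).
T = 0.1

def q_from_tbf(tbf):
    """Q-value associated with an expected time before failure (s)."""
    return -math.exp(-tbf / (50.0 * T))

def tbf_from_q(q):
    """Inverse mapping of Eq. 3: Tbf in seconds from a Q-value."""
    return -50.0 * T * math.log(-q)

# A Q-value of -1 corresponds to an immediate failure (Tbf = 0 s),
# while Q-values close to 0 correspond to very long times before failure.
print(tbf_from_q(q_from_tbf(10.0)))  # ~ 10.0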
The definition of Q(s,a), Tbf, and the best possible control policy determine the relationship between the Q-values corresponding to consecutive states:
Tbf_t(s_t, a_t) =
\begin{cases}
T & \text{if } r_t < 0 \\
T + \max_{a} \left\{ Tbf_t(s_{t+1}, a) \right\} & \text{otherwise}
\end{cases}
\qquad (4)
r_t is the reinforcement the robot receives when it executes action a_t in state s_t. Combining Eq. 3 and Eq. 4, it follows that:
Q_{t+1}(s,a) =
\begin{cases}
-e^{-1/50} & \text{if } r_t < 0 \\
Q_t(s_t, a_t) + \delta & \text{otherwise}
\end{cases}
\qquad (5)
where

\delta = \beta \left( e^{-1/50} \, \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right). \qquad (6)
β ∈ [0,1] is a learning rate, and it is the only pa-
rameter whose value has to be set by the user.
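A minimal sketch of the resulting update step (Eqs. 5 and 6), assuming a tabular Q-value store and a discrete action set, could look as follows; the function name and the default learning rate are illustrative assumptions, not taken from the original work:

import math

# One update of the proposed Q function (Eqs. 5 and 6).
# Q is a dict mapping (state, action) -> value in [-1, 0];
# beta is the only user-set parameter (learning rate).
def update_q(Q, s, a, r, s_next, actions, beta=0.3):
    if r < 0:
        # A failure occurred: the expected time before failure is one
        # control period, i.e. Q = -e^(-T/50T) = -e^(-1/50).
        Q[(s, a)] = -math.exp(-1.0 / 50.0)
    else:
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        delta = beta * (math.exp(-1.0 / 50.0) * best_next - Q[(s, a)])
        Q[(s, a)] += delta

Provided the Q-values are initialised within [-1, 0], this update keeps them in the interval defined by Eq. 2, so Eq. 3 can be applied at any time to read out the predicted time before failure.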
3 PERCEPTION LEARNING
In reinforcement learning the state space definition is a key factor in achieving good learning times. The state space must be fine enough to distinguish the different situations the robot might encounter, but at the same time it must remain small enough to avoid the curse of dimensionality.
The design of the state space is a delicate task,
and it is dependent on the problem the robot has to
solve. We propose a dynamic creation of the state
space as the robot explores the environment (Fig. 1).
For this task we have chosen to use a Fuzzy ART artificial neural network (Carpenter et al., 1991). This kind of network is able to perform an unsupervised online classification of the input patterns without any previous knowledge.
Of the three parameters involved in the Fuzzy ART algorithm – α, β and ρ, the latter usually called the vigilance parameter – the most important is ρ. α and β are almost independent of the task to be solved, but the value of ρ will influence the number of states created. If ρ is too high, the Fuzzy ART will create too many classes; if it is too low, the state representation will be too coarse and the system will suffer from perceptual aliasing, resulting in longer learning times or even the impossibility of achieving convergence.
Due to space restrictions we cannot provide more details of the Fuzzy ART algorithm here; further information can be found in (Carpenter et al., 1991).
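As a rough illustration of how states can be created online, the following is a generic sketch of the standard Fuzzy ART operations with complement coding (based on Carpenter et al., 1991), not the exact implementation used in this work; the function name and default parameter values are assumptions:

import numpy as np

# One Fuzzy ART presentation: classify input x (features scaled to [0,1])
# and update or create a category. alpha, beta and rho are the choice,
# learning-rate and vigilance parameters mentioned in the text.
def fuzzy_art_step(x, weights, alpha=0.001, beta=1.0, rho=0.9):
    """Return the index of the selected category (a new one if needed)."""
    I = np.concatenate([x, 1.0 - x])           # complement coding
    candidates = list(range(len(weights)))
    while candidates:
        # Choice function T_j = |I ^ w_j| / (alpha + |w_j|)
        scores = [np.minimum(I, weights[j]).sum() / (alpha + weights[j].sum())
                  for j in candidates]
        j = candidates[int(np.argmax(scores))]
        # Vigilance test: |I ^ w_j| / |I| >= rho
        if np.minimum(I, weights[j]).sum() / I.sum() >= rho:
            weights[j] = beta * np.minimum(I, weights[j]) + (1 - beta) * weights[j]
            return j
        candidates.remove(j)                   # reset this category, try the next
    weights.append(I.copy())                   # no category matches: create one
    return len(weights) - 1

# Example: start with no categories; each normalised sensor reading
# either refines an existing state or creates a new one.
weights = []
state = fuzzy_art_step(np.array([0.2, 0.8]), weights, rho=0.9)

With β = 1 (fast learning), a category immediately commits to the intersection of the input and its weights, while ρ controls how many states are finally created, as discussed above.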