to obtain the training samples, when the knowledge
of the system dynamics is available, consists in sampling the entire state-action space uniformly to build a training set that covers all possible situations sufficiently well. Clearly, this procedure is not possible when dealing with a real system with unknown dynamics, in which case samples can only be observed while interacting with the real system. In the sim-
plest cases, it is possible to roughly cover the whole
state space by chaining a number of random actions,
as in (Ernst et al., 2005). However, when the prob-
lem grows in complexity, the probability of executing a random sequence that drives the system to the interesting regions of the workspace may be too low for this to happen in practical time. In such cases it is neces-
sary to exploit the knowledge already obtained with
previous interactions (Riedmiller, 2005a; Ernst et al.,
2005).
It has to be noted that the need to exploit what has been learned so far introduces a tendency to experience the most promising states much more often than others, and this systematically produces a very biased sampling that aggravates the perturbing effect of non-local updating pointed out before. In (Riedmiller, 2005a), this problem is avoided by ensuring that all data points are used for updating the same number of times. This is made possible by re-
membering a dense enough set of transitions and per-
forming full updates in batch mode. In fact this is a
common trait of all fitted value iteration algorithms.
However, this approach is computationally intensive, since all data points are used a large number of times until convergence is
reached. A more efficient approach would result if,
instead of retraining with old data in batch, an incremental update could be achieved in which the perturbing effect of new samples on old estimates is attenuated.
In the present work, we address the problem of biased sampling with incremental updating. In our approach, we take into account how often each region of the domain has been visited, updating more locally those regions that are more densely sampled. To do this, we need an estimate of the sampling density, for which we use a Gaussian Mixture Model (GMM) representing the probability density of samples in the joint space of states, actions, and Q-values. At the
same time, this density estimation can be used as a
means of function approximation for the Q-function.
Density estimation is receiving increasing interest in the field of machine learning (Bishop, 2006), since it retains all the information contained in the data; that is, it provides estimates not only of the expected function value, but also of its uncertainty.
Although density estimation is more demanding than simple function approximation (since it embodies more information), its use for function approximation has been advocated by different authors (Figueiredo, 2000; Ghahramani and Jordan, 1994), noting that simple and well-understood tools such as the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) can be used to obtain accurate estimates of the density function.
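To illustrate the idea (this is only a sketch, not the on-line algorithm developed in Sections 4-6), the following Python fragment fits a GMM to toy joint samples (s, a, q) with scikit-learn's batch EM and then, for a query point (s, a), computes both the marginal density p(s, a), which measures how densely that region has been sampled, and the conditional mean E[q | s, a], which serves as the Q-function approximation. The toy data, the number of components, and the helper q_estimate are illustrative assumptions.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy joint samples: columns are [state, action, q-value] (1-D state and action here).
s = rng.uniform(-1.0, 1.0, 500)
a = rng.uniform(-1.0, 1.0, 500)
q = np.sin(3.0 * s) * a + 0.05 * rng.standard_normal(500)  # stand-in for observed q targets
data = np.column_stack([s, a, q])

# Batch EM fit; an on-line EM variant would replace this step.
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0).fit(data)

def q_estimate(state, action, gmm=gmm):
    # Returns (E[q | s, a], p(s, a)) under the fitted GMM.
    x = np.array([state, action])
    d = x.size                              # first d dimensions are (s, a), the last one is q
    w, means, covs = gmm.weights_, gmm.means_, gmm.covariances_
    resp = np.empty(len(w))
    cond_mean = np.empty(len(w))
    for k in range(len(w)):
        mu_x, mu_q = means[k, :d], means[k, d]
        S_xx, S_qx = covs[k][:d, :d], covs[k][d, :d]
        # responsibility of component k for the query point (s, a)
        resp[k] = w[k] * multivariate_normal.pdf(x, mean=mu_x, cov=S_xx)
        # mean of q for component k, conditioned on (s, a)
        cond_mean[k] = mu_q + S_qx @ np.linalg.solve(S_xx, x - mu_x)
    p_sa = resp.sum()                       # marginal density: how densely this region is sampled
    return float(np.dot(resp / p_sa, cond_mean)), float(p_sa)

q_hat, density = q_estimate(0.3, -0.5)

The conditional mean is obtained from the standard Gaussian conditioning formulas applied component-wise and weighted by the responsibilities of the components at the query point.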
The rest of the paper is organized as follows: Section 2 briefly reviews the basics of RL. Section 3 in-
troduces the concepts of GMM for multivariate den-
sity estimation, and the EM algorithm in its batch ver-
sion. In Section 4 we define the on-line EM algorithm
for the GMM. In Section 5, we present our approach
to deal with biased sampling. In Section 6 we develop
our RL algorithm using density estimation of the Q-
value function, involving action evaluation and action
selection. Section 7 describes the test control applica-
tion to show the feasibility of the approach. We con-
clude in Section 8 with a discussion of the proposed
approach.
2 THE REINFORCEMENT
LEARNING PARADIGM
In the RL paradigm, an agent must improve its per-
formance by selecting actions that maximize the ac-
cumulation of rewards provided by the environment
(Sutton and Barto, 1998). At each time step, the agent
observes the current state s_t and chooses an action a_t according to its policy a = π(s). The environment changes to state s_{t+1} in response to this action, and produces an instantaneous reward r(s_t, a_t). The agent
must experiment by interacting with the environment
in order to find the optimal action policy from the out-
come of its past experiences. One of the most pop-
ular algorithms used in RL is Q-Learning (Watkins
and Dayan, 1992), which uses an action-value func-
tion Q(s,a) to estimate the maximum expected future
cumulative reward that can be obtained by executing
action a in situation s and acting optimally thereafter.
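As a minimal illustration (a sketch only, assuming a small discrete state-action space, an ε-greedy policy, and hypothetical hyperparameters, not the approach proposed in this paper), a tabular version of action selection and of the standard Q-learning update based on the sampled target defined in Eq. (1) below could look like:

import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))          # tabular action-value estimates Q(s, a)
alpha, gamma, epsilon = 0.1, 0.95, 0.1       # step size, discount factor, exploration rate (assumed values)
rng = np.random.default_rng(0)

def select_action(s):
    # epsilon-greedy policy derived from the current Q estimates
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    # sampled Bellman target: q(s_t, a_t) = r(s_t, a_t) + gamma * max_a Q(s_{t+1}, a), see Eq. (1)
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward the sampled target

# one experienced transition (s_t, a_t, r, s_{t+1}) observed while interacting
q_update(s=2, a=select_action(2), r=0.5, s_next=3)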
Q-learning uses a sampled version of the Bellman op-
timality equations (Bellman and Dreyfus, 1962) to es-
timate instantaneous q values,
q(s_t, a_t) = r(s_t, a_t) + γ max_a Q(s_{t+1}, a)    (1)
where max_a Q(s_{t+1}, a) is the estimated maximum cumulative reward corresponding to the next observed situation s_{t+1}
, and γ is a discount factor, with values in
[0,1] that regulates the importance of future rewards
with respect to immediate ones. At a given stage of