to obtain the training samples, when the knowledge
of the system dynamics is available, consists in sampling the entire state-action space uniformly to build a training set that covers all possible situations sufficiently well. Clearly, this procedure is not possible when dealing with a real system with unknown dynamics, in which case samples can only be observed while interacting with the real system. In the sim-
plest cases, it is possible to roughly cover the whole
state space by chaining a number of random actions,
as in (Ernst et al., 2005). However, when the prob-
lem grows in complexity, the probability of executing a random sequence that drives the system to the interesting regions of the workspace may be too low for this to happen in practical time. In such cases it is neces-
sary to exploit the knowledge already obtained with
previous interactions (Riedmiller, 2005a; Ernst et al.,
2005).
It has to be noted that the need to exploit what has been learned so far introduces a tendency to experience the most promising states much more often than others, and this systematically produces a very biased sampling that aggravates the perturbing effect of non-local updating pointed out before. In (Riedmiller, 2005a), this problem is avoided by ensuring that all data points are used for updating the same number of times. This is made possible by re-
membering a dense enough set of transitions and per-
forming full updates in batch mode. In fact this is a
common trait of all fitted value iteration algorithms.
However, this approach is computationally intensive, since all data points are used a large number of times until convergence is
reached. A more efficient approach would result if,
instead of retraining with old data in batch, an incremental update could be achieved in which the perturbing effect of new samples on old estimates is attenuated.
In the present work, we address the problem of biased sampling with incremental updating. In our approach, we take into account how often each region of the domain has been visited, updating more locally those regions that are more densely sampled. To do this, we need an estimate of the sampling density, for which we use a Gaussian Mixture Model (GMM) representing the probability density of samples in the joint space of states, actions, and Q-values. At the
same time, this density estimation can be used as a
means of function approximation for the Q-function.
Density estimation is receiving increasing interest in the field of machine learning (Bishop, 2006), since it retains all the information contained in the data; that is, it provides estimates not only of the expected function value, but also of its uncertainty.
Although density estimation is more demanding than simple function approximation (since it embodies more information), its use for function approximation has been advocated by different authors (Figueiredo, 2000; Ghahramani and Jordan, 1994), noting that simple and well-understood tools such as the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) can be used to obtain accurate estimates of the density function.
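To illustrate the idea (this is only a sketch, not the on-line algorithm developed in Sections 4-6), the following Python fragment fits a GMM to toy joint samples (s, a, q) with scikit-learn's batch EM and then, for a query point (s, a), computes both the marginal density p(s, a), which measures how densely that region has been sampled, and the conditional mean E[q | s, a], which serves as the Q-function approximation. The toy data, the number of components, and the helper q_estimate are illustrative assumptions.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy joint samples: columns are [state, action, q-value] (1-D state and action here).
s = rng.uniform(-1.0, 1.0, 500)
a = rng.uniform(-1.0, 1.0, 500)
q = np.sin(3.0 * s) * a + 0.05 * rng.standard_normal(500)  # stand-in for observed q targets
data = np.column_stack([s, a, q])

# Batch EM fit; an on-line EM variant would replace this step.
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0).fit(data)

def q_estimate(state, action, gmm=gmm):
    # Returns (E[q | s, a], p(s, a)) under the fitted GMM.
    x = np.array([state, action])
    d = x.size                              # first d dimensions are (s, a), the last one is q
    w, means, covs = gmm.weights_, gmm.means_, gmm.covariances_
    resp = np.empty(len(w))
    cond_mean = np.empty(len(w))
    for k in range(len(w)):
        mu_x, mu_q = means[k, :d], means[k, d]
        S_xx, S_qx = covs[k][:d, :d], covs[k][d, :d]
        # responsibility of component k for the query point (s, a)
        resp[k] = w[k] * multivariate_normal.pdf(x, mean=mu_x, cov=S_xx)
        # mean of q for component k, conditioned on (s, a)
        cond_mean[k] = mu_q + S_qx @ np.linalg.solve(S_xx, x - mu_x)
    p_sa = resp.sum()                       # marginal density: how densely this region is sampled
    return float(np.dot(resp / p_sa, cond_mean)), float(p_sa)

q_hat, density = q_estimate(0.3, -0.5)

The conditional mean is obtained from the standard Gaussian conditioning formulas applied component-wise and weighted by the responsibilities of the components at the query point.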
The rest of the paper is organized as follows: Section 2 briefly reviews the basics of RL. Section 3 in-
troduces the concepts of GMM for multivariate den-
sity estimation, and the EM algorithm in its batch ver-
sion. In Section 4 we define the on-line EM algorithm
for the GMM. In Section 5, we present our approach
to deal with biased sampling. In Section 6 we develop
our RL algorithm using density estimation of the Q-
value function, involving action evaluation and action
selection. Section 7 describes the test control applica-
tion to show the feasibility of the approach. We con-
clude in Section 8 with a discussion of the proposed
approach.
2 THE REINFORCEMENT
LEARNING PARADIGM
In the RL paradigm, an agent must improve its per-
formance by selecting actions that maximize the ac-
cumulation of rewards provided by the environment
(Sutton and Barto, 1998). At each time step, the agent
observes the current state s_t and chooses an action a_t according to its policy a = π(s). The environment changes to state s_{t+1} in response to this action, and produces an instantaneous reward r(s_t, a_t). The agent
must experiment by interacting with the environment
in order to find the optimal action policy from the out-
come of its past experiences. One of the most pop-
ular algorithms used in RL is Q-Learning (Watkins
and Dayan, 1992), which uses an action-value func-
tion Q(s,a) to estimate the maximum expected future
cumulative reward that can be obtained by executing
action a in situation s and acting optimally thereafter.
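As a minimal illustration (a sketch only, assuming a small discrete state-action space, an ε-greedy policy, and hypothetical hyperparameters, not the approach proposed in this paper), a tabular version of action selection and of the standard Q-learning update based on the sampled target defined in Eq. (1) below could look like:

import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))          # tabular action-value estimates Q(s, a)
alpha, gamma, epsilon = 0.1, 0.95, 0.1       # step size, discount factor, exploration rate (assumed values)
rng = np.random.default_rng(0)

def select_action(s):
    # epsilon-greedy policy derived from the current Q estimates
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    # sampled Bellman target: q(s_t, a_t) = r(s_t, a_t) + gamma * max_a Q(s_{t+1}, a), see Eq. (1)
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward the sampled target

# one experienced transition (s_t, a_t, r, s_{t+1}) observed while interacting
q_update(s=2, a=select_action(2), r=0.5, s_next=3)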
Q-learning uses a sampled version of the Bellman op-
timality equations (Bellman and Dreyfus, 1962) to es-
timate instantaneous q values,
q(s_t, a_t) = r(s_t, a_t) + γ max_a Q(s_{t+1}, a)    (1)
where max_a Q(s_{t+1}, a) is the estimated maximum cumulative reward corresponding to the next observed situation s_{t+1}
, and γ is a discount factor, with values in
[0,1] that regulates the importance of future rewards
with respect to immediate ones. At a given stage of