to face this non-stationarity by reducing the tangential speed in all the states, turning slightly right, and then gradually raising the tangential speed again. This behavior shows how this learning approach can potentially increase the robustness to mechanical faults, thus increasing the autonomy of the robot. Further studies will focus on a detailed analysis of this aspect.
5 DISCUSSION AND FUTURE WORK
The application of RL techniques to robotics could lead to the autonomous learning of optimal solutions for many different tasks. Unfortunately, most traditional RL algorithms fail when applied to real-world problems because of the time required to find the optimal solution, the noise that affects both sensors and actuators, and the difficulty of managing continuous state spaces.
In this paper, we have described and experimentally tested a novel algorithm (PWC-Q_LB-learning) designed to overcome the main issues that arise when learning is applied to real-world robotic tasks. PWC-Q_LB-learning computes a lower bound on the action-value function while following a piecewise constant policy. Unlike other min-max-based algorithms, PWC-Q_LB-learning does not require a model of the dynamics of the environment and avoids long, blind exploration phases. Furthermore, it does not learn the optimal policy for the theoretical worst case, but estimates the lower bound under the conditions actually experienced by the robot, according to its current policy and the current dynamics of the environment. Finally, the piecewise constant action selection and update guarantee a stable learning process in continuous state spaces, even when the discretization is such that the Markov property is lost.
Although preliminary, the experiments showed that PWC-Q_LB-learning succeeds in learning a nearly optimal policy by optimizing the behavior of a suboptimal controller in noisy continuous environments. Furthermore, it proved to be more stable than Q-learning, even when a coarse discretization of the state space is used. We are currently investigating the theoretical properties of the proposed algorithm and testing its performance on more complex robotic tasks, such as the “align to goal” and “kick” tasks.