ACKNOWLEDGEMENTS
Raphael Fonteneau acknowledges the financial support of the FRIA (Fund for Research in Industry and Agriculture). Damien Ernst is a research associate of the FRS-FNRS. This paper presents research results of the Belgian Network BIOMAGNET (Bioinformatics and Modeling: from Genomes to Networks), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.
APPENDIX
8.1 Proof of Lemma 4.1
Before proving Lemma 4.1 in Section 8.1.2, we prove in Section 8.1.1 a preliminary result related to the Lipschitz continuity of the state-action value functions.

8.1.1 Lipschitz Continuity of the N-stage State-action Value Functions
For $N = 1,\ldots,T$, let us define the family of state-action value functions $Q_N^{u_0,\ldots,u_{T-1}} : X \times U \to \mathbb{R}$ as follows:
$$Q_N^{u_0,\ldots,u_{T-1}}(x,u) = \rho(x,u) + \sum_{t=T-N+1}^{T-1} \rho(x_t,u_t),$$
where $x_{T-N+1} = f(x,u)$. The value $Q_N^{u_0,\ldots,u_{T-1}}(x,u)$ gives the sum of rewards from instant $t = T-N$ to instant $T-1$ when (i) the system is in state $x$ at instant $T-N$, (ii) the action chosen at instant $T-N$ is $u$, and (iii) the actions chosen at instants $t > T-N$ are $u_t$. The function $J^{u_0,\ldots,u_{T-1}}$ can be deduced from $Q_N^{u_0,\ldots,u_{T-1}}$ as follows:
$$\forall x \in X, \quad J^{u_0,\ldots,u_{T-1}}(x) = Q_T^{u_0,\ldots,u_{T-1}}(x,u_0). \qquad (2)$$
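As an illustration, the computation of $Q_N^{u_0,\ldots,u_{T-1}}(x,u)$ can be sketched in Python by simulating the deterministic system forward, assuming a dynamics f, a reward function rho, a horizon T and an open-loop action sequence actions are available (all names below are illustrative placeholders, not part of the original formulation):

    def q_n(x, u, N, T, actions, f, rho):
        # Sum of rewards from instant T-N to instant T-1 when the system is
        # in state x at instant T-N, the first action is u, and the actions
        # chosen at later instants t are actions[t].
        total = rho(x, u)          # reward collected at instant T-N
        x_t = f(x, u)              # x_{T-N+1} = f(x, u)
        for t in range(T - N + 1, T):
            total += rho(x_t, actions[t])
            x_t = f(x_t, actions[t])   # advance the deterministic system
        return total

    # By Equation (2), the T-stage function recovers the return of the full
    # open-loop action sequence: J(x) = q_n(x, actions[0], T, T, actions, f, rho).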