A CAUTIOUS APPROACH TO GENERALIZATION IN
REINFORCEMENT LEARNING
Raphael Fonteneau¹, Susan A. Murphy², Louis Wehenkel¹ and Damien Ernst¹
¹Department of Electrical Engineering and Computer Science, University of Liège, Belgium
²Department of Statistics, University of Michigan, USA
Keywords:
Reinforcement learning, Prior knowledge, Cautious generalization.
Abstract:
In the context of a deterministic Lipschitz continuous environment over continuous state spaces, finite action spaces, and a finite optimization horizon, we propose an algorithm of polynomial complexity which exploits weak prior knowledge about its environment for computing, from a given sample of trajectories and for a given initial state, a sequence of actions. The proposed Viterbi-like algorithm maximizes a recently proposed lower bound on the return depending on the initial state, and uses to this end prior knowledge about the environment provided in the form of upper bounds on its Lipschitz constants. It thereby avoids, in a way depending on the initial state and on the prior knowledge, those regions of the state space where the sample is too sparse to make safe generalizations. Our experiments show that it can lead to more cautious policies than algorithms combining dynamic programming with function approximators. We also give a condition on the sample sparsity ensuring that, for a given initial state, the proposed algorithm produces an optimal sequence of actions in open-loop.
1 INTRODUCTION
Since the late sixties, the field of Reinforcement
Learning (RL) (Sutton and Barto, 1998) has studied
the problem of inferring from the sole knowledge of
observed system trajectories, near-optimal solutions
to optimal control problems. The original motivation
was to design computational agents able to learn by
themselves how to interact in a rational way with their
environment. The techniques developed in this field
have appealed to researchers trying to solve sequential decision making problems in many fields such as Finance (Ingersoll, 1987), Medicine (Murphy, 2003; Murphy, 2005) or Engineering (Riedmiller, 2005).
RL algorithms are challenged when dealing with
large or continuous state spaces. Indeed, in such cases
they have to generalize the information contained in a
generally sparse sample of trajectories. The dominat-
ing approach for generalizing this information is to
combine RL algorithms with function approximators
(Bertsekas and Tsitsiklis, 1996; Lagoudakis and Parr,
2003; Ernst et al., 2005). Usually, these approxima-
tors generalize the information contained in the sam-
ple to areas poorly covered by the sample by implic-
itly assuming that the properties of the system in those
areas are similar to the properties of the system in the
nearby areas well covered by the sample. This in turn
often leads to low performance guarantees on the in-
ferred policy when large state space areas are poorly
covered by the sample. This can be explained by the
fact that when computing the performance guarantees
of these policies, one needs to take into account that
they may actually drive the system into the poorly
visited areas to which the generalization strategy as-
sociates a favorable environment behavior, while the
environment may actually be particularly adversarial
in those areas. This is corroborated by theoretical re-
sults which show that the performance guarantees of
the policies inferred by these algorithms degrade with
the sample sparsity where, loosely speaking, the spar-
sity can be seen as the radius of the largest non-visited
state space area.¹
¹ Usually, these theoretical results do not give lower bounds per se, but rather a distance between the return of the inferred policy and the optimal return. However, by adapting the proofs behind these results in a straightforward way, it is often possible to get a bound on the distance between the estimate of the return of the inferred policy computed by the RL algorithm and its actual return and, from there, a lower bound on the return of the inferred policy.

We propose an algorithm for reinforcement learning that derives action sequences which tend to avoid
regions where the performance is uncertain given the
available dataset and weak prior knowledge. This weak
prior knowledge is given in the form of upper bounds
on the Lipschitz constants of the environment. To
compute these sequences of actions, the algorithm ex-
ploits a lower bound on the performance of the agent
when it uses a certain open-loop sequence of actions
while starting from a given initial condition, as pro-
posed in (Fonteneau et al., 2009). More specifically,
it computes an open-loop sequence of actions to be
used from a given initial state to maximize that lower
bound. To this end we derive a Viterbi-like algorithm,
of polynomial computational complexity in the size
of the dataset and the optimization horizon. Our algo-
rithm does not rely on function approximators and it
computes, as a byproduct, a lower bound on the return
of its open-loop sequence of decisions. It essentially
adopts a cautious approach to generalization in the
sense that it computes decisions that avoid driving the
system into areas of the state space that are not well
enough covered by the available dataset, according to
the prior information about the dynamics and reward
function. Our algorithm, named CGRL for Cautious Generalization (oriented) Reinforcement Learning, assumes a finite action space and a deterministic dynamics and reward function, and it is formulated for
finite time-horizon problems. We also provide a con-
dition on the sample sparsity ensuring that, for a given
initial state, the proposed algorithm produces an opti-
mal sequence of actions in open-loop, and we suggest
directions for extending our approach to a larger class
of problems in RL.
The rest of the paper is organized as follows. Sec-
tion 2 briefly discusses related work. In Section 3,
we formalize the inference problem we consider. In
Section 4, we adapt the results of (Fonteneau et al.,
2009) to compute a lower bound on the return of an
open-loop sequence of actions. Section 5 proposes
a polynomial algorithm for inferring a sequence of
actions maximizing this lower bound, and states a condition on the sample sparsity for its opti-
mality. Section 6 illustrates the features of the pro-
posed algorithm and Section 7 discusses its interest,
while Section 8 concludes.
2 RELATED WORK
The CGRL algorithm outputs sequences of decisions
that, given the prior knowledge it has about its en-
vironment in terms of upper bounds on its Lipschitz
constants, are likely to drive the agent only towards
areas well enough covered by the sample. Heuristic
strategies have already been proposed in the RL lit-
erature to infer policies that exhibit such a conserva-
tive behavior. By way of example, some of these
strategies associate high negative rewards to trajec-
tories falling outside of the well covered areas. The
CGRL algorithm can be seen as a min-max approach
to solve the generalization task which exploits in a
rational way prior knowledge in the form of upper
bounds on its Lipschitz constants. Other works in RL have already developed min-max strategies when the environment behavior is partially unknown. However, these strategies usually consider problems with finite state spaces where the uncertainties come from the lack of knowledge of the transition probabilities (Delage and Mannor, 2006; Csáji and Monostori, 2008).
In model predictive control (MPC) where the envi-
ronment is supposed to be fully known (Ernst et al.,
2009), min-max approaches have been used to deter-
mine the optimal sequence of actions with respect to
the “worst case” disturbance sequence occurring (Bemporad and Morari, 1999). The CGRL algorithm re-
lies on a methodology for computing a lower bound
on the return in a deterministic setting with a mostly
unknown environment. In this, it is related to works in
the field of RL which try to get from a sample of tra-
jectories lower bounds on the returns of inferred poli-
cies (Mannor et al., 2004; Qian and Murphy, 2009).
3 PROBLEM STATEMENT
We consider a discrete-time system whose dynamics over $T$ stages is described by a time-invariant equation
\[
x_{t+1} = f(x_t, u_t), \qquad t = 0, 1, \ldots, T-1,
\]
where for all $t$, the state $x_t$ is an element of the normed vector state space $X$ and $u_t$ is an element of the finite (discrete) action space $U$. $T \in \mathbb{N}_0$ is referred to as the optimization horizon. An instantaneous reward $r_t = \rho(x_t, u_t) \in \mathbb{R}$ is associated with the action $u_t$ taken while being in state $x_t$. For every initial state $x$ and for every sequence of actions $(u_0, \ldots, u_{T-1}) \in U^T$, the cumulated reward over $T$ stages (also named return over $T$ stages) is defined as
\[
J^{u_0, \ldots, u_{T-1}}(x) = \sum_{t=0}^{T-1} \rho(x_t, u_t) .
\]
We assume that the system dynamics $f$ and the reward function $\rho$ are Lipschitz continuous, i.e. that there exist finite constants $L_f, L_\rho \in \mathbb{R}$ such that $\forall x, x' \in X$, $\forall u \in U$,
\[
\| f(x,u) - f(x',u) \| \le L_f \| x - x' \| ,
\qquad
| \rho(x,u) - \rho(x',u) | \le L_\rho \| x - x' \| .
\]
We further suppose that: (i) the system dynamics $f$ and the reward function $\rho$ are unknown, (ii) a set of one-step transitions $\mathcal{F} = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{|\mathcal{F}|}$ is known, where each one-step transition is such that $y^l = f(x^l, u^l)$ and $r^l = \rho(x^l, u^l)$, (iii) $\forall a \in U, \exists (x,u,r,y) \in \mathcal{F} : u = a$ (each action $a \in U$ appears at least once in $\mathcal{F}$) and (iv) two constants $L_f$ and $L_\rho$ satisfying the above-written inequalities are known.²

² These constants do not necessarily have to be the smallest ones satisfying these inequalities (i.e., the Lipschitz constants); however, the smaller they are, the higher the lower bound on the return of the policy output by the CGRL algorithm will be.
Let $J^*(x) = \max_{(u_0, \ldots, u_{T-1}) \in U^T} J^{u_0, \ldots, u_{T-1}}(x)$. An optimal sequence of actions $u^*_0(x), \ldots, u^*_{T-1}(x)$ is such that $J^{u^*_0(x), \ldots, u^*_{T-1}(x)}(x) = J^*(x)$. The goal is to compute, for any initial state $x \in X$, a sequence of actions $(\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)) \in U^T$ such that $J^{\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)}(x)$ is as close as possible to $J^*(x)$.
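To make the setting concrete, the return of an open-loop action sequence can be written as a short function. The following Python sketch is ours (the paper contains no code); `f` and `rho` stand for the unknown system dynamics and reward function, which the algorithm of this paper never calls directly and only accesses through the sample $\mathcal{F}$.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def t_stage_return(x0: State,
                   actions: Sequence[Action],
                   f: Callable[[State, Action], State],
                   rho: Callable[[State, Action], float]) -> float:
    """J^{u_0,...,u_{T-1}}(x0) = sum_{t=0}^{T-1} rho(x_t, u_t), with x_{t+1} = f(x_t, u_t)."""
    x, total = x0, 0.0
    for u in actions:
        total += rho(x, u)   # instantaneous reward r_t
        x = f(x, u)          # deterministic one-step transition
    return total
```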
4 LOWER BOUND ON THE
RETURN OF A GIVEN
SEQUENCE OF ACTIONS
In this section, we present a method for computing, from a given initial state, a dataset of transitions, and weak prior knowledge about the environment, a maximal lower bound on the $T$-stage return of a given sequence of actions $u_0, \ldots, u_{T-1}$. The method is adapted from (Fonteneau et al., 2009). In the following, we denote by $\mathcal{F}^T_{u_0, \ldots, u_{T-1}}$ the set of all sequences of one-step system transitions $[(x^{l_0}, u^{l_0}, r^{l_0}, y^{l_0}), \ldots, (x^{l_{T-1}}, u^{l_{T-1}}, r^{l_{T-1}}, y^{l_{T-1}})]$ that may be built from elements of $\mathcal{F}$ and that are compatible with $u_0, \ldots, u_{T-1}$, i.e. for which $u^{l_t} = u_t$, $\forall t \in \{0, \ldots, T-1\}$. First, we compute a lower bound on the return of the sequence $u_0, \ldots, u_{T-1}$ from any given element $\tau$ of $\mathcal{F}^T_{u_0, \ldots, u_{T-1}}$. This lower bound $B(\tau, x)$ is made of the sum of the $T$ rewards corresponding to $\tau$ ($\sum_{t=0}^{T-1} r^{l_t}$) and of $T$ negative terms. Every negative term is associated with a one-step transition. More specifically, the negative term corresponding to the transition $(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})$ of $\tau$ represents an upper bound on the variation of the cumulated rewards over the remaining time steps that can occur by simulating the system from the state $x^{l_t}$ rather than $y^{l_{t-1}}$ (with $y^{l_{-1}} = x$). By maximizing $B(\tau, x)$ over $\mathcal{F}^T_{u_0, \ldots, u_{T-1}}$, we obtain a maximal lower bound on the return of $u_0, \ldots, u_{T-1}$ whose tightness can be characterized in terms of the sample sparsity.
4.1 Computing a Bound from a Given
Sequence of One-Step Transitions
We have the following lemma.

Lemma 4.1. Let $u_0, \ldots, u_{T-1}$ be a sequence of actions. Let $\tau = [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1} \in \mathcal{F}^T_{u_0, \ldots, u_{T-1}}$. Then
\[
J^{u_0, \ldots, u_{T-1}}(x) \ge B(\tau, x) ,
\]
with
\[
B(\tau, x) \doteq \sum_{t=0}^{T-1} \Big[ r^{l_t} - L_{Q_{T-t}} \| y^{l_{t-1}} - x^{l_t} \| \Big] ,
\qquad y^{l_{-1}} = x ,
\qquad L_{Q_{T-t}} = L_\rho \sum_{i=0}^{T-t-1} (L_f)^i .
\]
The proof is given in Appendix 8.1. The lower bound on $J^{u_0, \ldots, u_{T-1}}(x)$ derived in this lemma can be interpreted as follows. The sum of the rewards of the “broken” trajectory formed by the sequence of one-step system transitions $\tau$ can never be greater than $J^{u_0, \ldots, u_{T-1}}(x)$, provided that every reward $r^{l_t}$ is penalized by a factor $L_{Q_{T-t}} \| y^{l_{t-1}} - x^{l_t} \|$. This factor is in fact an upper bound on the variation of the $(T-t)$-state-action value function (see Appendix 8.1) that can occur when “jumping” from $(y^{l_{t-1}}, u_t)$ to $(x^{l_t}, u_t)$. An illustration of this is given in Figure 1.
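The bound of Lemma 4.1 is cheap to evaluate once the constants $L_{Q_N}$ are computed. The sketch below is our illustration (not the authors' code): a one-step transition is represented as a tuple `(x_l, u_l, r_l, y_l)` of array-like states, and `L_f`, `L_rho` are the known upper bounds on the Lipschitz constants.

```python
import numpy as np


def L_Q(N: int, L_f: float, L_rho: float) -> float:
    """L_{Q_N} = L_rho * sum_{i=0}^{N-1} L_f^i."""
    return L_rho * sum(L_f ** i for i in range(N))


def bound_B(tau, x0, L_f: float, L_rho: float) -> float:
    """B(tau, x0): each reward r^{l_t} of tau is penalized by
    L_{Q_{T-t}} * ||y^{l_{t-1}} - x^{l_t}||, with y^{l_{-1}} = x0 (Lemma 4.1)."""
    T = len(tau)
    y_prev = np.asarray(x0, dtype=float)
    total = 0.0
    for t, (x_l, u_l, r_l, y_l) in enumerate(tau):
        total += r_l - L_Q(T - t, L_f, L_rho) * np.linalg.norm(y_prev - np.asarray(x_l, dtype=float))
        y_prev = np.asarray(y_l, dtype=float)
    return total
```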
4.2 Tightness of Highest Lower Bound
Over all Compatible Sequences of
One-Step Transitions
We define
\[
B_{u_0, \ldots, u_{T-1}}(x) = \max_{\tau \in \mathcal{F}^T_{u_0, \ldots, u_{T-1}}} B(\tau, x)
\]
and we analyze in this subsection the tightness of the lower bound $B_{u_0, \ldots, u_{T-1}}(x)$ as a function of the sample sparsity. The sample sparsity is defined as follows: let $\mathcal{F}_a = \{(x^l, u^l, r^l, y^l) \in \mathcal{F} \,|\, u^l = a\}$ ($\forall a$, $\mathcal{F}_a \ne \emptyset$ according to assumption (iii) given in Section 3) and let us suppose that $\exists \alpha \in \mathbb{R}^+$:
\[
\forall a \in U , \quad \sup_{x' \in X} \Big[ \min_{(x^l, u^l, r^l, y^l) \in \mathcal{F}_a} \| x^l - x' \| \Big] \le \alpha . \qquad (1)
\]
The smallest $\alpha$ which satisfies equation (1) is named the sample sparsity and is denoted by $\alpha^*$ (the existence of $\alpha$ and $\alpha^*$ implies that $X$ is bounded). We have the following theorem.

Theorem 4.2 (Tightness of highest lower bound). $\exists C > 0 : \forall (u_0, \ldots, u_{T-1}) \in U^T$,
\[
J^{u_0, \ldots, u_{T-1}}(x) - B_{u_0, \ldots, u_{T-1}}(x) \le C \alpha^* .
\]
Figure 1: A graphical interpretation of the different terms composing the bound on $J^{u_0, \ldots, u_{T-1}}(x)$ computed from a sequence of one-step transitions: the exact trajectory $x_0 = x, x_1, \ldots, x_T$ under $u_0, \ldots, u_{T-1}$ is compared with a compatible “broken” trajectory of transitions of $\mathcal{F}$, each reward $r^{l_t}$ being penalized by $L_{Q_{T-t}} \| y^{l_{t-1}} - x^{l_t} \|$ (with $y^{l_{-1}} = x$).
The proof of Theorem 4.2 is given in Appendix 8.2. The lower bound $B_{u_0, \ldots, u_{T-1}}(x)$ thus converges to the $T$-stage return of the sequence of actions $u_0, \ldots, u_{T-1}$ when the sample sparsity $\alpha^*$ decreases to zero.
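The sample sparsity involves a supremum over the whole (bounded) state space and can therefore only be approximated in practice. The following sketch (ours, not part of the paper) evaluates the inner minimum of inequality (1) on a finite set of test points, which yields a lower estimate of $\alpha^*$ that improves as the test points get denser; the names `F` and `test_points` are illustrative.

```python
import numpy as np


def sparsity_lower_estimate(F, test_points) -> float:
    """Approximate the sample sparsity alpha* of inequality (1) from below.

    F is a list of one-step transitions (x_l, u_l, r_l, y_l); test_points is a
    finite subset of X standing in for the supremum over the state space.
    """
    actions = {tuple(u) for (_, u, _, _) in F}
    alpha = 0.0
    for a in actions:
        # start states of the transitions labelled with action a (the set F_a)
        X_a = np.array([x for (x, u, _, _) in F if tuple(u) == a], dtype=float)
        for x_prime in np.asarray(test_points, dtype=float):
            alpha = max(alpha, float(np.min(np.linalg.norm(X_a - x_prime, axis=1))))
    return alpha
```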
5 COMPUTING A SEQUENCE OF
ACTIONS MAXIMIZING THE
HIGHEST LOWER BOUND
Let $\mathcal{B}^*(x) = \{(u_0, \ldots, u_{T-1}) \in U^T \,|\, B_{u_0, \ldots, u_{T-1}}(x) = \max_{(u'_0, \ldots, u'_{T-1}) \in U^T} B_{u'_0, \ldots, u'_{T-1}}(x)\}$. The CGRL algorithm computes for each initial state $x$ a sequence of actions $\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)$ that belongs to $\mathcal{B}^*(x)$. From what precedes, it follows that the actual return $J^{\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)}(x)$ of this sequence is lower-bounded by $\max_{(u_0, \ldots, u_{T-1}) \in U^T} B_{u_0, \ldots, u_{T-1}}(x)$. Due to the tightness of the lower bound $B_{u_0, \ldots, u_{T-1}}(x)$, the value of the return which is guaranteed will converge to the true return of the sequence of actions when $\alpha^*$ decreases to zero. Additionally, we prove in Section 5.1 that when the sample sparsity $\alpha^*$ decreases below a particular threshold, the sequence $\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)$ is optimal. To identify a sequence of actions that belongs to $\mathcal{B}^*(x)$ without computing the value $B_{u_0, \ldots, u_{T-1}}(x)$ for every sequence $u_0, \ldots, u_{T-1}$, the CGRL algorithm exploits the fact that the problem of finding an element of $\mathcal{B}^*(x)$ can be reformulated as a shortest path problem.
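For intuition only, an element of $\mathcal{B}^*(x)$ could in principle be found by exhaustive search over all length-$T$ sequences of transitions, which is exactly what the shortest path reformulation of Section 5.2 avoids. The brute-force Python sketch below is ours (it is exponential in $T$ and only usable on toy problems); transitions are tuples `(x_l, u_l, r_l, y_l)` and `L_f`, `L_rho` are the known Lipschitz bounds.

```python
import itertools
import numpy as np


def best_bound_brute_force(x0, F, T, L_f, L_rho):
    """Exhaustive computation of max_{u_0..u_{T-1}} B_{u_0..u_{T-1}}(x0) and of a
    maximizing action sequence, by enumerating all |F|^T transition sequences."""
    L_Q = [L_rho * sum(L_f ** i for i in range(N)) for N in range(T + 1)]
    best_val, best_actions = -np.inf, None
    for tau in itertools.product(F, repeat=T):
        y_prev, val = np.asarray(x0, dtype=float), 0.0
        for t, (x_l, u_l, r_l, y_l) in enumerate(tau):
            val += r_l - L_Q[T - t] * np.linalg.norm(y_prev - np.asarray(x_l, dtype=float))
            y_prev = np.asarray(y_l, dtype=float)
        if val > best_val:
            best_val, best_actions = val, tuple(u for (_, u, _, _) in tau)
    return best_actions, best_val
```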
5.1 Convergence of $\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)$ Towards an Optimal Sequence of Actions
We prove hereafter that when $\alpha^*$ gets lower than a particular threshold, the CGRL algorithm can only output optimal policies.

Theorem 5.1 (Convergence of CGRL algorithm). Let
\[
\mathcal{J}^*(x) = \{(u_0, \ldots, u_{T-1}) \in U^T \,|\, J^{u_0, \ldots, u_{T-1}}(x) = J^*(x)\} ,
\]
and let us suppose that $\mathcal{J}^*(x) \ne U^T$ (if $\mathcal{J}^*(x) = U^T$, the search for an optimal sequence of actions is indeed trivial). We define
\[
\varepsilon(x) = \min_{(u_0, \ldots, u_{T-1}) \in U^T \setminus \mathcal{J}^*(x)} \{ J^*(x) - J^{u_0, \ldots, u_{T-1}}(x) \} .
\]
Then
\[
C \alpha^* < \varepsilon(x) \implies (\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)) \in \mathcal{J}^*(x) .
\]
The proof of Theorem 5.1 is given in Appendix
8.3.
5.2 Cautious Generalization
Reinforcement Learning Algorithm
The CGRL algorithm computes an element of the set $\mathcal{B}^*(x)$ defined previously. Let $D : \mathcal{F}^T \to U^T$ be the operator that maps a sequence of one-step system transitions $\tau = [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1}$ into the sequence of actions $u^{l_0}, \ldots, u^{l_{T-1}}$. Using this operator, we can write $\mathcal{B}^*(x) = \big\{ (u_0, \ldots, u_{T-1}) \in U^T \,|\, \exists \tau \in \arg\max_{\tau \in \mathcal{F}^T} B(\tau, x) \text{ for which } D(\tau) = (u_0, \ldots, u_{T-1}) \big\}$.
Figure 2: A graphical interpretation of the CGRL algorithm (notice that $n = |\mathcal{F}|$). Each of the $T$ stages of the graph contains one node per one-step transition $(x^i, u^i, r^i, y^i)$ of $\mathcal{F}$; the arc from node $i$ at stage $t-1$ to node $j$ at stage $t$ has value $c_t(i,j) = -L_{Q_{T-t}} \| y^i - x^j \| + r^j$ (with the initial state $x$ playing the role of $y$ at stage $0$), and the CGRL solution $(\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)) = (u^{l_0}, \ldots, u^{l_{T-1}})$ corresponds to indices $(l_0, \ldots, l_{T-1}) \in \arg\max c_0(0, l_0) + c_1(l_0, l_1) + \ldots + c_{T-1}(l_{T-2}, l_{T-1})$.
Equivalently,
\[
\mathcal{B}^*(x) = \Big\{ (u_0, \ldots, u_{T-1}) \in U^T \,\Big|\, \exists \tau \in \arg\max_{\tau \in \mathcal{F}^T} \sum_{t=0}^{T-1} \big[ r^{l_t} - L_{Q_{T-t}} \| y^{l_{t-1}} - x^{l_t} \| \big] \text{ for which } D(\tau) = (u_0, \ldots, u_{T-1}) \Big\} .
\]
From this expression, we can notice that a se-
quence of one-step transitions τ such that D(τ) be-
longs to $\mathcal{B}^*(x)$ can be obtained by solving a short-
est path problem on the graph given in Figure 2. The
CGRL algorithm works by solving this problem using
the Viterbi algorithm and by applying the operator D
to the sequence of one-step transitions τ correspond-
ing to its solution. Its complexity is quadratic with
respect to the cardinality of the input sample F and
linear with respect to the optimization horizon T .
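The dynamic programming step itself is short. The sketch below is our reconstruction of the Viterbi-like computation described above (it is not the authors' implementation): transitions are tuples `(x_l, u_l, r_l, y_l)` of array-like states, `L_f` and `L_rho` are the known Lipschitz bounds, and the function returns an action sequence of $\mathcal{B}^*(x)$ together with the associated lower bound $\max_{u_0, \ldots, u_{T-1}} B_{u_0, \ldots, u_{T-1}}(x)$.

```python
import numpy as np


def cgrl(x0, F, T, L_f, L_rho):
    """Viterbi-like CGRL computation, O(T * |F|^2).

    Stage t has one node per transition of F; reaching transition j at stage t
    from transition i at stage t-1 adds r_j - L_{Q_{T-t}} * ||y_i - x_j||
    (with the initial state x0 playing the role of y at stage 0).
    """
    n = len(F)
    X = np.array([np.asarray(x, dtype=float) for (x, _, _, _) in F])  # start states x^l
    Y = np.array([np.asarray(y, dtype=float) for (_, _, _, y) in F])  # end states  y^l
    R = np.array([r for (_, _, r, _) in F], dtype=float)              # rewards     r^l
    L_Q = [L_rho * sum(L_f ** i for i in range(N)) for N in range(T + 1)]

    value = R - L_Q[T] * np.linalg.norm(X - np.asarray(x0, dtype=float), axis=1)  # stage 0
    parent = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        # penalty[i, j] = L_{Q_{T-t}} * ||y^i - x^j||, for all pairs of transitions
        penalty = L_Q[T - t] * np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)
        scores = value[:, None] - penalty            # value of extending path i with node j
        parent[t] = np.argmax(scores, axis=0)
        value = scores[parent[t], np.arange(n)] + R

    # backtrack the best sequence of transitions and read off its actions (operator D)
    j = int(np.argmax(value))
    best, path = float(value[j]), [j]
    for t in range(T - 1, 0, -1):
        j = int(parent[t][j])
        path.append(j)
    path.reverse()
    return [F[j][1] for j in path], best
```

The pairwise penalty matrix has size $|\mathcal{F}| \times |\mathcal{F}|$, in line with the quadratic complexity stated above; for very large datasets it would have to be computed blockwise.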
6 ILLUSTRATION
In this section, we illustrate the CGRL algorithm on a
variant of the puddle world benchmark introduced in
(Sutton, 1996). In this benchmark, a robot whose goal
is to collect high cumulated rewards navigates on a
plane. A puddle stands in between the initial position
of the robot and the high reward area. If the robot is
in the puddle, it gets highly negative rewards. An op-
timal navigation strategy drives the robot around the
puddle to reach the high reward area. Two datasets
of one-step transitions have been used in our exam-
ple. The first set $\mathcal{F}_1$ contains elements that uniformly cover the area of the state space that can be reached within $T$ steps. The set $\mathcal{F}_2$ has been obtained by removing from $\mathcal{F}_1$ the elements corresponding to the highly negative rewards.³ The full specification of the benchmark and the exact procedure for generating $\mathcal{F}_1$ and $\mathcal{F}_2$ are given in Appendix 8.4. In Figure 3, we
have drawn the trajectory of the robot when follow-
ing the sequence of actions computed by the CGRL
algorithm. Every state encountered is represented by
a white square. The plane upon which the robot nav-
igates has been colored such that the darker the area,
the smaller the corresponding rewards are. In particu-
lar, the puddle area is colored in dark grey/black. We
see that the CGRL policy drives the robot around the
puddle to reach the high-reward area which is rep-
resented by the light-grey circles. The CGRL algo-
rithm also computes a lower bound on the cumulated
rewards obtained by this action sequence. Here, we
found out that this lower bound was rather conserva-
tive.
³ Although this problem might be treated by on-line learning methods, in some settings, for whatever reason, on-line learning may be impractical and all one will have is a batch of trajectories.
Figure 3: CGRL with $\mathcal{F}_1$.
Figure 4: FQI with $\mathcal{F}_1$.
Figure 4 represents the policy inferred from $\mathcal{F}_1$ by
using the (finite-time version of the) Fitted Q Itera-
tion algorithm (FQI) combined with extremely ran-
domized trees as function approximators (Ernst et al.,
2005). The trajectories computed by the CGRL and
FQI algorithms are very similar and so are the sums
of rewards obtained by following these two trajecto-
ries. However, by using $\mathcal{F}_2$ rather than $\mathcal{F}_1$, the CGRL
and FQI algorithms do not lead to similar trajectories,
as shown in Figures 5 and 6. Indeed, while the
CGRL policy still drives the robot around the puddle
to reach the high reward area, the FQI policy makes
the robot cross the puddle. In terms of optimality,
this latter navigation strategy is much worse. The dif-
ference between both navigation strategies can be ex-
plained as follows. The FQI algorithm behaves as if
it were associating, with the areas of the state space that are not covered by the input sample, the properties of the elements of this sample located in their neighborhood. This in turn explains why it
computes a policy that makes the robot cross the pud-
dle. The same behavior could probably be observed
by using other algorithms that combine dynamic pro-
gramming strategies with kernel-based approximators
or averagers (Boyan and Moore, 1995; Gordon, 1999;
Figure 5: CGRL with $\mathcal{F}_2$.
Figure 6: FQI with $\mathcal{F}_2$.
Ormoneit and Sen, 2002). The CGRL algorithm generalizes the information contained in the dataset by assuming, given the initial state, the most adverse environment behavior that is consistent with its weak prior knowledge. As a result, the CGRL algorithm penalizes sequences of decisions that could drive the robot into areas not well covered by the sample, and this explains why it drives the robot around the puddle when run with $\mathcal{F}_2$.
7 DISCUSSION
The CGRL algorithm outputs a sequence of actions
as well as a lower bound on its return. When $L_f > 1$
(e.g. when the system is unstable), this lower bound
will decrease exponentially with T . This may lead
to very low performance guarantees when the opti-
mization horizon T is large. However, one can also
observe that the terms $L_{Q_{T-t}}$ which are responsible for the exponential decrease of the lower bound with the optimization horizon are multiplied by the distance between the end state of a one-step transition
and the beginning state of the next one-step transition ($\| y^{l_{t-1}} - x^{l_t} \|$) of the sequence $\tau$ that solves the shortest path problem of Figure 2. Therefore, if these states $y^{l_{t-1}}$ and $x^{l_t}$
are close to each other, the CGRL algo-
rithm can lead to good performance guarantees even
for large values of T . It is also important to notice that
this lower bound does not depend explicitly on the
sample sparsity $\alpha^*$
, but depends rather on the initial
state for which the sequence of actions is computed.
Therefore, this may lead to cases where the CGRL
algorithm provides good performance guarantees for
some specific initial states, even if the sample does
not cover every area of the state space well enough.
Other RL algorithms working in a setting similar to that of the CGRL algorithm, while not exploiting the weak
prior knowledge about the environment, do not output
a lower bound on the return of the policy h they infer
from the sample of trajectories F . However, some
lower bounds on the return of h can still be com-
puted. For instance, this can be done by exploiting
the results of (Fonteneau et al., 2009) upon which the
CGRL algorithm is based. However, one can show
that following the strategy described in (Fonteneau
et al., 2009) would necessarily lead to a bound lower
than the lower bound associated with the sequence of
actions computed by the CGRL algorithm. Another
strategy would be to design global lower bounds on
their policy by adapting proofs used to establish the
consistency of these algorithms. By way of example, by proceeding in this way, we can design a lower bound on the return of the policy given by the FQI
algorithm when combined with some specific approx-
imators which have, among others, Lipschitz continu-
ity properties. These algorithms compute a sequence of state-action value functions $\hat{Q}_1, \hat{Q}_2, \ldots, \hat{Q}_T$ and compute the policy $h : \{0, 1, \ldots, T-1\} \times X \to U$ defined as follows: $h(t, x_t) \in \arg\max_{u \in U} \hat{Q}_{T-t}(x_t, u)$. For in-
stance, when using kernel-based approximators (Ormoneit and Sen, 2002), the return of $h$ when starting from a state $x$ is larger than $\hat{Q}_T(x, h(0,x)) - (C_1 T + C_2 T^2) \cdot \alpha^*$, where $C_1$ and $C_2$ depend on $L_f$, $L_\rho$, the Lipschitz constants of the class of approximators and an upper bound on $\rho$. The explicit dependence of this lower bound on $\alpha^*$ as well as the large values of $C_1$ and $C_2$ tend to lead to a very conservative lower bound, especially when $\mathcal{F}$ is sparse.
8 CONCLUSIONS
We have proposed a new strategy for RL using a batch
of system transitions and some weak prior knowledge
about the environment behavior. It consists in max-
imizing lower bounds on the return that may be in-
ferred by combining information from a dataset of
observed system transitions and upper bounds on the
Lipschitz constants of the environment. The proposed
algorithm is of polynomial complexity and avoids re-
gions of the state space where the sample density is
too low according to the prior information. A sim-
ple example has illustrated that this strategy can lead
to cautious policies where other batch-mode RL algo-
rithms fail because they unsafely generalize the infor-
mation contained in the dataset.
From the results given in (Fonteneau et al., 2009),
it is also possible to derive in a similar way upper
bounds on the return of a policy. In this respect, it
would also be possible to adopt an optimistic gener-
alization strategy by inferring policies that maximize
these upper bounds. We believe that exploiting to-
gether the policy based on a cautious generalization
strategy and the one based on an optimistic general-
ization strategy could offer interesting possibilities for
addressing the exploitation-exploration tradeoff faced
when designing intelligent agents. For example, if the
policies coincide, it could be an indication that further
exploration is not needed.
When using batch mode reinforcement learning
algorithms to design autonomous intelligent agents,
a problem arises. After a long enough time of inter-
action with their environment, the sample the agents
collect may become so large that batch mode RL techniques may become computationally impractical, even with low-degree polynomial algorithms. As
suggested by (Ernst, 2005), a solution for addressing
this problem would be to retain only the most “in-
formative samples”. In the context of the proposed
algorithm, the complexity for computing the optimal
sequence of decisions is quadratic in the size of the
dataset. We believe that it would be interesting to de-
sign lower-complexity algorithms that subsample the dataset based on the initial state information.
Finally, while in this paper we have considered de-
terministic environments and open-loop strategies for
interacting with them, we believe that extending our
framework to closed-loop strategies (revising at each
time step the first stage decision in a receding horizon
approach) and studying their performance, in particular
in the context of stochastic environments, is a very
promising direction of further research.
ACKNOWLEDGEMENTS
Raphael Fonteneau acknowledges the financial sup-
port of the FRIA (Fund for Research in Industry and
Agriculture). Damien Ernst is a research associate
of the FRS-FNRS. This paper presents research re-
sults of the Belgian Network BIOMAGNET (Bioin-
formatics and Modeling: from Genomes to Net-
works), funded by the Interuniversity Attraction Poles
Programme, initiated by the Belgian State, Science
Policy Office. We also acknowledge financial support
from NIH grants P50 DA10075 and R01 MH080015.
The scientific responsibility rests with its authors.
REFERENCES
Bemporad, A. and Morari, M. (1999). Robust model pre-
dictive control: A survey. Robustness in Identification
and Control, 245:207–226.
Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Pro-
gramming. Athena Scientific.
Boyan, J. and Moore, A. (1995). Generalization in rein-
forcement learning: Safely approximating the value
function. In Advances in Neural Information Process-
ing Systems 7, pages 369–376. MIT Press.
Csáji, B. C. and Monostori, L. (2008). Value function based reinforcement learning in changing Markovian envi-
ronments. J. Mach. Learn. Res., 9:1679–1709.
Delage, E. and Mannor, S. (2006). Percentile optimization
for Markov decision processes with parameter uncer-
tainty. Operations Research.
Ernst, D. (2005). Selecting concise sets of samples for a
reinforcement learning agent. In Proceedings of the
Third International Conference on Computational In-
telligence, Robotics and Autonomous Systems (CIRAS
2005), page 6.
Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based
batch mode reinforcement learning. Journal of Ma-
chine Learning Research, 6:503–556.
Ernst, D., Glavic, M., Capitanescu, F., and Wehenkel, L.
(2009). Reinforcement learning versus model
predictive control: a comparison on a power system
problem. IEEE Transactions on Systems, Man, and
Cybernetics - Part B: Cybernetics, 39:517–529.
Fonteneau, R., Murphy, S., Wehenkel, L., and Ernst, D.
(2009). Inferring bounds on the performance of a
control policy from a sample of trajectories. In Pro-
ceedings of the 2009 IEEE Symposium on Adaptive
Dynamic Programming and Reinforcement Learning
(IEEE ADPRL 09), Nashville, TN, USA.
Gordon, G. (1999). Approximate Solutions to Markov De-
cision Processes. PhD thesis, Carnegie Mellon Uni-
versity.
Ingersoll, J. (1987). Theory of Financial Decision Making.
Rowman and Littlefield Publishers, Inc.
Lagoudakis, M. and Parr, R. (2003). Least-squares pol-
icy iteration. Journal of Machine Learning Research,
4:1107–1149.
Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. (2004).
Bias and variance in value function estimation. In Pro-
ceedings of the 21st International Conference on Ma-
chine Learning.
Murphy, S. (2003). Optimal dynamic treatment regimes.
Journal of the Royal Statistical Society, Series B,
65(2):331–366.
Murphy, S. (2005). An experimental design for the devel-
opment of adaptive treatment strategies. Statistics in
Medicine, 24:1455–1481.
Ormoneit, D. and Sen, S. (2002). Kernel-based reinforce-
ment learning. Machine Learning, 49(2-3):161–178.
Qian, M. and Murphy, S. (2009). Performance guarantee
for individualized treatment rules. Submitted.
Riedmiller, M. (2005). Neural fitted Q iteration - first ex-
periences with a data efficient neural reinforcement
learning method. In Proceedings of the Sixteenth
European Conference on Machine Learning (ECML
2005), pages 317–328.
Sutton, R. (1996). Generalization in reinforcement learn-
ing: Successful examples using sparse coding. In
Advances in Neural Information Processing Systems 8,
pages 1038–1044. MIT Press.
Sutton, R. and Barto, A. (1998). Reinforcement Learning.
MIT Press.
APPENDIX
8.1 Proof of Lemma 4.1
Before proving Lemma 4.1 in Section 8.1.2, we prove
in Section 8.1.1 a preliminary result related to the Lip-
schitz continuity of state-action value functions.
8.1.1 Lipschitz Continuity of the N-stage
State-action Value Functions
For $N = 1, \ldots, T$, let us define the family of state-action value functions $Q_N^{u_0, \ldots, u_{T-1}} : X \times U \to \mathbb{R}$ as follows:
\[
Q_N^{u_0, \ldots, u_{T-1}}(x, u) = \rho(x, u) + \sum_{t=T-N+1}^{T-1} \rho(x_t, u_t) ,
\]
where $x_{T-N+1} = f(x, u)$. $Q_N^{u_0, \ldots, u_{T-1}}(x, u)$ gives the sum of rewards from instant $t = T-N$ to instant $T-1$ when (i) the system is in state $x$ at instant $T-N$, (ii) the action chosen at instant $T-N$ is $u$ and (iii) the actions chosen at instants $t > T-N$ are $u_t$. The function $J^{u_0, \ldots, u_{T-1}}$ can be deduced from $Q_T^{u_0, \ldots, u_{T-1}}$ as follows:
\[
\forall x \in X, \quad J^{u_0, \ldots, u_{T-1}}(x) = Q_T^{u_0, \ldots, u_{T-1}}(x, u_0) . \qquad (2)
\]
We also have $\forall x \in X$, $\forall u \in U$,
\[
Q_{N+1}^{u_0, \ldots, u_{T-1}}(x, u) = \rho(x, u) + Q_N^{u_0, \ldots, u_{T-1}}( f(x, u), u_{T-N} ) . \qquad (3)
\]

Lemma (Lipschitz continuity of $Q_N^{u_0, \ldots, u_{T-1}}$). $\forall N \in \{1, \ldots, T\}$, $\forall x, x' \in X$, $\forall u \in U$,
\[
| Q_N^{u_0, \ldots, u_{T-1}}(x, u) - Q_N^{u_0, \ldots, u_{T-1}}(x', u) | \le L_{Q_N} \| x - x' \| ,
\]
with $L_{Q_N} = L_\rho \sum_{t=0}^{N-1} L_f^t$.
Proof. We consider the statement $H(N)$: $\forall x, x' \in X$, $\forall u \in U$,
\[
| Q_N^{u_0, \ldots, u_{T-1}}(x, u) - Q_N^{u_0, \ldots, u_{T-1}}(x', u) | \le L_{Q_N} \| x - x' \| .
\]
We prove by induction that $H(N)$ is true $\forall N \in \{1, \ldots, T\}$. For the sake of clarity, we denote $| Q_N^{u_0, \ldots, u_{T-1}}(x, u) - Q_N^{u_0, \ldots, u_{T-1}}(x', u) |$ by $\Delta_N$.

Basis ($N = 1$): We have $\Delta_1 = | \rho(x, u) - \rho(x', u) |$, and the Lipschitz continuity of $\rho$ allows us to write $\Delta_1 \le L_\rho \| x - x' \|$. This proves $H(1)$.

Induction step: We suppose that $H(N)$ is true, $1 \le N \le T-1$. Using equation (3), we can write
\[
\Delta_{N+1} = \big| \rho(x, u) - \rho(x', u) + Q_N^{u_0, \ldots, u_{T-1}}( f(x, u), u_{T-N} ) - Q_N^{u_0, \ldots, u_{T-1}}( f(x', u), u_{T-N} ) \big|
\]
and, from there,
\[
\Delta_{N+1} \le \big| \rho(x, u) - \rho(x', u) \big| + \big| Q_N^{u_0, \ldots, u_{T-1}}( f(x, u), u_{T-N} ) - Q_N^{u_0, \ldots, u_{T-1}}( f(x', u), u_{T-N} ) \big| .
\]
$H(N)$ and the Lipschitz continuity of $\rho$ give
\[
\Delta_{N+1} \le L_\rho \| x - x' \| + L_{Q_N} \| f(x, u) - f(x', u) \| .
\]
The Lipschitz continuity of $f$ gives
\[
\Delta_{N+1} \le L_\rho \| x - x' \| + L_{Q_N} L_f \| x - x' \| ,
\]
then $\Delta_{N+1} \le L_{Q_{N+1}} \| x - x' \|$ since $L_{Q_{N+1}} = L_\rho + L_{Q_N} L_f$. This proves $H(N+1)$ and ends the proof.
8.1.2 Proof of Lemma 4.1
By assumption we have $u^{l_0} = u_0$; then we use equation (2) and the Lipschitz continuity of $Q_T^{u_0, \ldots, u_{T-1}}$ to write
\[
| J^{u_0, \ldots, u_{T-1}}(x) - Q_T^{u_0, \ldots, u_{T-1}}(x^{l_0}, u_0) | \le L_{Q_T} \| x - x^{l_0} \| .
\]
It follows that
\[
Q_T^{u_0, \ldots, u_{T-1}}(x^{l_0}, u_0) - L_{Q_T} \| x - x^{l_0} \| \le J^{u_0, \ldots, u_{T-1}}(x) .
\]
According to equation (3), we have $Q_T^{u_0, \ldots, u_{T-1}}(x^{l_0}, u_0) = \rho(x^{l_0}, u_0) + Q_{T-1}^{u_0, \ldots, u_{T-1}}( f(x^{l_0}, u_0), u_1 )$ and from there $Q_T^{u_0, \ldots, u_{T-1}}(x^{l_0}, u_0) = r^{l_0} + Q_{T-1}^{u_0, \ldots, u_{T-1}}( y^{l_0}, u_1 )$. Thus,
\[
Q_{T-1}^{u_0, \ldots, u_{T-1}}( y^{l_0}, u_1 ) + r^{l_0} - L_{Q_T} \| x - x^{l_0} \| \le J^{u_0, \ldots, u_{T-1}}(x) .
\]
The Lipschitz continuity of $Q_{T-1}^{u_0, \ldots, u_{T-1}}$ with $u_1 = u^{l_1}$ gives
\[
| Q_{T-1}^{u_0, \ldots, u_{T-1}}( y^{l_0}, u_1 ) - Q_{T-1}^{u_0, \ldots, u_{T-1}}( x^{l_1}, u^{l_1} ) | \le L_{Q_{T-1}} \| y^{l_0} - x^{l_1} \| .
\]
This implies that
\[
Q_{T-1}^{u_0, \ldots, u_{T-1}}( x^{l_1}, u_1 ) - L_{Q_{T-1}} \| y^{l_0} - x^{l_1} \| \le Q_{T-1}^{u_0, \ldots, u_{T-1}}( y^{l_0}, u_1 ) .
\]
We have therefore
\[
Q_{T-1}^{u_0, \ldots, u_{T-1}}( x^{l_1}, u_1 ) + r^{l_0} - L_{Q_T} \| x - x^{l_0} \| - L_{Q_{T-1}} \| y^{l_0} - x^{l_1} \| \le J^{u_0, \ldots, u_{T-1}}(x) .
\]
The proof is completed by developing this iteration.
8.2 Proof of Theorem 4.2
Let $(x_0, u_0, r_0, x_1, u_1, \ldots, x_{T-1}, u_{T-1}, r_{T-1}, x_T)$ be the trajectory of an agent starting from $x_0 = x$ when following the open-loop policy $u_0, \ldots, u_{T-1}$. Using inequality (1), we define $\tau = [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1} \in \mathcal{F}^T_{u_0, \ldots, u_{T-1}}$ that satisfies $\forall t \in \{0, 1, \ldots, T-1\}$
\[
\| x^{l_t} - x_t \| = \min_{(x^l, u^l, r^l, y^l) \in \mathcal{F}_{u_t}} \| x^l - x_t \| \le \alpha^* . \qquad (4)
\]
We have $B(\tau, x) = \sum_{t=0}^{T-1} [ r^{l_t} - L_{Q_{T-t}} \| y^{l_{t-1}} - x^{l_t} \| ]$ with $y^{l_{-1}} = x$. Let us focus on $\| y^{l_{t-1}} - x^{l_t} \|$. We have $\| y^{l_{t-1}} - x^{l_t} \| = \| x^{l_t} - x_t + x_t - y^{l_{t-1}} \|$, and hence $\| y^{l_{t-1}} - x^{l_t} \| \le \| x^{l_t} - x_t \| + \| x_t - y^{l_{t-1}} \|$. Using inequality (4), we can write
\[
\| y^{l_{t-1}} - x^{l_t} \| \le \alpha^* + \| x_t - y^{l_{t-1}} \| . \qquad (5)
\]
For $t = 0$, one has $\| x_t - y^{l_{t-1}} \| = \| x_0 - x_0 \| = 0$. For $t > 0$, $\| x_t - y^{l_{t-1}} \| = \| f(x_{t-1}, u_{t-1}) - f(x^{l_{t-1}}, u_{t-1}) \|$ and the Lipschitz continuity of $f$ implies that $\| x_t - y^{l_{t-1}} \| \le L_f \| x_{t-1} - x^{l_{t-1}} \|$. So, as $\| x_{t-1} - x^{l_{t-1}} \| \le \alpha^*$, we have
\[
\forall t > 0, \quad \| x_t - y^{l_{t-1}} \| \le L_f \alpha^* . \qquad (6)
\]
Equations (5) and (6) imply that for $t > 0$, $\| y^{l_{t-1}} - x^{l_t} \| \le \alpha^* (1 + L_f)$ and, for $t = 0$, $\| y^{l_{-1}} - x^{l_0} \| \le \alpha^* \le \alpha^* (1 + L_f)$. This gives
\[
B(\tau, x) \ge \sum_{t=0}^{T-1} \big[ r^{l_t} - L_{Q_{T-t}} \alpha^* (1 + L_f) \big] \doteq B .
\]
We also have, by definition of $B_{u_0, \ldots, u_{T-1}}(x)$,
\[
J^{u_0, \ldots, u_{T-1}}(x) \ge B_{u_0, \ldots, u_{T-1}}(x) \ge B(\tau, x) \ge B .
\]
Thus,
\[
| J^{u_0, \ldots, u_{T-1}}(x) - B_{u_0, \ldots, u_{T-1}}(x) | \le | J^{u_0, \ldots, u_{T-1}}(x) - B | = J^{u_0, \ldots, u_{T-1}}(x) - B
= \Big| \sum_{t=0}^{T-1} \big[ (r_t - r^{l_t}) + L_{Q_{T-t}} \alpha^* (1 + L_f) \big] \Big|
\le \sum_{t=0}^{T-1} \big[ | r_t - r^{l_t} | + L_{Q_{T-t}} \alpha^* (1 + L_f) \big] .
\]
The Lipschitz continuity of $\rho$ allows us to write $| r_t - r^{l_t} | = | \rho(x_t, u_t) - \rho(x^{l_t}, u_t) | \le L_\rho \| x_t - x^{l_t} \|$, and using inequality (4), we have $| r_t - r^{l_t} | \le L_\rho \alpha^*$. Finally, we obtain
\[
J^{u_0, \ldots, u_{T-1}}(x) - B \le \sum_{t=0}^{T-1} \big[ L_\rho \alpha^* + L_{Q_{T-t}} \alpha^* (1 + L_f) \big]
\le T L_\rho \alpha^* + \sum_{t=0}^{T-1} L_{Q_{T-t}} \alpha^* (1 + L_f)
\le \alpha^* \Big[ T L_\rho + \sum_{t=0}^{T-1} L_{Q_{T-t}} (1 + L_f) \Big] .
\]
Thus
\[
J^{u_0, \ldots, u_{T-1}}(x) - B_{u_0, \ldots, u_{T-1}}(x) \le \Big[ T L_\rho + (1 + L_f) \sum_{t=0}^{T-1} L_{Q_{T-t}} \Big] \alpha^* ,
\]
which completes the proof.
8.3 Proof of Theorem 5.1
Let us prove this by reductio ad absurdum. Suppose that the algorithm does not return an optimal sequence of actions, which means that
\[
J^{\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)}(x) \le J^*(x) - \varepsilon(x) .
\]
Let us consider a sequence $(u^*_0(x), \ldots, u^*_{T-1}(x)) \in \mathcal{J}^*(x)$. Then $J^{u^*_0(x), \ldots, u^*_{T-1}(x)}(x) = J^*(x)$. The lower bound $B_{u^*_0(x), \ldots, u^*_{T-1}(x)}(x)$ satisfies the relationship
\[
J^*(x) - B_{u^*_0(x), \ldots, u^*_{T-1}(x)}(x) \le C \alpha^* .
\]
Knowing that $C \alpha^* < \varepsilon(x)$, we have
\[
B_{u^*_0(x), \ldots, u^*_{T-1}(x)}(x) > J^*(x) - \varepsilon(x) .
\]
By definition of $\varepsilon(x)$,
\[
J^*(x) - \varepsilon(x) \ge J^{\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)}(x) ,
\]
and since
\[
J^{\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)}(x) \ge B_{\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)}(x) ,
\]
we have
\[
B_{u^*_0(x), \ldots, u^*_{T-1}(x)}(x) > B_{\hat{u}_0(x), \ldots, \hat{u}_{T-1}(x)}(x) ,
\]
which contradicts the fact that the algorithm returns the sequence that leads to the highest lower bound. This ends the proof.
8.4 Experimental Specifications
The puddle world benchmark is defined by $X = \mathbb{R}^2$,
\[
U = \left\{ \begin{pmatrix} 0.1 \\ 0 \end{pmatrix}, \begin{pmatrix} -0.1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0.1 \end{pmatrix}, \begin{pmatrix} 0 \\ -0.1 \end{pmatrix} \right\},
\qquad f(x, u) = x + u ,
\]
\[
\rho(x, u) = k_1 N_{\mu_1, \Sigma_1}(x) - k_2 N_{\mu_2, \Sigma_2}(x) - k_3 N_{\mu_3, \Sigma_3}(x) ,
\]
with
\[
N_{\mu, \Sigma}(x) = \frac{1}{2 \pi \sqrt{|\Sigma|}} \, e^{- \frac{(x - \mu) \Sigma^{-1} (x - \mu)'}{2}} ,
\]
\[
\mu_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 0.225 \\ 0.75 \end{pmatrix}, \quad \mu_3 = \begin{pmatrix} 0.45 \\ 0.6 \end{pmatrix},
\]
\[
\Sigma_1 = \begin{pmatrix} 0.005 & 0 \\ 0 & 0.005 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 0.05 & 0 \\ 0 & 0.001 \end{pmatrix}, \quad \Sigma_3 = \begin{pmatrix} 0.001 & 0 \\ 0 & 0.05 \end{pmatrix}
\]
and $k_1 = 1$, $k_2 = k_3 = 20$. The Euclidean norm is used. $L_f = 1$, $L_\rho = 1.3742 \cdot 10^6$, $T = 25$, initial state $x = (0.35, 0.65)$. The sets of one-step system transitions are
\[
\mathcal{F}_1 = \{ (x, u, \rho(x, u), f(x, u)) \,|\, x = (-2.15 + i \cdot 5/203, \, -1.85 + j \cdot 5/203), \; i, j = 1 : 203 \} ,
\]
\[
\mathcal{F}_2 = \mathcal{F}_1 \setminus \{ (x, u, r, y) \in \mathcal{F}_1 \,|\, x \in [0.4, 0.5] \times [0.25, 0.95] \,\cup\, [0.1, 0.6] \times [0.7, 0.8] \} .
\]
The FQI algorithm combined with extremely ran-
domized trees is run using its default parameters given
in (Ernst et al., 2005).
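For completeness, the specification above translates directly into code. The sketch below is ours (not the authors' implementation) and reads the grid offsets as $-2.15$ and $-1.85$, i.e. so that the grid covers the area reachable from the initial state within $T = 25$ steps; $\mathcal{F}_2$ is then obtained by filtering out the transitions whose start state lies in the two rectangles given above.

```python
import itertools
import numpy as np

U = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
MU = [np.array([1.0, 1.0]), np.array([0.225, 0.75]), np.array([0.45, 0.6])]
SIGMA = [np.diag([0.005, 0.005]), np.diag([0.05, 0.001]), np.diag([0.001, 0.05])]
K = [1.0, 20.0, 20.0]


def gaussian(x, mu, sigma):
    """N_{mu,Sigma}(x) = exp(-(x-mu) Sigma^{-1} (x-mu)' / 2) / (2 pi sqrt(|Sigma|))."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d) / (2 * np.pi * np.sqrt(np.linalg.det(sigma)))


def f(x, u):
    return tuple(np.asarray(x, dtype=float) + np.asarray(u, dtype=float))


def rho(x, u):
    x = np.asarray(x, dtype=float)   # the reward only depends on the state here
    return (K[0] * gaussian(x, MU[0], SIGMA[0])
            - K[1] * gaussian(x, MU[1], SIGMA[1])
            - K[2] * gaussian(x, MU[2], SIGMA[2]))


def make_F1():
    """F_1: one transition per grid point and action, on a 203 x 203 grid."""
    grid = [(-2.15 + i * 5.0 / 203, -1.85 + j * 5.0 / 203)
            for i, j in itertools.product(range(1, 204), repeat=2)]
    return [(x, u, float(rho(x, u)), f(x, u)) for x in grid for u in U]
```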