ACKNOWLEDGEMENTS
Raphael Fonteneau acknowledges the financial support of the FRIA (Fund for Research in Industry and Agriculture). Damien Ernst is a research associate of the FRS-FNRS. This paper presents research results of the Belgian Network BIOMAGNET (Bioinformatics and Modeling: from Genomes to Networks), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.
APPENDIX
8.1 Proof of Lemma 4.1
Before proving Lemma 4.1 in Section 8.1.2, we prove in Section 8.1.1 a preliminary result related to the Lipschitz continuity of the state-action value functions.

8.1.1 Lipschitz Continuity of the N-stage State-action Value Functions
For $N = 1,\ldots,T$, let us define the family of state-action value functions $Q_N^{u_0,\ldots,u_{T-1}} : X \times U \to \mathbb{R}$ as follows:
$$Q_N^{u_0,\ldots,u_{T-1}}(x,u) = \rho(x,u) + \sum_{t=T-N+1}^{T-1} \rho(x_t,u_t),$$
where $x_{T-N+1} = f(x,u)$. The value $Q_N^{u_0,\ldots,u_{T-1}}(x,u)$ gives the sum of rewards from instant $t = T-N$ to instant $T-1$ when (i) the system is in state $x$ at instant $T-N$, (ii) the action chosen at instant $T-N$ is $u$, and (iii) the actions chosen at instants $t > T-N$ are $u_t$. The function $J^{u_0,\ldots,u_{T-1}}$ can be deduced from $Q_N^{u_0,\ldots,u_{T-1}}$ as follows:
$$\forall x \in X, \quad J^{u_0,\ldots,u_{T-1}}(x) = Q_T^{u_0,\ldots,u_{T-1}}(x,u_0). \qquad (2)$$
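As an illustration, the computation of $Q_N^{u_0,\ldots,u_{T-1}}(x,u)$ can be sketched in Python by simulating the deterministic system forward, assuming a dynamics f, a reward function rho, a horizon T and an open-loop action sequence actions are available (all names below are illustrative placeholders, not part of the original formulation):

    def q_n(x, u, N, T, actions, f, rho):
        # Sum of rewards from instant T-N to instant T-1 when the system is
        # in state x at instant T-N, the first action is u, and the actions
        # chosen at later instants t are actions[t].
        total = rho(x, u)          # reward collected at instant T-N
        x_t = f(x, u)              # x_{T-N+1} = f(x, u)
        for t in range(T - N + 1, T):
            total += rho(x_t, actions[t])
            x_t = f(x_t, actions[t])   # advance the deterministic system
        return total

    # By Equation (2), the T-stage function recovers the return of the full
    # open-loop action sequence: J(x) = q_n(x, actions[0], T, T, actions, f, rho).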