Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, Damien Ernst


In the context of a deterministic Lipschitz-continuous environment over continuous state spaces, finite action spaces, and a finite optimization horizon, we propose an algorithm of polynomial complexity that exploits weak prior knowledge about its environment to compute, from a given sample of trajectories and for a given initial state, a sequence of actions. The proposed Viterbi-like algorithm maximizes a recently proposed lower bound on the return that depends on the initial state, and to this end uses prior knowledge about the environment provided in the form of upper bounds on its Lipschitz constants. It thereby avoids, in a way that depends on the initial state and on the prior knowledge, those regions of the state space where the sample is too sparse to allow safe generalization. Our experiments show that it can lead to more cautious policies than algorithms combining dynamic programming with function approximators. We also give a condition on the sample sparsity ensuring that, for a given initial state, the proposed algorithm produces an optimal open-loop sequence of actions.
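The bound maximization described above can be sketched with a dynamic-programming pass over the sample of one-step transitions. The sketch below is a minimal 1-D illustration under stated assumptions, not the authors' implementation: the names `lipschitz_q` and `cautious_open_loop` are hypothetical, and the Lipschitz constant of the n-step value function is taken as L_Q(n) = L_rho * (1 + L_f + ... + L_f^(n-1)), with the lower bound on the return of a candidate transition sequence equal to the summed rewards minus Lipschitz penalties on the distances between consecutive transitions.

```python
import numpy as np

def lipschitz_q(L_f, L_rho, n):
    """Upper bound on the Lipschitz constant of the n-step value function:
    L_Q(n) = L_rho * sum_{i=0}^{n-1} L_f**i (assumed form)."""
    return L_rho * sum(L_f ** i for i in range(n))

def cautious_open_loop(sample, x0, T, L_f, L_rho):
    """Viterbi-like maximization of a lower bound on the T-step return.

    sample : list of one-step transitions (x, u, r, y), with y = f(x, u)
    Returns (best_bound, action_sequence). Complexity O(T * n^2).
    """
    n = len(sample)
    xs = np.array([t[0] for t in sample], dtype=float)
    rs = np.array([t[2] for t in sample], dtype=float)
    ys = np.array([t[3] for t in sample], dtype=float)

    # W[l] = best partial bound over length-k tails starting with transition l,
    # counting only the penalties between consecutive transitions in the tail.
    W = rs.copy()                          # k = 1: just the reward
    succ = [[-1] * n for _ in range(T)]    # backpointers per remaining horizon
    for k in range(2, T + 1):
        L_Q = lipschitz_q(L_f, L_rho, k - 1)
        W_new = np.empty(n)
        for l in range(n):
            # penalty for jumping from the end y_l to the start x of the next
            cand = W - L_Q * np.abs(xs - ys[l])   # 1-D state space here
            j = int(np.argmax(cand))
            W_new[l] = rs[l] + cand[j]
            succ[k - 1][l] = j
        W = W_new

    # attach the given initial state with the outermost penalty L_Q(T)
    L_Q = lipschitz_q(L_f, L_rho, T)
    start = W - L_Q * np.abs(xs - x0)
    l = int(np.argmax(start))
    bound = float(start[l])

    actions = []
    for k in range(T, 0, -1):
        actions.append(sample[l][1])
        l = succ[k - 1][l]
    return bound, actions
```

Because the penalty grows with the distance between the initial state (or a transition endpoint) and the nearest sampled transition, the maximizing sequence naturally stays within densely sampled regions, which is the cautious behavior the abstract refers to.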


  1. Bemporad, A. and Morari, M. (1999). Robust model predictive control: A survey. Robustness in Identification and Control, 245:207-226.
  2. Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.
  3. Boyan, J. and Moore, A. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7, pages 369-376. MIT Press.
  4. Csáji, B. C. and Monostori, L. (2008). Value function based reinforcement learning in changing Markovian environments. Journal of Machine Learning Research, 9:1679-1709.
  5. Delage, E. and Mannor, S. (2006). Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research.
  6. Ernst, D. (2005). Selecting concise sets of samples for a reinforcement learning agent. In Proceedings of the Third International Conference on Computational Intelligence, Robotics and Autonomous Systems (CIRAS 2005), page 6.
  7. Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503-556.
  8. Ernst, D., Glavic, M., Capitanescu, F., and Wehenkel, L. (2009). Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 39:517-529.
  9. Fonteneau, R., Murphy, S., Wehenkel, L., and Ernst, D. (2009). Inferring bounds on the performance of a control policy from a sample of trajectories. In Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (IEEE ADPRL 09), Nashville, TN, USA.
  10. Gordon, G. (1999). Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University.
  11. Ingersoll, J. (1987). Theory of Financial Decision Making. Rowman and Littlefield Publishers, Inc.
  12. Lagoudakis, M. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149.
  13. Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. (2004). Bias and variance in value function estimation. In Proceedings of the 21st International Conference on Machine Learning.
  14. Murphy, S. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society, Series B, 65(2):331-366.
  15. Murphy, S. (2005). An experimental design for the development of adaptive treatment strategies. Statistics in Medicine, 24:1455-1481.
  16. Ormoneit, D. and Sen, S. (2002). Kernel-based reinforcement learning. Machine Learning, 49(2-3):161-178.
  17. Qian, M. and Murphy, S. (2009). Performance guarantee for individualized treatment rules. Submitted.
  18. Riedmiller, M. (2005). Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Proceedings of the Sixteenth European Conference on Machine Learning (ECML 2005), pages 317-328.
  19. Sutton, R. (1996). Generalization in reinforcement learning: Successful examples using sparse coding. In Advances in Neural Information Processing Systems 8, pages 1038-1044. MIT Press.
  20. Sutton, R. and Barto, A. (1998). Reinforcement Learning. MIT Press.

Paper Citation

in Harvard Style

Fonteneau R., Murphy S., Wehenkel L. and Ernst D. (2010). A CAUTIOUS APPROACH TO GENERALIZATION IN REINFORCEMENT LEARNING. In Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-674-021-4, pages 64-73. DOI: 10.5220/0002726900640073

in Bibtex Style

@conference{icaart10,
author={Raphael Fonteneau and Susan A. Murphy and Louis Wehenkel and Damien Ernst},
title={A CAUTIOUS APPROACH TO GENERALIZATION IN REINFORCEMENT LEARNING},
booktitle={Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2010},
pages={64-73},
doi={10.5220/0002726900640073},
isbn={978-989-674-021-4},
}

in EndNote Style

TY - CONF
TI - A CAUTIOUS APPROACH TO GENERALIZATION IN REINFORCEMENT LEARNING
JO - Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
SN - 978-989-674-021-4
AU - Fonteneau R.
AU - Murphy S.
AU - Wehenkel L.
AU - Ernst D.
PY - 2010
SP - 64
EP - 73
DO - 10.5220/0002726900640073
ER -