APPENDIX
6.1 BRL Algorithms
Each algorithm considered in our experiments is described in detail below. For each algorithm, a list of "reasonable" values is provided for each of its parameters. When an algorithm has more than one parameter, all possible parameter combinations are tested.
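As an illustration of this protocol, the following Python sketch enumerates every combination of a small, purely hypothetical parameter grid (the parameter names and values are illustrative, not those used in the experiments):

    from itertools import product

    # Hypothetical parameter grid; names and values are illustrative only.
    param_grid = {"epsilon": [0.0, 0.1, 0.2], "tau": [0.33, 1.0]}

    # Each combination of parameter values defines one agent configuration.
    for values in product(*param_grid.values()):
        settings = dict(zip(param_grid.keys(), values))
        print(settings)  # placeholder for running one experiment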
6.1.1 Random
At each time-step t, the action u_t is drawn uniformly from U.
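A minimal sketch of this agent, assuming the action space U is available as a Python list:

    import random

    def random_action(U):
        # Draw u_t uniformly from the action space U.
        return random.choice(U)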
6.1.2 ε-Greedy
The ε-Greedy agent maintains an approximation of the current MDP and computes, at each time-step, its associated Q-function. The action is selected either randomly (with probability ε, 0 ≤ ε ≤ 1) or greedily (with probability 1 − ε) with respect to the approximated model.
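A minimal sketch of the action-selection step, assuming Q is a Python dictionary mapping (state, action) pairs to the Q-values of the approximated MDP (this interface is an assumption, not the implementation used in the experiments):

    import random

    def epsilon_greedy_action(Q, x_t, U, epsilon):
        # With probability epsilon, explore with a uniformly random action;
        # otherwise exploit the Q-function of the approximated model.
        if random.random() < epsilon:
            return random.choice(U)
        return max(U, key=lambda u: Q[(x_t, u)])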
Tested Values:
ε ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}.
6.1.3 Soft-max
The Soft-max agent maintains an approximation of the current MDP and computes, at each time-step, its associated Q-function. The action is drawn randomly, with the probability of selecting an action u proportional to exp(Q(x_t, u)/τ). The temperature parameter τ controls the impact of the Q-function on these probabilities (τ → 0+: greedy selection; τ → +∞: uniformly random selection).
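A minimal sketch of the corresponding Boltzmann action selection, under the same hypothetical Q interface as in the ε-Greedy sketch:

    import math
    import random

    def softmax_action(Q, x_t, U, tau):
        # Draw u with probability proportional to exp(Q(x_t, u) / tau).
        # Subtracting the maximum Q-value avoids numerical overflow.
        q_max = max(Q[(x_t, u)] for u in U)
        weights = [math.exp((Q[(x_t, u)] - q_max) / tau) for u in U]
        return random.choices(U, weights=weights, k=1)[0]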
Tested Values:
τ ∈ {0.05, 0.10, 0.20, 0.33, 0.50, 1.0, 2.0, 3.0, 5.0, 25.0}.
6.1.4 OPPS
Given a prior distribution p^0_M(.) and an E/E strategy space S, the Offline, Prior-based Policy Search algorithm (OPPS) identifies a strategy π* ∈ S which maximises the expected discounted sum of returns