up to a certain depth. Exceptions include (Poupart et al., 2006), which uses an analytical bound based on sampling a small set of beliefs, and (Wang et al., 2005), which uses the sparse sampling algorithm of Kearns et al. (1999) to expand the tree. Both methods have complexity exponential in the horizon, which we improve upon via the smoothness properties induced by Bayesian updating.
There are also connections with work on POMDP problems (Ross et al., 2008). However, that setting, though equivalent in an abstract sense, is not sufficiently close to the one we consider. Results on bandit problems, employing the same value function bounds used herein, were reported in (Dimitrakakis, 2008), which experimentally compared algorithms operating on leaf nodes only.
Related results on the online sample complexity of Bayesian RL were developed by Kolter and Ng (2009), who employ an upper bound different from ours, and by Asmuth et al. (2009), who employ MDP samples to plan in an augmented MDP space, similarly to Auer et al. (2008) (who consider the set of plausible MDPs), and use Bayesian concentration-of-measure results (Zhang, 2006) to prove mistake bounds on the online performance of their algorithm.
Interestingly, Alg. 4 resembles HOO (Bubeck et al., 2008) in the way it traverses the tree, with two major differences: (a) the search is adapted to stochastic trees; (b) we use means of samples of upper bounds, rather than upper bounds on sample means. For these reasons, we cannot simply restate the arguments of (Bubeck et al., 2008).
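To make distinction (b) concrete, the following minimal Python sketch contrasts the two quantities. The function names, the UCB-style confidence term, and its constant are illustrative assumptions, not the exact definitions used by either algorithm.

```python
import math


def upper_bound_on_sample_mean(rewards, t, c=1.0):
    """HOO/UCB-style quantity (assumed generic form): the empirical mean of
    observed rewards plus a confidence term that shrinks as the node
    accumulates more samples."""
    n = len(rewards)
    return sum(rewards) / n + c * math.sqrt(math.log(t) / n)


def mean_of_sampled_upper_bounds(sampled_bounds):
    """Quantity used here: each sample is itself a stochastic upper bound on
    the node's value (e.g. the optimal value of an MDP drawn from the
    current belief), and these samples are simply averaged."""
    return sum(sampled_bounds) / len(sampled_bounds)
```

Because the averaged quantities are themselves random upper bounds rather than observed rewards, the concentration arguments for the former case do not transfer directly to the latter.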
We presented complexity results and counting arguments for a number of tree search algorithms on trees where stochastic upper and lower bounds satisfying a smoothness property exist. These are the first results of this type, and they partially extend the results of (Norkin et al., 1998), which provided an asymptotic convergence proof, under similar smoothness conditions, for a stochastic branch and bound algorithm. In addition, we introduced a mechanism to utilise samples obtained at inner nodes when calculating mean upper bounds at leaf nodes (an illustrative sketch is given below). Finally, we related our complexity results to those of (Kearns et al., 1999), for whose lower bound we provided a small improvement. We plan to address the online sample complexity of the proposed algorithms, as well as their practical performance, in future work.
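The sketch below illustrates the general idea of the inner-node mechanism: upper-bound samples stored at a leaf's ancestors are pooled with the leaf's own fresh samples when estimating its mean upper bound. The node structure, the discount-based slack term, and the constants are assumptions made for exposition only, not the paper's exact construction.

```python
from dataclasses import dataclass, field
from typing import List, Optional

GAMMA = 0.95                   # discount factor (assumed)
V_MAX = 1.0 / (1.0 - GAMMA)    # value range bound for rewards in [0, 1] (assumed)


@dataclass
class Node:
    """Belief-tree node storing the upper-bound samples drawn when it was expanded."""
    depth: int = 0
    parent: Optional["Node"] = None
    stored_samples: List[float] = field(default_factory=list)


def leaf_mean_upper_bound(leaf: Node, fresh_samples: List[float]) -> float:
    """Estimate a leaf's mean upper bound from its own fresh samples plus
    samples previously drawn at ancestor (inner) nodes.  Each reused sample
    is inflated by a smoothness slack so that it remains a valid, if looser,
    upper bound at the deeper node (the slack's form is assumed here)."""
    pooled = list(fresh_samples)
    node = leaf.parent
    while node is not None:
        slack = (GAMMA ** node.depth) * V_MAX  # shrinks as the ancestor lies deeper
        pooled.extend(s + slack for s in node.stored_samples)
        node = node.parent
    return sum(pooled) / len(pooled)
```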
ACKNOWLEDGEMENTS
This work was part of the ICIS project, supported by the Dutch Ministry of Economic Affairs, grant nr. BSIK03024. I would like to thank the anonymous reviewers, as well as colleagues at the University of Amsterdam, Leoben, and TU Crete, for their comments on earlier versions of this paper.
REFERENCES
Asmuth, J., Li, L., Littman, M. L., Nouri, A., and Wingate,
D. (2009). A Bayesian sampling approach to explo-
ration in reinforcement learning. In UAI 2009.
Auer, P., Jaksch, T., and Ortner, R. (2008). Near-optimal
regret bounds for reinforcement learning. In Proceed-
ings of NIPS 2008.
Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. (2008). Online optimization in X-armed bandits. In NIPS, pages 201–208.
Dimitrakakis, C. (2008). Tree exploration for Bayesian RL
exploration. In CIMCA’08, pages 1029–1034, Los
Alamitos, CA, USA. IEEE Computer Society.
Dimitrakakis, C. (2009). Complexity of stochastic branch
and bound for belief tree search in Bayesian rein-
forcement learning. Technical Report IAS-UVA-09-
01, University of Amsterdam.
Duff, M. O. (2002). Optimal Learning: Computational Procedures for Bayes-adaptive Markov Decision Processes. PhD thesis, University of Massachusetts at Amherst.
Kearns, M. J., Mansour, Y., and Ng, A. Y. (1999). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In Dean, T., editor, IJCAI, pages 1324–1331. Morgan Kaufmann.
Kolter, J. Z. and Ng, A. Y. (2009). Near-Bayesian explo-
ration in polynomial time. In ICML 2009.
Norkin, V. I., Pflug, G. C., and Ruszczyński, A. (1998). A branch and bound method for stochastic global optimization. Mathematical Programming, 83(1):425–450.
Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An analytic solution to discrete Bayesian reinforcement learning. In ICML 2006, pages 697–704. ACM Press, New York, NY, USA.
Puterman, M. L. (1994, 2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New Jersey, US.
Ross, S., Pineau, J., Paquet, S., and Chaib-draa, B. (2008). Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32:663–704.
Wang, T., Lizotte, D., Bowling, M., and Schuurmans, D.
(2005). Bayesian sparse sampling for on-line reward
optimization. In ICML ’05, pages 956–963, New
York, NY, USA. ACM.
Zhang, T. (2006). From ε-entropy to KL-entropy: Analysis
of minimum information complexity density estima-
tion. Annals of Statistics, 34(5):2180–2210.