COMPLEXITY OF STOCHASTIC BRANCH AND BOUND
METHODS FOR BELIEF TREE SEARCH IN BAYESIAN
REINFORCEMENT LEARNING
Christos Dimitrakakis
University of Amsterdam, The Netherlands
Keywords:
Exploration, Bayesian reinforcement learning, Belief tree search, Complexity, PAC bounds.
Abstract:
There has been a lot of recent work on Bayesian methods for reinforcement learning exhibiting near-optimal
online performance. The main obstacle facing such methods is that in most problems of interest, the optimal
solution involves planning in an infinitely large tree. However, it is possible to obtain stochastic lower and
upper bounds on the value of each tree node. This enables us to use stochastic branch and bound algorithms to
search the tree efficiently. This paper proposes some algorithms and examines their complexity in this setting.
1 INTRODUCTION
Various Bayesian methods for exploration in Markov decision processes (MDPs) and for solving known partially observable Markov decision processes (POMDPs) have been proposed previously (cf. (Poupart et al., 2006; Duff, 2002; Ross et al., 2008)). However, such methods often suffer from computational tractability problems. Optimal Bayesian exploration requires the creation of an augmented MDP model in the form of a tree (Duff, 2002), where the root node is the current belief-state pair and the children are all possible subsequent belief-state pairs. The size of the belief tree increases exponentially with the horizon, while the branching factor is infinite in the case of continuous observations or actions.
In this work, we examine the complexity of ef-
ficient algorithms for expanding the tree. In particu-
lar, we propose and analyse stochastic search methods
similar to the ones proposed in (Bubeck et al., 2008;
Norkin et al., 1998). Related methods have been
previously examined experimentally in the context
of Bayesian reinforcement learning in (Dimitrakakis,
2008; Wang et al., 2005).
The remainder of this section summarises the
Bayesian planning framework. Our main results are
presented in Sect. 2. Section 3 concludes with a dis-
cussion of related work.
1.1 Markov Decision Processes
Reinforcement learning [cf. (Puterman, 2005)] is a discrete-time sequential decision-making problem, in which we wish to act so as to maximise the expected sum of discounted future rewards $\mathbb{E}\left[\sum_{k=1}^{T} \gamma^{k} r_{t+k}\right]$, where $r_t \in \mathbb{R}$ is a stochastic reward at time t. We are only interested in rewards from time t to $T > 0$, and $\gamma \in [0,1]$ plays the role of a discount factor. Typically, we assume that γ and T are known (or have a known prior distribution) and that the sequence of rewards arises from a Markov decision process µ:
Definition 1 (MDP). A Markov decision process is a discrete-time stochastic process with: a state $s_t \in S$ at time t and a reward $r_t \in \mathbb{R}$, generated by the process µ, and an action $a_t \in A$, chosen by the decision maker. We denote the distribution over next states $s_{t+1}$, which only depends on $s_t$ and $a_t$, by $\mu(s_{t+1} \mid s_t, a_t)$. Furthermore, $\mu(r_{t+1} \mid s_t, a_t)$ is a reward distribution conditioned on states and actions. Finally, $\mu(r_{t+1}, s_{t+1} \mid s_t, a_t) = \mu(r_{t+1} \mid s_t, a_t)\, \mu(s_{t+1} \mid s_t, a_t)$.
In the above, and throughout the text, we usually take µ(·) to mean $\mathbb{P}_\mu(\cdot)$, the distribution under the process µ, for compactness. Frequently such a notation will imply a marginalisation. For example, we shall write $\mu(s_{t+k} \mid s_t, a_t)$ to mean $\sum_{s_{t+1}, \ldots, s_{t+k-1}} \mu(s_{t+k}, \ldots, s_{t+1} \mid s_t, a_t)$. The decision maker takes actions according to a policy π, which
defines a distribution $\pi(a_t \mid s_t)$ over A, conditioned on the state $s_t$. A policy π is stationary if $\pi(a_t = a \mid s_t = s) = \pi(a_{t'} = a \mid s_{t'} = s)$ for all $t, t'$. The expected utility of a policy π selecting actions in the MDP µ, from time t to T, can be written as the value function:
$V^{\pi,\mu}_{t,T}(s) = \mathbb{E}^{\pi,\mu}\!\left(\sum_{k=1}^{T} \gamma^{k} r_{t+k} \,\Big|\, s_t\right)$,   (1)
where $\mathbb{E}^{\pi,\mu}$ denotes the expectation under the Markov chain arising from following policy π in the MDP µ. Whenever it is clear from context, superscripts and subscripts shall be omitted for brevity. The optimal value function will be denoted by $V^* \triangleq \max_\pi V^\pi$. If the MDP is known, we can evaluate the optimal value function, and hence an optimal policy, in time polynomial in the sizes of the state and action sets (Puterman, 2005) via backwards induction (value iteration).
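For concreteness, the following is a minimal sketch of finite-horizon backwards induction for a known, finite MDP. The tabular array layout, the function name and the stage-dependent greedy policy it returns are illustrative choices, not taken from the paper.

import numpy as np

def backwards_induction(P, R, horizon, gamma):
    """Finite-horizon value iteration (backwards induction) for a known MDP.

    P: array of shape (A, S, S), with P[a, s, s'] the transition probability.
    R: array of shape (A, S), expected immediate reward of action a in state s.
    Returns the optimal value function and a greedy decision rule per stage.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                           # values at the terminal stage
    policy = np.zeros((horizon, n_states), dtype=int)
    for k in reversed(range(horizon)):
        Q = R + gamma * (P @ V)                      # Q[a, s] = reward + discounted next value
        policy[k] = Q.argmax(axis=0)
        V = Q.max(axis=0)
    return V, policy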
1.2 Bayesian Reinforcement Learning
If the MDP is unknown, we may use a Bayesian
framework to represent our uncertainty (Duff, 2002).
This requires maintaining a belief $\xi_t$ about which MDP $\mu \in M$ corresponds to reality. More precisely, we define a measurable space $(M, \mathfrak{M})$, where $M$ is a (usually uncountable) set of MDPs and $\mathfrak{M}$ is a suitable σ-algebra. With an appropriate initial density $\xi_0(\mu)$, we can obtain a sequence of densities $\xi_t(\mu)$, representing our subjective belief at time t, by conditioning $\xi_t(\mu)$ on the latest observations:
$\xi_{t+1}(\mu) \triangleq \dfrac{\mu(r_{t+1}, s_{t+1} \mid s_t, a_t)\, \xi_t(\mu)}{\int_{M} \mu'(r_{t+1}, s_{t+1} \mid s_t, a_t)\, \xi_t(\mu')\, d\mu'}$.   (2)
In the following, we write $\mathbb{E}_\xi$ to denote expectations with respect to any belief ξ.
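In the discrete case used later in the paper (Dirichlet priors over transitions and Beta priors over Bernoulli rewards, cf. footnote 2), the abstract update (2) reduces to conjugate count updates. The sketch below illustrates this; the class name and interface are hypothetical.

import numpy as np

class DirichletBetaBelief:
    """Product-of-Dirichlets belief over a discrete MDP with Bernoulli rewards."""

    def __init__(self, n_states, n_actions, prior_count=0.5):
        self.psi = np.full((n_states, n_actions, n_states), prior_count)  # Dirichlet counts
        self.alpha = np.ones((n_states, n_actions))    # Beta "success" counts
        self.beta = np.ones((n_states, n_actions))     # Beta "failure" counts

    def update(self, s, a, r, s_next):
        """Conjugate form of (2): condition the belief on the observation (s, a, r, s')."""
        self.psi[s, a, s_next] += 1.0
        self.alpha[s, a] += r
        self.beta[s, a] += 1.0 - r

    def sample_mdp(self, rng):
        """Draw one MDP mu ~ xi: transition tensor (S, A, S) and mean rewards (S, A)."""
        n_s, n_a, _ = self.psi.shape
        P = np.zeros_like(self.psi)
        for s in range(n_s):
            for a in range(n_a):
                P[s, a] = rng.dirichlet(self.psi[s, a])
        return P, rng.beta(self.alpha, self.beta)

    def mean_mdp(self):
        """The mean MDP (expected transitions and rewards under the current belief)."""
        P = self.psi / self.psi.sum(axis=2, keepdims=True)
        return P, self.alpha / (self.alpha + self.beta)

A belief of this kind is updated after every observed transition; sample_mdp and mean_mdp provide the sampled MDPs and the mean MDP used for the bounds of Section 1.4.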
1.3 Belief-augmented MDPs
In order to optimally select actions in this framework,
it is necessary to explicitly take into account future
changes in the belief when planning (Duff, 2002).
The idea is to combine the original MDP's state $s_t$ and our belief state $\xi_t$ into a hyper-state.
Definition 2 (BAMDP). A Belief-Augmented MDP ν (BAMDP) is an MDP with a set of hyper-states $\Omega = S \times B$, where $B$ is an appropriate set of probability measures on $M$, and $S, A$ are the state and action sets of all $\mu \in M$. At time t, the agent observes the hyper-state $\omega_t = (s_t, \xi_t)$ and takes action $a_t \in A$. We write the transition distribution as $\nu(\omega_{t+1} \mid \omega_t, a_t)$ and the reward distribution as $\nu(r_t \mid \omega_t)$.
The hyper-state $\omega_t$ has the Markov property. This allows us to treat the BAMDP as an infinite-state MDP with transitions $\nu(\omega_{t+1} \mid \omega_t, a_t)$ and rewards $\nu(r_t \mid \omega_t)$.¹ When the horizon T is finite, we need only expand the tree to depth $T - t$. Thus, backwards induction, starting from the set of terminal hyper-states $\Omega_T$ and proceeding backwards through $T-1, \ldots, t$, provides a solution:
$V^*_n(\omega) = \max_{a \in A}\left[\mathbb{E}_\nu(r \mid \omega) + \gamma \sum_{\omega' \in \Omega_{n+1}} \nu(\omega' \mid \omega, a)\, V^*_{n+1}(\omega')\right]$,   (3)
where $\Omega_n$ is the set of hyper-states at time n. We can approximately solve infinite-horizon problems by expanding the tree to some finite depth, provided we have bounds on the value of the leaf nodes.
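The backup (3) over a partially expanded tree can be sketched as follows. The node representation (children grouped by action into lists of probability-child pairs, an expected reward per action, and a stored bound at unexpanded nodes) is an assumed data structure, not one prescribed by the paper.

def backup(node, gamma):
    """Backwards induction (3) on a partially expanded belief tree.

    Assumes node.children maps each action a to a list of (probability, child)
    pairs, node.reward[a] is the expected immediate reward of a, and
    node.leaf_value holds a bound (v_U or v_L) used at unexpanded nodes.
    """
    if not node.children:              # unexpanded node: fall back to its stored bound
        return node.leaf_value
    return max(
        node.reward[a] + gamma * sum(p * backup(child, gamma) for p, child in outcomes)
        for a, outcomes in node.children.items()
    )

Using v_U at the leaves this returns an upper bound on V*(node); using v_L it returns a lower bound, as discussed in Section 1.4 below.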
1.4 Bounds on the Value Function
We shall relate the optimal value function of the BAMDP, $V^*(\omega)$, for some $\omega = (s, \xi)$, to the value functions $V^\pi_\mu$ of MDPs $\mu \in M$ for some π. The optimal policy for µ is denoted as $\pi^*(\mu)$. The mean MDP resulting from belief ξ is denoted as $\bar\mu_\xi$ and has the properties: $\bar\mu_\xi(s_{t+1} \mid s_t, a_t) = \mathbb{E}_\xi[\mu(s_{t+1} \mid s_t, a_t)]$ and $\bar\mu_\xi(r_{t+1} \mid s_t, a_t) = \mathbb{E}_\xi[\mu(r_{t+1} \mid s_t, a_t)]$.
Proposition 1 (Dimitrakakis, 2008). For any $\omega = (s, \xi)$, the BAMDP value function $V^*$ obeys:
$\int_M V^{\pi^*(\mu)}_{\mu}(s)\, \xi(\mu)\, d\mu \;\geq\; V^*(\omega) \;\geq\; \int_M V^{\pi^*(\bar\mu_\xi)}_{\mu}(s)\, \xi(\mu)\, d\mu$.   (4)
Proof. By definition, $V^*(\omega) \geq V^\pi(\omega)$ for all ω, for any policy π. It is easy to see that the lower bound equals $V^{\pi^*(\bar\mu_\xi)}(\omega)$, thus proving the right hand side. The upper bound follows from the fact that for any function f, $\max_x \int f(x, u)\, du \leq \int \max_x f(x, u)\, du$.
If M is not finite, then we cannot calculate the upper bound of $V^*(\omega)$ in closed form. However, we can use Monte Carlo sampling: given a hyper-state $\omega = (s, \xi)$, we draw m MDPs from its belief ξ: $\mu_1, \ldots, \mu_m \sim \xi$,² estimate the value function for each $\mu_k$, $\tilde v^{\omega}_{U,k} \triangleq V^{\pi^*(\mu_k)}_{\mu_k}(s)$, and average the samples: $\hat v^{\omega}_{U,m} \triangleq \frac{1}{m} \sum_{k=1}^{m} \tilde v^{\omega}_{U,k}$. Let $v^{\omega}_U \triangleq \int_M \xi_\omega(\mu)\, V^*_\mu(s_\omega)\, d\mu$. Then $\lim_{m \to \infty} \hat v^{\omega}_{U,m} = v^{\omega}_U$ almost surely and $\mathbb{E}[\hat v^{\omega}_{U,m}] = v^{\omega}_U$.

¹Because of the way that the BAMDP ν is constructed from beliefs over M, the next reward now depends on the next state rather than the current state and action.
²In the discrete case, we sample a multinomial distribution from each of the Dirichlet densities independently for the transitions. For the rewards, we draw independent Bernoulli distributions from the Beta of each state-action pair.
Lower bounds can be calculated via a similar procedure. We begin by calculating the optimal policy $\pi^*(\bar\mu_\xi)$ for the mean MDP $\bar\mu_\xi$ arising from ξ. We then compute $\tilde v^{\omega}_{L,k} \triangleq V^{\pi^*(\bar\mu_\xi)}_{\mu_k}$, the value of that policy for each sample $\mu_k$, and estimate $\hat v^{\omega}_{L,m} \triangleq \frac{1}{m} \sum_{k=1}^{m} \tilde v^{\omega}_{L,k}$.
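As a concrete sketch, the two estimates can be computed as follows. This reuses the hypothetical DirichletBetaBelief and backwards_induction sketches given earlier; all names, array conventions and the use of the mean-MDP policy's first-stage decision rule are assumptions made for illustration.

import numpy as np

def policy_value(P, R, pi, horizon, gamma):
    """Finite-horizon value of a fixed decision rule pi in the MDP (P, R).

    P has shape (S, A, S), R has shape (S, A) and pi has shape (S,).
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(horizon):
        V = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(n_states)])
    return V

def bound_estimates(belief, state, m, horizon, gamma, rng):
    """Monte Carlo estimates of the upper and lower value bounds at (state, belief)."""
    def solve(P, R):
        # backwards_induction (sketched earlier) expects (A, S, ...) ordering
        return backwards_induction(P.transpose(1, 0, 2), R.T, horizon, gamma)

    # Lower bound: the value, in each sampled MDP, of the policy that is
    # optimal for the mean MDP (here approximated by its first-stage rule).
    P_bar, R_bar = belief.mean_mdp()
    _, mean_policy = solve(P_bar, R_bar)
    pi = mean_policy[0]

    v_upper, v_lower = [], []
    for _ in range(m):
        P, R = belief.sample_mdp(rng)
        V_opt, _ = solve(P, R)                  # optimal value in the sampled MDP
        v_upper.append(V_opt[state])
        v_lower.append(policy_value(P, R, pi, horizon, gamma)[state])
    return float(np.mean(v_upper)), float(np.mean(v_lower))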
2 COMPLEXITY OF BELIEF
TREE SEARCH
We now present our main results. Detailed proofs are
available in an accompanying technical report (Dim-
itrakakis, 2009). We search trees which arise in the
context of planning under uncertainty in MDPs us-
ing the BAMDP framework. We can use value func-
tion bounds on the leaf nodes of a partially expanded
BAMDP tree to obtain bounds for the inner nodes
through backwards induction. The bounds can be
used both for action selection and for further tree
expansion. However, the bounds are estimated via Monte Carlo sampling, which necessitates the use of stochastic branch and bound techniques to expand the tree.
We analyse a set of such algorithms. The first is
a search to a fixed depth that employs exact lower
bounds. We then show that if only stochastic bounds
are available, the complexity of fixed depth search
only increases logarithmically. We then present two
stochastic branch and bound algorithms, whose com-
plexity is dependent on the number of near-optimal
branches. The first of these uses bound samples at leaf nodes only, while the second also uses samples obtained at the deeper half of each leaf node's ancestors, thus using the collected samples more efficiently.
2.1 Assumptions and Notation
We present the main assumptions concerning the tree search, pointing out the relations to Bayesian RL. The symbols V and v have been overloaded to make this correspondence more apparent. The tree has a branching factor of at most φ. The branching is due to both action choices and random outcomes (see Fig. 1). Thus, the nodes at depth k correspond to the set of hyper-states $\{\omega_{t+k}\}$ in the BAMDP. Abusing notation, we may also refer to the components of each node $\omega = (s, \xi)$ as $s(\omega), \xi(\omega)$.
We define a branch b as a set of policies (i.e. the set of all policies starting with a particular action). The value of a branch b is $V^b \triangleq \max_{\pi \in b} V^\pi$. The root branch is the set of all policies, with value $V^*$.
Figure 1: A belief tree, where the rewards are ignored for simplicity, with actions $A = \{a_1, a_2\}$ and states $S = \{s_1, s_2\}$.
A hyper-state ω is b-reachable if $\exists \pi \in b$ s.t. $\mathbb{P}^{\pi,\nu}(\omega \mid \omega_t) > 0$. Any branch b can be partitioned at any b-reachable ω into a set of branches $B(b, \omega)$. A possible partition is any $b_i = \{\pi \in b : i = \operatorname{argmax}_a \pi(a \mid \omega)\}$ for any $b_i \in B(b, \omega)$. We simplify this by considering only deterministic policies. We denote the k-horizon value function by $V^b(k) \triangleq \max_{\pi \in b} V^\pi_{t,k}(\omega_t)$. For each tree node $\omega = (s, \xi)$, we define upper and lower bounds $v_U(\omega) \triangleq \mathbb{E}_\xi[V^*_\mu(s)]$ and $v_L(\omega) \triangleq \mathbb{E}_\xi[V^{\pi^*(\bar\mu_\xi)}_\mu(s)]$, from (4). By fully expanding the tree to depth k and performing backwards induction (3), using either $v_U$ or $v_L$ as the value of leaf nodes, we obtain respectively upper and lower bounds $V^b_U(k)$, $V^b_L(k)$ on the value of any branch. Finally, we use $C(\omega)$ for the set of immediate children of a node ω and the short-hand $\Omega_k$ for $C^k(\omega)$, the set of all children of ω at depth k. We assume the following:
Assumption 1 (Uniform linear convergence). There exist $\gamma \in (0, 1)$ and $\beta > 0$ s.t. for any branch b and depth k, $V^b - V^b_L(k) \leq \beta\gamma^k$ and $V^b_U(k) - V^b \leq \beta\gamma^k$.
Remark 1. For BAMDPs with $r_t \in [0, 1]$ and $\gamma < 1$, Ass. 1 holds, from boundedness and the geometric series, with $\beta = 1/(1-\gamma)$, since $V^b_L(k)$ and $V^b_U(k)$ are the k-horizon value functions with the value of leaf nodes bounded in $[0, 1/(1-\gamma)]$.
We analyse algorithms which search the tree and then select an (action) branch $\hat b^*$. For each algorithm, we examine the number of leaf node evaluations required to bound the regret $V^* - V^{\hat b^*}$.
2.2 Flat Search
With exact bounds, we can expand all branches to a fixed depth and then select the branch $\hat b^*$ with the highest lower bound. This is Alg. 1, with complexity given by the following lemma.
Algorithm 1. Flat oracle search.
1: Expand all branches until depth $k = \log_\gamma(\varepsilon/\beta)$ or $\hat\Delta_L > \beta\gamma^k - \varepsilon$.
2: Select the root branch $\hat b^* = \operatorname{argmax}_b V^b_L(k)$.
Algorithm 2. Flat stochastic search.
1: procedure FSSEARCH($\omega_t$, k, m)
2:   Let $\Omega_k = \{\omega^i_{t+k} : i = 1, \ldots, \phi^k\}$ be the set of all k-step children of $\omega_t$
3:   for $\omega \in \Omega_k$ do
4:     Draw m samples $\tilde v^{\omega}_{L,j} = V^{\pi}_{\mu}$, with $\mu \sim \xi(\omega)$
5:     $\hat v^{\omega}_L = \frac{1}{m} \sum_{j=1}^{m} \tilde v^{\omega}_{L,j}$
6:   end for
7:   Calculate $\hat V^b$, return $\hat b^* = \operatorname{argmax}_b \hat V^b$.
8: end procedure
Lemma 1. Alg. 1, on a tree with branching factor φ and $\gamma \in (0, 1)$, samples $O\!\left(\phi^{1 + \log_\gamma(\varepsilon/\beta)}\right)$ times to bound the regret by ε.
Proof. Bound the k-horizon value function error with Ass. 1 and note that there are $\phi^{k+1}$ leaves.
In our case, we only have a stochastic lower bound on
the value of each node. Algorithm 2 expands the tree
to a fixed depth and then takes multiple samples from
each leaf node.
Lemma 2. Calling Alg. 2 with $k = \lceil \log_\gamma(\varepsilon/2\beta) \rceil$ and $m = 2\lceil \log_\gamma(\varepsilon/2\beta) \rceil \cdot \log \phi$, we bound the regret by ε using
$O\!\left(\phi^{1 + \log_\gamma(\varepsilon/2\beta)}\, \log_\gamma(\varepsilon/2\beta) \cdot \log \phi\right)$
samples.
Proof. The regret is now due to both the limited depth and the stochasticity. We bound each by ε/2, the first via Lem. 1 and the second via Hoeffding's inequality.
Thus, stochasticity mainly adds a logarithmic factor
to the oracle search. We now consider two algo-
rithms which do not search to a fixed depth, but select
branches to deepen adaptively.
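As a rough illustration of how the counts in Lemmas 1 and 2 scale, the short script below evaluates their leading terms for some example values of φ, γ, β and ε; constants are ignored and the numbers are purely illustrative, not taken from the paper.

import math

def flat_search_counts(phi, gamma, beta, eps):
    """Leading-order sample counts for Alg. 1 (Lemma 1) and Alg. 2 (Lemma 2)."""
    k_oracle = math.ceil(math.log(eps / beta, gamma))         # depth used by Lemma 1
    k_stoch = math.ceil(math.log(eps / (2 * beta), gamma))    # depth used by Lemma 2
    m = 2 * k_stoch * math.log(phi)                           # samples per leaf node
    return phi ** (k_oracle + 1), phi ** (k_stoch + 1) * m

if __name__ == "__main__":
    oracle, stochastic = flat_search_counts(phi=4, gamma=0.9, beta=10.0, eps=0.5)
    print(f"oracle leaf evaluations: {oracle:.2e}, stochastic samples: {stochastic:.2e}")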
2.3 Stochastic Branch and Bound 1
A stochastic branch and bound algorithm similar to those examined here was originally developed in (Norkin et al., 1998) for optimisation problems. At each stage, it takes an additional sample at each leaf node to improve its upper bound estimate, and then expands the node with the highest mean upper bound. Algorithm 3 uses the same basic idea, averaging the value function samples at every leaf node.
In order to bound complexity, we need to bound
the time required until we discover a nearly optimal
branch. We calculate the number of times a subop-
timal branch is expanded before its suboptimality is
discovered. Similarly, we calculate the number of
times we shall sample the optimal node until its mean
upper bound becomes dominant. These two results
cover the time spent sampling upper bounds of nodes
in the optimal branch without expanding them and the
time spent expanding nodes in a sub-optimal branch.
Algorithm 3. Stochastic branch and bound 1.
Let $L_0$ be the root.
for n = 1, 2, ... do
  for $\omega \in L_n$ do
    $m_\omega \leftarrow m_\omega + 1$,  $\mu \sim \xi(\omega)$,  $\tilde v^{\omega}_{m_\omega} = V^*_\mu(s(\omega))$
    $\hat v^{\omega}_U = \frac{1}{m_\omega} \sum_{i=1}^{m_\omega} \tilde v^{\omega}_i$
  end for
  $\hat\omega_n = \operatorname{argmax}_\omega \hat v^{\omega}_U$
  $L_{n+1} = C(\hat\omega_n) \cup L_n \setminus \hat\omega_n$
end for
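A Python sketch of Alg. 3 is given below. The sample_upper and expand callables stand in for the upper-bound sampling and node expansion machinery of Section 1.4, and the whole interface is an assumption made for illustration.

import numpy as np

def stochastic_branch_and_bound_1(root, sample_upper, expand, n_iterations, rng):
    """Sketch of Alg. 3: sample every current leaf once per iteration, then
    expand the leaf with the highest mean sampled upper bound."""
    leaves = {root: []}                           # leaf node -> list of upper-bound samples
    for _ in range(n_iterations):
        for node, samples in leaves.items():
            samples.append(sample_upper(node, rng))   # one more sample of V*_mu(s(node))
        best = max(leaves, key=lambda node: np.mean(leaves[node]))
        for child in expand(best):                # replace the chosen leaf by its children
            leaves[child] = []
        del leaves[best]
    return leaves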
Lemma 3. If N is the (random) number of samples $\tilde v_i$ from a random variable $V \in [0, \beta]$ that we must take until the empirical mean $\hat V_k \triangleq \frac{1}{k}\sum_{i=1}^{k} \tilde v_i > \mathbb{E}V - \Delta$, then:
$\mathbb{E}[N] \leq 1 + \frac{\beta^2}{2\Delta^2}$   (5)
$\mathbb{P}[N > n] \leq \exp\!\left(-2\beta^{-2} n \Delta^2\right)$.   (6)
Proof. The first inequality follows from the Hoeffding inequality and an integral bound on the resulting sum, while the second inequality is proven directly via a Hoeffding bound.
By setting ∆ to be the value difference between the optimal and the second-best branch, we can use the above lemma to bound the number of times N that the leaf nodes in the optimal branch will be sampled without being expanded. The converse problem is bounding the number of times that a suboptimal branch will be expanded.
Lemma 4. If b is a branch with $V^b = V^* - \Delta$, then it will be expanded at least to depth $k_0 = \log_\gamma(\Delta/\beta)$. Subsequently,
$\mathbb{P}(K > k) < O\!\left(\exp\!\left(-2\beta^{-2}(k - k_0)\Delta^2\right)\right)$.   (7)
Proof. In the worst case, the branch is degenerate and only one leaf has non-zero probability. We then apply a Hoeffding bound to obtain the desired result.
Algorithm 4. Stochastic branch and bound 2.
for $\omega \in L_n$ do
  $\hat V^{\omega}_U = \frac{1}{\sum_{\omega' \in C(\omega)} m_{\omega'}} \sum_{\omega' \in C(\omega)} \sum_{i=1}^{m_{\omega'}} \tilde v^{\omega'}_i$
end for
Use (3) to obtain $\hat V_U$ for all nodes.
Set $\omega_0$ to the root.
for d = 1, ... do
  $a_d = \operatorname{argmax}_a \sum_{j} \omega_{d-1}(j \mid a)\, \hat V_U(\omega_j)$,  $\omega_d \sim \omega_{d-1}(\cdot \mid a_d)$
  if $\omega_d \in L_n$ then
    $L_{n+1} = C(\omega_d) \cup L_n \setminus \omega_d$
    Break
  end if
end for
2.4 Stochastic Branch and Bound 2
The degenerate case discussed in the proof of Lemma 4 is the main problem with Alg. 3. Alg. 4 not only propagates upper bounds from multiple leaf nodes to the root, but also re-uses upper bound samples from inner nodes, in order to handle the degenerate case where only one path has non-zero probability. (Nevertheless, Lemma 3 applies without modification to Alg. 4.) Because we are no longer operating on leaf nodes only, we can take advantage of the upper bound samples collected along a given trajectory. However, if we were to use all of the upper bounds along a branch, the early samples could bias our estimates considerably. For this reason, if a leaf is at depth k, we only average the upper bounds along the branch back to depth k/2. The complexity of this approach is given by the following lemma:
Lemma 5. If b is s.t. $V^b = V^* - \Delta$, it will be expanded to depth $k_0 > \log_\gamma(\Delta/\beta)$ and
$\mathbb{P}(K > k) \lesssim \exp\!\left(-2(k - k_0)^2 (1 - \gamma^2)\right)$,   $k > k_0$.
Proof. There is a degenerate case where only one sub-branch has non-zero probability. However, we now re-use the samples that were obtained at previous expansions, thus allowing us to upper bound the bias by $\frac{1 - \gamma^{k+1}}{(k - k_0)(1 - \gamma)}$. This allows us to use a tighter Hoeffding bound and so obtain the desired outcome.
This bound decreases faster with k. Furthermore, there is no dependence on ∆ after the initial transitory period, which may however be very long. The
gain is due to the fact that we are re-using the upper
bounds previously obtained in inner nodes. Thus, this
algorithm should be particularly suitable for stochas-
tic problems.
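The sample re-use at inner nodes described above can be sketched as follows, assuming the upper-bound samples collected at each node on the path from the root to a leaf are kept per depth; the function name and data layout are illustrative.

import numpy as np

def reused_upper_bound(path_samples):
    """Mean upper bound for a leaf at depth k, re-using inner-node samples.

    path_samples[d] holds the upper-bound samples collected at the depth-d node
    on the path from the root (index 0) to the leaf (index k).  Only the deeper
    half of the path is averaged, so that the early, strongly biased samples
    are discarded, as described in Section 2.4.
    """
    k = len(path_samples) - 1
    pooled = [v for depth in range(k // 2, k + 1) for v in path_samples[depth]]
    return float(np.mean(pooled))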
2.5 Lower Bounds for Bayesian RL
We can reduce the branching factor φ (which is $|A \times S \times R|$ for a full search) to $O\{|A| \exp[1/(1 - \gamma)]\}$ by employing sparse sampling methods (Kearns et al., 1999). This was essentially the approach employed in (Wang et al., 2005). However, our main focus here is to reduce the depth to which each branch is searched.
The main problem with the above algorithms is the fact that we must reach $k_0 = \lceil \log_\gamma \Delta \rceil$ to discard ∆-optimal branches. However, since the hyper-state $\omega_t$ arises from a Bayesian belief, we can use an additional smoothness property:
Lemma 6. The Dirichlet parameter sequence $\psi_t / n_t$, with $n_t \triangleq \sum_{i=1}^{K} \psi^i_t$, is a c-Lipschitz martingale with $c_t = 1/[2(n_t + 1)]$.
Proof. Simple calculations show that, no matter what is observed, $\mathbb{E}_{\xi_t}(\psi_{t+1}/n_{t+1}) = \psi_t/n_t$. Then, we bound the difference $|\psi_{t+k}/n_{t+k} - \psi_t/n_t|$ by two different bounds, which we equate to obtain $c_t$.
Lemma 7. If $\mu, \hat\mu$ are such that $\|T - \hat T\| \leq \varepsilon$ and $\|r - \hat r\| \leq \varepsilon$, for some $\varepsilon > 0$, then $\|V^\pi - \hat V^\pi\| \leq \frac{\varepsilon}{(1-\gamma)^2}$, for any policy π.
Proof. By subtracting the Bellman equations for $V, \hat V$ and taking the norm, we can repeatedly apply the Cauchy-Schwarz and triangle inequalities to obtain the desired result.
The above results help us obtain better lower bounds in two ways. First, we note that initially $1/k$ converges faster than $\gamma^k$ for large γ, so we should be able to expand less deeply. Later, $n_t$ is large, so we can sample even more sparsely.
If we search to depth k and the rewards are in [0, 1], then, naively, our error is bounded by $\sum_{n=k}^{\infty} \gamma^n = \gamma^k/(1 - \gamma)$. However, the mean MDPs for $n > k$ are close to the mean MDP at k, due to Lem. 6. This means that β can be significantly smaller than $1/(1 - \gamma)$. In fact, the total error is bounded by $\sum_{n=k}^{\infty} \gamma^n (n - k)/n$. For undiscounted problems, our error is bounded by $T - k$ in the original case and by $T - k[1 + \log(T/k)]$ when taking into account the smoothness.
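The following short script compares the naive tail bound $\gamma^k/(1-\gamma)$ with the smoothness-aware bound above for a few depths; the chosen discount factor and depths are only examples.

import numpy as np

def naive_bound(gamma, k):
    """Naive tail error: sum_{n >= k} gamma^n = gamma^k / (1 - gamma)."""
    return gamma ** k / (1.0 - gamma)

def smooth_bound(gamma, k, n_terms=100000):
    """Smoothness-aware tail error: sum_{n >= k} gamma^n (n - k) / n, truncated."""
    n = np.arange(max(k, 1), k + n_terms, dtype=float)
    return float(np.sum(gamma ** n * (n - k) / n))

if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, naive_bound(0.99, k), smooth_bound(0.99, k))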
3 CONCLUSIONS AND RELATED
WORK
Much recent work on Bayesian RL has focused on myopic estimates or on full expansion of the belief tree
up to a certain depth. Exceptions include (Poupart et al., 2006), which uses an analytical bound based on sampling a small set of beliefs, and (Wang et al., 2005), which uses the sparse sampling algorithm of Kearns et al. (1999) to expand the tree. Both methods have complexity exponential in the horizon, which we improve upon via the smoothness properties induced by the Bayesian updating.
There are also connections with work on POMDP problems (Ross et al., 2008). However, this setting, though equivalent in an abstract sense, is not sufficiently close to the one we consider. Results on bandit problems employing the same value function bounds used herein were reported in (Dimitrakakis, 2008), which experimentally compared algorithms operating on leaf nodes only.
Related results on the online sample complexity of Bayesian RL were developed in (Kolter and Ng, 2009), which employs a different upper bound from ours, and in (Asmuth et al., 2009), which employs MDP samples to plan in an augmented MDP space, similarly to (Auer et al., 2008) (who consider the set of plausible MDPs), and uses Bayesian concentration of measure results (Zhang, 2006) to prove mistake bounds on the online performance of the algorithm.
Interestingly, Alg. 4 resembles HOO (Bubeck
et al., 2008) in the way that it traverses the tree, with
two major differences. (a) The search is adapted to
stochastic trees. (b) We use means of samples of
upper bounds, rather than upper bounds on sample
means. For these reasons, we are unable to simply
restate the arguments in (Bubeck et al., 2008).
We presented complexity results and counting ar-
guments for a number of tree search algorithms on
trees where stochastic upper and lower bounds sat-
isfying a smoothness property exist. These are the
first results of this type and partially extend the re-
sults of (Norkin et al., 1998), which provided an
asymptotic convergence proof, under similar smooth-
ness conditions, for a stochastic branch and bound al-
gorithm. In addition, we introduce a mechanism to
utilise samples obtained at inner nodes when calcu-
lating mean upper bounds at leaf nodes. Finally, we
relate our complexity results to those of (Kearns et al.,
1999), for whose lower bound we provide a small
improvement. We plan to address the online sam-
ple complexity of the proposed algorithms, as well as
their practical performance, in future work.
ACKNOWLEDGEMENTS
This work was part of the ICIS project, supported
by the Dutch Ministry of Economic Affairs, grant nr:
BSIK03024. I would like to thank the anonymous reviewers, as well as colleagues at the University of Amsterdam, Leoben and TU Crete, for their comments on earlier versions of this paper.
REFERENCES
Asmuth, J., Li, L., Littman, M. L., Nouri, A., and Wingate,
D. (2009). A Bayesian sampling approach to explo-
ration in reinforcement learning. In UAI 2009.
Auer, P., Jaksch, T., and Ortner, R. (2008). Near-optimal
regret bounds for reinforcement learning. In Proceed-
ings of NIPS 2008.
Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C.
(2008). Online optimization in X-armed bandits. In
NIPS, pages 201–208.
Dimitrakakis, C. (2008). Tree exploration for Bayesian RL
exploration. In CIMCA’08, pages 1029–1034, Los
Alamitos, CA, USA. IEEE Computer Society.
Dimitrakakis, C. (2009). Complexity of stochastic branch
and bound for belief tree search in Bayesian rein-
forcement learning. Technical Report IAS-UVA-09-
01, University of Amsterdam.
Duff, M. O. (2002). Optimal Learning Computational
Procedures for Bayes-adaptive Markov Decision Pro-
cesses. PhD thesis, University of Massachusetts at
Amherst.
Kearns, M. J., Mansour, Y., and Ng, A. Y. (1999). A sparse
sampling algorithm for near-optimal planning in large
Markov decision processes. In Dean, T., editor, IJCAI,
pages 1324–1331. Morgan Kaufmann.
Kolter, J. Z. and Ng, A. Y. (2009). Near-Bayesian explo-
ration in polynomial time. In ICML 2009.
Norkin, V. I., Pflug, G. C., and Ruszczyński, A. (1998). A
branch and bound method for stochastic global op-
timization. Mathematical Programming, 83(1):425–
450.
Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An
analytic solution to discrete Bayesian reinforcement
learning. In ICML 2006, pages 697–704. ACM Press
New York, NY, USA.
Puterman, M. L. (1994, 2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John
Wiley & Sons, New Jersey, US.
Ross, S., Pineau, J., Paquet, S., and Chaib-draa, B. (2008).
Online planning algorithms for POMDPs. Journal of
Artificial Intelligence Research, 32:663–704.
Wang, T., Lizotte, D., Bowling, M., and Schuurmans, D.
(2005). Bayesian sparse sampling for on-line reward
optimization. In ICML ’05, pages 956–963, New
York, NY, USA. ACM.
Zhang, T. (2006). From ε-entropy to KL-entropy: Analysis
of minimum information complexity density estima-
tion. Annals of Statistics, 34(5):2180–2210.