5 CONCLUDING REMARKS
I presented a novel online POMDP algorithm which
performs at least twice as well as the other algorithms
on a particular grid-world problem. The basic
algorithm with mean-as-threshold belief-state compression
always collected the most items (ic). However,
because it takes more than twice (for h = 3) or
three times (for h = 4) as long as OUCEF to select
the next action, its effectiveness is significantly below
OUCEF's (w.r.t. ic/s). OUCEF's nominal performance
(ic) is comparable with that of the other algorithms
over the four experiment parameter combinations.
The effectiveness of the OUCEF algorithm is due
to (i) unifying the branches arising from nondeterministic
observations by collecting all belief-nodes at the ends
of these branches into one set B, and then (ii) selecting
the state most representative of B by calculating the
expected values of the features of the states in B.
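The two-step condensation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each belief node is a probability vector over states and each state is described by a numeric feature vector; the function and variable names are hypothetical.

```python
import numpy as np

def condense(belief_nodes, state_features):
    """Condense a set B of belief nodes into one representative state.

    belief_nodes:   list of probability vectors over states, one per
                    branch end (the set B from step (i))
    state_features: (num_states, num_features) array of feature vectors
    Returns the index of the state whose features lie closest to the
    expected feature values under the pooled belief (step (ii)).
    """
    # (i) unify the branches: pool all belief-nodes in B into one belief
    pooled = np.mean(belief_nodes, axis=0)
    # (ii) expected value of each feature under the pooled belief
    expected = pooled @ state_features          # shape: (num_features,)
    # select the state most representative of B: nearest feature vector
    dists = np.linalg.norm(state_features - expected, axis=1)
    return int(np.argmin(dists))

# Illustrative usage with three states and two branch-end beliefs:
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = [np.array([0.8, 0.1, 0.1]), np.array([0.6, 0.2, 0.2])]
print(condense(B, features))  # state 0 is closest to the expected features
```

Collapsing the observation branches into a single representative state is what avoids expanding a separate subtree per observation, which is the source of OUCEF's per-action speed advantage noted above.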
The aspect of this work most in need of attention is
validating the approach on different benchmark problems.
It might be the case that the OUCEF algorithm
is well suited to the kind of grid-world problems presented
here, but to few other problems; or it might
be suited to many kinds of problems. This paper is,
however, a first step in introducing and testing the algorithm.
At the very least, the new ideas presented
here might lead other researchers to new insights for
their own online POMDP algorithms. A theoretical analysis
of the optimality of OUCEF is also required and
could lead to interesting insights.
ICAART 2015 - International Conference on Agents and Artificial Intelligence