Towards Domain-independent Biases for Action Selection in Robotic Task-planning under Uncertainty

Juan Carlos Saborío and Joachim Hertzberg¹,²

¹ Institute of Computer Science, University of Osnabrück, Wachsbleiche 27, Osnabrück, Germany

² DFKI Robotics Innovation Center (Osnabrück), Albert-Einstein-Straße 1, Osnabrück, Germany

Keywords:

Action Selection, Monte-Carlo Planning, Planning under Uncertainty.

Abstract:

Task-planning algorithms for robots must quickly select actions with high reward prospects despite the

huge variability of their domains, and accounting for the high cost of performing the wrong action in the

“real-world”. In response we propose an action selection method based on reward-shaping, for planning in

(PO)MDP’s, that adds an informed action-selection bias but depends almost exclusively on a clear speciﬁca-

tion of the goal. Combined with a derived rollout policy for MCTS planners, we show promising results in

relatively large domains of interest to robotics.

1 INTRODUCTION

Planning under uncertainty requires computing val-

ues for states and actions, that reﬂect a combination

of some form of utility or reward and their probabil-

ity. These planning problems are often modelled as

Markov Decision Processes (MDP’s) or Partially Ob-

servable MDP’s (POMDP’s), and solved using many

well known methods among which Monte-Carlo Tree

Search (MCTS) is a popular choice, especially in the

online planning community. The modern standard

for MCTS is UCT (Kocsis and Szepesvári, 2006),

which uses the UCB1 formula (Auer et al., 2002)

in search trees, thus guaranteeing asymptotic conver-

gence and solutions that minimize regret. Recently

POMCP (Silver and Veness, 2010), an extension of

UCT for partially observable domains, has become

somewhat of a reference point for POMDP planning

and contributed to the consolidation of Monte-Carlo

algorithms as a more efﬁcient alternative to traditional

POMDP solvers.
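As a minimal sketch of the UCB1 rule mentioned above, the following function scores each action by its mean return plus an exploration term and tries unvisited actions first. The function name, the `stats` layout and the exploration constant `c` are illustrative assumptions, not taken from UCT or POMCP source code.

```python
import math


def ucb1_select(stats, c=1.0):
    """Select an action with the UCB1 rule.

    `stats` maps each action to a (visit_count, mean_return) pair.
    `c` is an exploration constant (hypothetical default).
    """
    # Actions that were never tried are selected first.
    for action, (visits, _) in stats.items():
        if visits == 0:
            return action

    total_visits = sum(visits for visits, _ in stats.values())

    def score(action):
        visits, mean_return = stats[action]
        # Mean return plus an exploration bonus that shrinks with visits.
        return mean_return + c * math.sqrt(math.log(total_visits) / visits)

    return max(stats, key=score)
```

With `c = 0` the rule degenerates to greedy selection on the mean return; larger values favour rarely visited actions.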

Despite these advances, transferring these meth-

ods to online planning in robots and other simi-

lar agents is not easy. The onboard hardware of

robots imposes severe limitations on computational

resources, and the high variability and uncertainty

of domains of interest still render most of these al-

gorithms inadequate for fast, online planning. Plan-

ning in the “real-world” introduces additional con-

siderations that are often overlooked in maze prob-

lems and other game-like domains: not only is their

state-complexity different, but there is also a certain

amount of consistency and regularity that may be ex-

ploited.

We argue that several simpliﬁcations and assump-

tions are not only possible, but necessary, in order

to achieve satisﬁcing behavior in such domains. One

such simpliﬁcation is quickly focusing on promising

states, and avoiding less promising ones. We developed these ideas while trying to formalize the intuitive concept of “relevance”, which in planning terms may be seen as a reliable (albeit imperfect) attentional filter that guides action selection and offers an intuitive way of handling problems with high dimensionality. Planning

algorithms should be able to quickly identify promis-

ing (high expected value) states and focus on getting

there. State values represent a weighted average of

future rewards, so the problem reduces to quickly lo-

cating these sources of future rewards. One relatively

simple idea is preferring actions that lead to subgoals

(subsets of some terminal state) while avoiding those

that don’t, and encouraging these actions by provid-

ing additional, positive rewards. Achieving a subgoal

objectively brings the agent a step closer to achieving

a larger goal, and so we use this idea to formalize a

metric of state to goal proximity.

In this paper we propose partial goal satisfaction as a way to compute the proximity of states to goals and provide a reward bonus in action selection, which easily becomes an action selection policy for Monte-Carlo rollouts. This is by no means a complete solution to online planning onboard robots, but rather a contribution towards the improvement of action selection in planning algorithms, when information about the goal is available. The intended effect is planning more efficiently in domains with many states and actions.

Saborío, J. and Hertzberg, J.
Towards Domain-independent Biases for Action Selection in Robotic Task-planning under Uncertainty.
DOI: 10.5220/0006578500850093
In Proceedings of the 10th International Conference on Agents and Artificial Intelligence (ICAART 2018) - Volume 2, pages 85-93
ISBN: 978-989-758-275-2

Alternative, well-known approaches for address-

ing large planning spaces include value approxima-

tion and state aggregation, but these work under the

assumption that there are large groups of states that

can be clustered together (due to similarity or other

reasons) using ﬁxed criteria. At the moment we are

interested in how agents may use knowledge of their

goal(s) to improve their action selection criteria, in

particular by focusing on only a few good alternatives

when many options are available, as is the case of do-

mains with high variability and large branching factor.

In the following sections we discuss previous re-

lated work, and proceed to explain our proposal. We

provide two case studies and an analysis of exper-

imental results, as well as conclusions and future

work.

2 NOTATION

Unless otherwise specified, $S$ and $A$ are finite sets of states and actions respectively and $T(s,a,s')$ is the probability of reaching state $s'$ when executing $a$ in $s$, which yields a real-valued reward $R(s,a,s')$. An MDP is the tuple $\langle S,A,T,R \rangle$, with discount factor $\gamma$. Under partial observability, an agent receives an observation $\omega \in \Omega$ with probability $O(s,a,\omega) = p(\omega \mid s,a)$ and maintains an internal belief state $b \in B$, where $b(s)$ is the probability of $s$ being the current state. A POMDP is therefore $\langle S,A,T,R,\Omega,O \rangle$. The sequence $h_t = (a_0, \omega_1, \ldots, a_{t-1}, \omega_t)$ is the history at time $t$.

The common notation for planning in POMDP’s

uses belief states, or probability distributions over the

entire state space. That is, policies are given in terms

of beliefs. In this paper, however, we present our ac-

tion selection bias in terms of fully observable states

and states with mixed observability (states that con-

tain both fully and partially observable features). Our

experimental setup uses states sampled from a belief

state approximator. Future work includes extending

these deﬁnitions to more general cases, such as belief

states.

3 RELATED WORK

Our work centers on improving the performance of

action selection for (PO)MDP planning algorithms,

relying mostly on UCT (Kocsis and Szepesvári, 2006)

and its POMDP extension, POMCP (Silver and Ve-

ness, 2010). Unlike more traditional POMDP algo-

rithms that approximate the belief space using vector

sets, POMCP uses Monte-Carlo Tree Search (MCTS)

and approximates the current belief state using an un-

weighted particle ﬁlter. Because this is a key improve-

ment of the Monte-Carlo family of algorithms, this

paper focuses on states with partially observable ele-

ments instead of belief states. The concept of mixed

observability has produced positive results even out-

side of MCTS algorithms (Ong et al., 2010). Alter-

natively, MCTS algorithms have also been used for

belief selection in POMDP’s (Somani et al., 2013).
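The unweighted particle filter mentioned above can be sketched as a rejection-sampling update: particles are pushed through a generative model and kept only when the simulated observation matches the real one. This is an illustrative sketch under stated assumptions, not POMCP's actual code; `update_particles`, `noisy_check`, the `simulate` signature and the 0.8 sensor accuracy are all hypothetical.

```python
import random


def update_particles(particles, action, observation, simulate,
                     n_particles, rng=random, max_tries=100_000):
    """Approximate the next belief with an unweighted set of particles.

    `simulate(state, action, rng)` is an assumed generative model that
    returns a (next_state, observation) pair.
    """
    new_particles = []
    for _ in range(max_tries):
        if len(new_particles) >= n_particles:
            break
        state = rng.choice(particles)
        next_state, sampled_obs = simulate(state, action, rng)
        # Keep only successors that agree with the real observation.
        if sampled_obs == observation:
            new_particles.append(next_state)
    return new_particles


def noisy_check(rock_is_good, action, rng, accuracy=0.8):
    """Toy model: the state is one boolean, the sensor is right 80% of the time."""
    truthful = rng.random() < accuracy
    reading_good = rock_is_good if truthful else not rock_is_good
    return rock_is_good, ("good" if reading_good else "bad")
```

Starting from a uniform prior over a single rock's value, a "good" reading shifts most of the surviving particles towards states in which the rock is good, without ever computing an explicit belief update.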

Existing approaches addressing the state-space

complexity involve clustering states and generalizing

state or belief values (Pineau et al., 2003), (Pineau

et al., 2006), function approximation (Sutton and

Barto, 2012) and random forest model learning (Hes-

ter and Stone, 2013). Our main critique of these meth-

ods for online task planning is that they have ﬁxed

aggregation criteria that do not respond to the con-

nection between states and goals. For instance, states

could be clustered together depending on context so

their shared values reﬂect the agent’s current goals.

Other approaches generate abstractions for plan-

ning and learning over hierarchies of actions (Sut-

ton et al., 1999), (Dietterich, 2000). The drawback

is that relatively detailed, prior knowledge of the do-

main is required to manually create these hierarchies,

although recent work suggests a possible way to build

them automatically (Konidaris, 2016).

Planning algorithms for (PO)MDP’s overlap with

reinforcement learning (RL) methods, with the differ-

ence that in RL the agent must ﬁnd an optimal pol-

icy while discovering and learning the transition dy-

namics. Reward shaping is commonly used in RL to

improve performance by simply giving additional re-

wards for certain preferred actions. This generates an

MDP with a different reward distribution and there-

fore different convergence properties, but potential-

based reward shaping (PBRS) has been shown to pre-

serve policy optimality (Ng et al., 1999). A study of

PBRS in the context of online POMDP planning can

be found in (Eck et al., 2016).

Building on these arguments, the work presented

in this paper reﬂects our efforts to provide a general-

purpose, PBRS bias for action selection in planning

tasks of interest to robotics, in order to address the

complexity of planning in domains with large vari-

ability, where states or beliefs may not always be eas-

ily aggregated.

ICAART 2018 - 10th International Conference on Agents and Artiﬁcial Intelligence

4 MEASURING GOAL

PROXIMITY

The main challenge for efﬁcient planning is quickly

separating good or promising states from bad or un-

wanted states. Even domains without clearly speci-

ﬁed goals (eg. pure RL tasks) have terminal states

or conditions that establish what must be accom-

plished and what subgoals the agent should pursue.

In robotics, it is reasonable to assume planning agents

are somewhat informed and aware of at least part of

their goal(s). If no goals are known, the robot should

ﬁnd one. Any sufﬁciently detailed state description

(eg. features needed for planning) provides informa-

tion to compute, for any given state, some numerical

score representing how many features in the terminal

state have already been accomplished. The larger this

number is, the closer this state is to being a terminal or

a goal state. We call this idea partial goal satisfaction

(PGS), formalized in equation 1.

PGS is simple to implement in fully observable

domains, but due to the uncertainty of observations,

estimating this number might be challenging in par-

tially observable domains. Fully observable features

can be easily counted in meaningful ways (eq. 2).

For partially observable features, information gather-

ing actions should increase the probability that their

current, estimated value is correct, thus also affecting

the probability of an agent being in some given state

(b(s)). In other words, collecting information about

a given set of partially-observable features yields a

better estimate of the world’s current, true state. The

simplest, most general approach is therefore measuring some form of uncertainty or entropy and providing rewards as this uncertainty is reduced (eq. 3). Let $s \in S$ be a state, decomposed into countable discrete features $s_i$, $G^{+}$ be the set containing the observable features present in the goal, $G^{-}$ the set of observable restrictions, $\Delta(s)$ the set of states reachable from $s$ (similar to the transitive closure of $T(s,\cdot)$) and $G^{p}$ the set of partially or non-observable elements, then:

$$ pgs(s) = \sum_{s_i \notin G^{p}} v(s_i) + \sum_{s_j \in G^{p}} w(s_j) \tag{1} $$

where:

$$ v(s_i) = \begin{cases} 1 & \text{iff } s_i \in G^{+} \\ x \in (0,1) & \text{iff } \exists s' \in \Delta(s) \text{ s.t.: } s'_i \in s' \wedge s'_i \in G^{+} \\ 0 & \text{iff } s_i \notin \{G^{+} \cap G^{-}\} \\ -1 & \text{iff } s_i \in G^{-} \end{cases} \tag{2} $$

and

$$ w(s_j) = \begin{cases} 0 & \text{iff } H(s_j) \leq T_H \\ -1 & \text{otherwise} \end{cases} \tag{3} $$

This means the different features in each state are evaluated depending on whether they are partially observable ($s_j \in G^{p}$) or not. Positive, observable features add points and negative features deduct points. State changes that lead to a positive feature ($s'_i \in G^{+}$) in a future state ($s' \in \Delta(s)$) yield a fraction of a point

and help implicitly deﬁne subgoals (eg. interacting

with an object referenced in the goal, such as picking

up the coffee cup that goes on the table), and if no

relevant features are present no points are awarded or

taken. Partially observable features are scored based

on their entropy, punishing features or states with

high entropy. Whenever enough information is gathered and the entropy is reduced below some threshold $T_H$, this punishment is removed. This encourages the

agent to quickly get rid of this penalty by executing

a number of information gathering actions, which in

turn may lead to discovering new reward sources (eg.

interacting with relevant but previously unrecognized

elements). In principle any combination of the indi-

vidual elements in the goal may be considered a sub-

goal for scoring purposes, and only completing all of

them simultaneously yields the total, problem-deﬁned

terminal reward.
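The scoring scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: states are modelled as a set of observable features plus a dictionary of probabilities for uncertain binary features, and the fractional score for features that lead towards the goal is supplied in a hypothetical `partial` table. All names and the coffee-cup features are made up for the example.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a binary variable that is true with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def v(feature, goal_pos, goal_neg, partial):
    """Score one observable feature (cf. eq. 2)."""
    if feature in goal_pos:
        return 1.0
    if feature in goal_neg:
        return -1.0
    # Features known to lead to a goal feature earn a fraction of a point;
    # anything else is neutral.
    return partial.get(feature, 0.0)


def w(p, threshold=0.5):
    """Penalize an uncertain feature while its entropy is above the threshold (cf. eq. 3)."""
    return 0.0 if binary_entropy(p) <= threshold else -1.0


def pgs(observable, uncertain, goal_pos, goal_neg, partial, threshold=0.5):
    """Partial goal satisfaction of a state (cf. eq. 1)."""
    return (sum(v(f, goal_pos, goal_neg, partial) for f in observable)
            + sum(w(p, threshold) for p in uncertain.values()))


# Hypothetical coffee-delivery features, for illustration only.
GOAL_POS = {"cup_on_table"}
GOAL_NEG = {"cup_broken"}
PARTIAL = {"holding_cup": 0.25}
```

In this toy setting, a state where the robot holds the cup but is still maximally unsure about some uncertain feature scores lower than a state where the cup is on the table and the uncertainty has been resolved.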

PGS may be useful in different contexts, but it is

intended as an optimistic value initialization method

that allows an agent to exploit nearby opportunities if

available. Directly applying PGS on some planning

problems that appear relatively simple, such as some

particular blocks world conﬁgurations, may not yield

the desired results. As explained below, PGS is in-

tended to be used within the context of Monte-Carlo

or similar search algorithms, where the optimistic as-

sumptions of PGS will eventually be corrected (if

they’re wrong) and the problem solved properly.

4.1 PGS in Reward Shaping

Reward shaping, commonly used to improve the per-

formance of (PO)MDP algorithms and RL problems,

works by adding a small, additional reward to some

state transitions. These additional rewards often come

from an in-depth analysis of the structure of the prob-

lem and provide some form of heuristic bias in action

selection. The new reward structure deﬁnes a decision

process with additional reward sources, implicitly in-

troducing subgoals the agent may achieve in order to

get closer to the ﬁnal goal. In our case, instead of

providing explicit, domain dependent knowledge to

shape rewards, we use the PGS function to encourage

the agent to pursue courses of action leading to the

completion of subgoals.

Reward shaping substitutes the usual reward function in an MDP with:

$$ R(s,a,s') + F(s,a,s') \tag{4} $$

where $R$ is the problem-defined reward distribution and $F$ is a reward bonus. If $F$ has the form

$$ F(s,a,s') = \gamma \phi(s') - \phi(s) \tag{5} $$

then $\phi$ is a potential function and eq. 4 is potential-based. We now define $\phi(s)$ for PGS as

$$ \phi(s) = \alpha \, pgs(s) \tag{6} $$

where $\alpha$ is a scaling factor. Because most (PO)MDP algorithms already use $\gamma$ to refer to the discount factor, from now on we will refer to $\gamma_{PGS}$ when in the context of PBRS. In practice, transitions to states that

are closer to a subgoal (positive reward source) will

produce a positive difference, transitions to states that

are farther from subgoals generate a negative differ-

ence, and other transitions cancel each other out. Nor-

mally reward shaping functions are highly domain-

dependent and speciﬁc for particular problems. PGS

manages to attain simplicity and generality, possibly

requiring only minor details for implementation, but

with values coming from knowledge about goals al-

ways available to the agent or robot.
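A short sketch of eqs. 4 to 6 may help make the sign behaviour concrete. The function names and the sequence of PGS values are illustrative assumptions; only the formulas come from the text. With $\gamma_{PGS} = 1$ the bonuses along a trajectory telescope, so their sum depends only on the first and last potentials.

```python
def shaping_bonus(phi_s, phi_next, gamma_pgs=1.0):
    """Potential-based bonus F(s, a, s') = gamma * phi(s') - phi(s) (eq. 5)."""
    return gamma_pgs * phi_next - phi_s


def shaped_reward(reward, pgs_s, pgs_next, alpha=1.0, gamma_pgs=1.0):
    """Problem reward plus the PGS bonus, with phi(s) = alpha * pgs(s) (eqs. 4 and 6)."""
    return reward + shaping_bonus(alpha * pgs_s, alpha * pgs_next, gamma_pgs)


# Hypothetical PGS values along one trajectory: a subgoal is reached,
# kept, lost again, and finally the full goal is satisfied.
pgs_values = [0.0, 0.25, 0.25, 0.0, 1.0]
alpha = 10.0
bonuses = [shaping_bonus(alpha * a, alpha * b)
           for a, b in zip(pgs_values, pgs_values[1:])]
```

Here moving closer to a subgoal yields a positive bonus, staying put yields zero, and undoing the subgoal yields a negative bonus of the same magnitude.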

4.2 PGS as a Rollout Policy

The MCTS family of MDP and POMDP algorithms

works by sampling sequences of states from a prob-

abilistic transition model. A tree of states (or in the

case of POMCP a tree of histories) is progressively

expanded and the average returns and visit counts are

maintained per tree node. When enough statistics are

available (eg. all known successors of a state have

been visited) the UCB1 rule is used to select an action.

When a new state is discovered, a rollout or random

simulation is performed and its outcome used as an

initial value estimation. Rollout policies are therefore

largely responsible for the performance of MCTS on-

line planning algorithms. Using PGS as a rollout pol-

icy, the agent quickly focuses on actions that directly

contribute to the completion of (sub)goals and, like-

wise, avoids undesirable actions. Selecting actions

that maximize state-to-goal proximity can implicitly

summarize a very rich array of knowledge and heuris-

tics, that must otherwise be given explicitly. To the

best of our knowledge, the effect of evaluating goal

proximity within the context of Monte-Carlo rollouts

hasn’t yet been systematically studied.

Using PGS as a rollout policy (eq. 7) is very simple: given a state $s$ and available actions $a \in A$, select the action whose successor state $s' \leftarrow (s,a)$ satisfies the largest number of subgoals. Ties are broken randomly.

$$ A(s) = \arg\max_{a} \; pgs(s' \leftarrow (s,a)) \tag{7} $$

Because PGS is computed as a difference between the current and previous states (eq. 5), when $\gamma_{PGS} = 1$ only newly completed subgoals produce positive values. For example, imagine a robot tasked with collecting and delivering a cup of coffee: during planning, standing next to the cup offers the possibility

of picking it up, satisfying a subgoal that yields a re-

ward bonus, therefore becoming the preferred action

of the rollout policy. Once holding it, dropping the

cup in any place other than the correct location re-

verts this condition and produces a negative reward,

meaning it will never be chosen in a rollout (albeit

eventually during simulation, if all actions are system-

atically sampled). MCTS recommends an action only

after arbitrarily many simulations have been carried

out, but starting out with the (seemingly) right action

greatly improves performance. Unlike with PGS, im-

proved rollout policies often consist of manually de-

signed heuristics and explicit preferred actions.
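The rollout policy of eq. 7 can be sketched as follows, assuming a generative model `step(state, action, rng)` that samples a successor. The corridor world, its rewards and every function name are hypothetical and serve only to make the sketch runnable; the depth of 90 and discount of 0.95 mirror the values reported in the experiments.

```python
import random


def pgs_rollout_action(state, actions, step, pgs, rng=random):
    """Pick the action whose sampled successor has the highest PGS; ties are random."""
    scored = [(pgs(step(state, a, rng)), a) for a in actions]
    best = max(score for score, _ in scored)
    return rng.choice([a for score, a in scored if score == best])


def rollout(state, actions, step, reward, pgs, is_terminal,
            gamma=0.95, depth=90, rng=random):
    """Simulate one episode with the PGS policy and return its discounted reward."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        if is_terminal(state):
            break
        action = pgs_rollout_action(state, actions, step, pgs, rng)
        next_state = step(state, action, rng)
        total += discount * reward(state, action, next_state)
        discount *= gamma
        state = next_state
    return total


# Toy corridor with cells 0..3 and the goal in cell 3 (illustration only).
GOAL = 3
ACTIONS = ["left", "right"]


def corridor_step(state, action, rng):
    return max(0, min(GOAL, state + (1 if action == "right" else -1)))


def corridor_pgs(state):
    return 1.0 if state == GOAL else 0.0


def corridor_reward(state, action, next_state):
    return 20.0 if next_state == GOAL else -1.0


def corridor_terminal(state):
    return state == GOAL
```

Next to the goal the policy always steps into it; far from the goal all successors score the same, so the choice falls back to a uniformly random one, which matches the tie-breaking rule in the text.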

5 RESULTS

We tested PGS in two well-known and commonly

used benchmark problems: the taxi domain, which

deﬁnes a fully observable MDP and Rocksample, a

POMDP that scales to very large state spaces. For the

former we implemented our own version of UCT, and

for the latter we modiﬁed the POMCP source code.

All tests ran on a desktop workstation with an Intel

i7-4790 CPU, 20 GB RAM and Debian GNU/Linux.

The challenge for robot planning under uncer-

tainty is achieving good performance within a ﬁnite

horizon, fast enough, even in large problems. These

two scenarios show the performance of PGS using

limited resources (very few Monte-Carlo simulations)

and how it scales in considerably large versions of

Rocksample.

5.1 Taxi Domain

The taxi domain, ﬁrst proposed in (Dietterich, 2000),

is a simple, fully-observable MDP often used to test

planning and learning algorithms. The taxi agent

moves in any of four directions in a 5 × 5 grid and

must pick up a passenger located in one of four possi-

ble depots, and bring it to another depot. A slight vari-

ation is the “ﬁckle taxi” in which movement is non-

deterministic: with a small probability (eg. p = 0.1)

the taxi will end up East or West of its intended direc-

tion. Possible actions are moving North, East, South

or West, collecting a passenger when standing on the

same grid cell or dropping the passenger (when car-

rying one). Rewards are −1 for each regular move,

−10 for dropping the passenger in the wrong location

and 20 for delivering a passenger correctly, which also

terminates the episode. We chose one instance of the

taxi domain and obtained the total discounted reward

of its optimal policy, 8.652, in order to compare it

with the experimental results. This particular conﬁg-

uration and in general the taxi domain are illustrated

in ﬁgure 1, where the dark cell at the top left corner is

the goal depot where the passenger must be dropped.

The walls shown in the picture are also included in the

experiment which means the agent’s movement is re-

stricted, in cells next to walls, to only open, adjacent

cells.
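The total discounted reward used to compare policies is the usual discounted sum of per-step rewards. The helper below is a minimal sketch; the function name and the example reward sequence are assumptions and do not correspond to the specific task instance or to the optimal value reported above.

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    discount = 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total


# Hypothetical episode: two regular moves followed by a correct delivery.
example_rewards = [-1, -1, 20]
```

Because the delivery reward is discounted by the number of steps taken, shorter successful episodes always score higher than longer ones under this measure.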

Figure 1: The taxi domain.

PGS in the taxi domain is easy to formalize as all

elements are fully observable. We simply award 0.25 points for picking up the right passenger and 1 point for dropping it at the correct depot ($G^{+}$). There are no restrictions ($G^{-} = \emptyset$). The terminal state reward is preserved but the rest rely on PGS, using $\gamma_{PGS} = 1$ and $\alpha = 10$ to reflect the punishment for illegally dropping a passenger. Finally, a discount factor of $\gamma = 0.95$ and a search depth of 90 steps were used within UCT. It is

common to allow the taxi to start only over a depot,

but in our experiments it could be anywhere on the

grid. We ran UCT with PGS rewards and PGS rollout

policy in both (regular and ﬁckle) versions of the taxi

domain, and obtained the average discounted rewards

and running times over 1000 runs. Table 1 shows the

result of repeated runs on the ﬁxed task (ﬁg. 1) and a

set of randomly generated episodes (randomized ori-

gin and destination depots, and taxi starting location)

using 1024 simulations.

Results are especially promising when we consider the total discounted reward of the optimal policy in the fixed (non-fickle) task (8.652). Restricting the amount of computation to only 256 simulations per move (≈ 1.6 s per episode), the PGS-based planner achieved an average discounted reward of 5.089. On random tasks it is important to mention that episodes were terminated after 5 s, but their (negative) reward was still included in the average.

Table 1: Performance in the taxi domain after 1000 runs with 1024 simulations.

Transition  Episodes  Avg. Return  Time (s)
Normal      Fixed     6.161        3.049
Normal      Random    4.257        3.531
Fickle      Fixed     3.275        4.410
Fickle      Random    2.138        4.176

Averaging performance, especially with stochas-

tic methods, hides some additional, interesting details

of particularly good runs. We ran a separate batch of

1000 episodes using 1024 simulations, of which 616

ﬁnished in 2 s. or less and 797 in 3 s. or less. In

these runs the mean discounted reward was particu-

larly high and the statistical mode was, also for the

entire set, the optimal discounted reward (8.652). Re-

sults are shown in table 2.

Table 2: Discounted rewards with 1024 simulations and limited runtime.

Time (s)  Count  Mean   Min    Mode
≤ 2       616    7.499  3.523  8.652
≤ 3       797    7.125  1.97   8.652

Interestingly, it seems that when given more time

(or simulations), a larger number of unsuccessful runs

appear (therefore reducing the minimum and average

returns). This could be due to the fact that, being

somewhat greedy, UCT can sometimes enter loops

when attempting to solve simple discrete mazes. At a

certain distance (sufﬁciently far from a reward source)

all possible next states (eg. cells in the maze) ap-

pear equally good (or bad) at ﬁrst and their action val-

ues might be very similar or even the same. Given

more time, this type of repetitive behavior results in

very long policies with extremely low returns. With

enough planning time (convergence in RL and MCTS

methods is, after all, guaranteed after asymptotically

many iterations), the agent might be able to break

out of this loop, locate the goal and improve its policy. This is precisely the reason for introducing additional reward sources, but we are interested in finding reasonably good solutions as quickly as possible.

The taxi domain is an interesting benchmark because,

despite being extremely simpliﬁed in terms of tasks

and actions, there are many “empty” cells that create

large groups of very similar states. In richer domains

with more features, where naive planning algorithms

may require systematic deliberation and long runtimes, PGS shows very promising results (e.g. Rocksample in the next subsection).

Our main motivation for speeding up MDP plan-

ning is transferring these methods to robotic task

planning, so we are particularly interested in main-

taining good levels of performance and not necessar-

ily optimizing indeﬁnitely. This means, in practice,

we are aiming at solving (parts of) complex domains

within a few seconds and not necessarily millisec-

onds. Despite being essentially a toy problem because

of its size and full observability, the taxi domain is

an abstraction of a common manipulation task (navigating, collecting, delivering) and our results show

PGS succeeds in achieving good levels of perfor-

mance with very limited time, planning directly on

an unfactored MDP.

5.2 Rocksample

Rocksample, originally found in (Smith and Sim-

mons, 2004), is a commonly used problem that sim-

ulates a Mars rover tasked with collecting valuable

rocks. This problem corresponds to a POMDP in

which the location of the agent and the rocks are

known, but the value of these rocks is initially un-

known and must be determined by the use of a noisy

sensor that returns one of two observations, good or

bad, with a given reliability. Rocksample[n,k] deﬁnes

an n× n grid with k rocks, where the agent may move

in any of four directions, sample a rock if standing

directly on top of it, or use the sensor on any rock

(action $check_i$ for rock $i$) for a total of $5 + k$ actions

(see ﬁg. 2). Rewards are 10 for sampling good rocks,

−10 for sampling bad rocks, 10 for exiting (East)

and −100 for leaving the grid in any other direction

(Smith and Simmons, 2004). We used POMCP as a

POMDP solver (Silver and Veness, 2010), but modi-

ﬁed it to test our proposal.

Figure 2: Special layout for Rocksample[11,11].

POMCP is a particularly fast POMDP solver mainly due to two reasons: it uses an unweighted particle filter to approximate the belief state, avoiding costly belief updates, and it expands a tree of histories instead of states, circumventing the curse of dimensionality. Within POMCP, instead of a belief state the agent receives a state sampled from the particle filter, which corresponds to a state with mixed observability. This way we were able to test our methodology within a POMDP solver, albeit relying on states that contain partially observable features (thus, not directly introducing preference over beliefs). In addition, POMCP uses slightly enhanced rocksample states, where the probability that a rock is good is updated directly after every corresponding check action, using the sensor efficiency and the previous likelihood. We defined $C = G^{+} \cap G^{-}$ to be the set of collected rocks, and $G^{p}$ the remaining rocks. Scoring function $v(s_i)$ returns 1 for good rocks with good observations ($G^{+}$) and −1 for bad rocks ($G^{-}$). Function $w(s_j)$ returns −1 if $H(s_j) > 0.5$, that is, if the binary entropy of rock $r$ ($s_j$) is higher than 0.5. POMCP comes with a preferred actions policy, which uses manually encoded heuristics such as “head North if there are rocks with more positive observations” or “check rocks that have been measured less than five times and have less than two positive observations”. Clearly, PGS succeeds in avoiding this level of over-specification.
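The interplay between check actions and the entropy penalty can be illustrated with a plain Bayes update for a binary sensor. This is a sketch under stated assumptions, not the POMCP code: the function names are hypothetical, and `efficiency` stands for the probability that the sensor reports the true value of the rock.

```python
import math


def update_good_probability(p_good, observed_good, efficiency):
    """Bayes update of P(rock is good) after one noisy check."""
    like_if_good = efficiency if observed_good else 1.0 - efficiency
    like_if_bad = 1.0 - efficiency if observed_good else efficiency
    evidence = like_if_good * p_good + like_if_bad * (1.0 - p_good)
    if evidence == 0.0:
        # The observation is impossible under the current estimate; keep it.
        return p_good
    return like_if_good * p_good / evidence


def binary_entropy(p):
    """Entropy (in bits) of a binary variable that is true with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def entropy_penalty(p_good, threshold=0.5):
    """Return -1 while the rock's entropy exceeds the threshold, else 0."""
    return 0.0 if binary_entropy(p_good) <= threshold else -1.0


# Two consecutive "good" readings from a hypothetical sensor with efficiency 0.8.
p0 = 0.5
p1 = update_good_probability(p0, True, 0.8)
p2 = update_good_probability(p1, True, 0.8)
```

Under these assumed numbers a single reading is not enough to bring the entropy below 0.5, while a second agreeing reading is, which is the kind of behaviour that makes check actions attractive during rollouts until the penalty disappears.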

We used γ

PGS

= 1 and α = 10 (to reﬂect the dif-

ference in rewards received when sampling good and

bad rocks). This scoring function deducts points for

undesirable states (eg. collecting bad rocks, high en-

tropy for any rock) and only adds points when collect-

ing good rocks, but further negative points are with-

drawn once the knowledge about any particular rock

increases (i.e. entropy < 0.5). In practice this means

that during rollouts check will be preferred if it re-

duces entropy for some rock, that sample will be pre-

ferred when standing over a promising rock, and that

otherwise movement actions will be considered.

We compared three different policies: uniformly

random with legal moves (“legal” in POMCP), ex-

plicit preferred actions (“smart” in POMCP) and our

own, “PGS”. Figure 3 shows the discounted rewards

averaged over 1000 runs for all three policies in rock-

sample [11, 11], [15,15], and the large [25, 25] and

[25,12], with up to 2048 Monte-Carlo simulations per

move.

PGS clearly outperforms the legal policy and is

only slightly outmatched by the smart policy. This

difference however reduces as the problem size in-

creases, particularly in [25,25], a very large problem

for rocksample standards and in [25, 12], an equally

large grid but with fewer reward sources.

Figure 3: Performance in Rocksample. Panels: (a) [11,11], (b) [15,15], (c) [25,25], (d) [25,12]. Each panel plots Avg. Discounted Reward against Simulations for the Legal, Smart and PGS policies.

Estimating the PGS value of derived states for every action

in our rollout policy may be somewhat computation-

ally expensive, particularly when all or many actions

are available, since it trades off simulations for run-

time. Results show, however, this pays off compared

to the random policy: with ≈ 2.5 seconds of com-

putation in rocksample[25,12], the legal rollout pol-

icy achieves a discounted reward of 5.713 with 2048

simulations, whereas PGS collects 7.35 with only 128

simulations. A faster implementation for planning on-

board robots might be necessary, but also a shallower

planning depth may be used (the experiments used a

depth of 90 steps). Additionally, extensions for es-

timating the goal proximity of beliefs and not only

sampled states (which POMCP does) might be help-

ful.

6 CONCLUSIONS & DISCUSSION

Our experimental results show that PGS is useful

to improve the performance of planning in large

(PO)MDP’s, despite its simplicity and even consider-

ing the contributions of algorithms such as POMCP,

that handle large state spaces relatively well. Plan-

ning directly over beliefs might be useful for prob-

lems with extreme uncertainty, and where many

uncertainty-reducing actions may be necessary as is

the case of many robotic planning domains. In fully

observable problems such as the Taxi domain, PGS

achieved a level of performance that seems unattain-

able for a uniformly random policy.

In domains with partial and mixed observabil-

ity and particularly in problems with scarce reward

sources, PGS easily outperformed the uniformly random policy. In these scenarios, such policies scale

poorly and domain knowledge becomes necessary to

achieve good performance. We showed that with

barely any domain-dependent knowledge, PGS can be

competitive with a manually designed, action selec-

tion policy with very detailed, heuristic knowledge.

This type of domain-independent bias is essential for

planning and acting in complex domains.

In general, we can identify three main approaches for speeding up planning in large stochastic domains: 1) action hierarchies, which produce smaller, abstract MDP’s and then transfer their solutions to the base MDP; 2) state abstractions, which group states together so that their values are shared and the values of unknown states can be approximated; and 3) PBRS, which forces an agent to focus on good action prospects and avoid potentially costly choices. For planning domains with high variability, it seems difficult to generate state or action abstractions efficiently in advance, and the planner might have to traverse many unique states anyway. Our PGS methodology addresses this issue by attempting to quickly identify reward sources and back-propagate scaled partial rewards, using as little domain knowledge as possible but exploiting an agent’s knowledge of its own goals. All of the aforementioned techniques are either strongly domain-dependent or use fixed criteria to generate groups or categories.
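The shaping idea behind the third approach can be sketched in a few lines. This is a generic potential-based shaping term in the sense of Ng et al. (1999), not the PGS definition itself: the set-based state encoding, the `goal_potential` function, and the `scale` and `gamma` parameters are illustrative assumptions introduced only for this sketch.

```python
from typing import FrozenSet


def goal_potential(state: FrozenSet[str], goal: FrozenSet[str]) -> float:
    """Hypothetical potential: fraction of goal conditions that hold in `state`.

    Both arguments are modelled as sets of satisfied conditions, which is an
    assumption of this sketch rather than a representation taken from the paper.
    """
    if not goal:
        return 0.0
    return len(state & goal) / len(goal)


def shaped_reward(
    reward: float,
    state: FrozenSet[str],
    next_state: FrozenSet[str],
    goal: FrozenSet[str],
    gamma: float = 0.95,
    scale: float = 1.0,
) -> float:
    """Environment reward plus the shaping term gamma * phi(s') - phi(s)."""
    phi_now = scale * goal_potential(state, goal)
    phi_next = scale * goal_potential(next_state, goal)
    return reward + gamma * phi_next - phi_now
```

Under these assumptions, a transition that satisfies an additional goal condition receives a positive bonus and one that undoes a condition is penalized, so a rollout gets a usable signal long before the full goal reward is reached.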

As previously stated, this work is only part of our efforts to introduce the notion of relevance in task planning. Future work includes expanding the formal definition of goal proximity for beliefs, and designing dynamic value approximation and/or state aggregation methods derived from this methodology. In order to transfer these methods to real-world robotic tasks, some form of state aggregation or abstraction will be required to map continuous to discrete state representations, and to correctly recognize states with relevant information and disregard those with irrelevant information. We argue that, as in PGS, states should be grouped based on criteria derived from the goal.

When thinking about planning and decision problems, it is normal to refer to state spaces and the number of states as a measure of their complexity. If we compare most MDP benchmark problems with real-world robotic planning problems, we find that the former often have combinatorial complexity, just as games do. We could potentially generate all possible states and therefore find arbitrarily complex policies. This idea does not transfer well to planning with robots in the “real world”, where the number of states may be unknown and potentially much larger than in a game. Despite these differences, we understand intuitively that some problems can be solved quickly at some level of abstraction and their solutions transferred back to the original domain. So, if state complexity does not necessarily represent how complex problems are, how should the complexity of planning problems be estimated? Certainly, problems with many states look more complex when only their computational complexity, often a function of the number of states, is considered. But more accurately, and because planning problems include not only a domain but also goals, some intrinsic relationship between the values of states and their distance to the goal (or subgoals) must be considered. A good planning algorithm should efficiently identify the gaps between state values in order to quickly tell good states apart from bad states. These values come uniquely from perceived (or simulated) rewards. Because the value of a state is the average discounted return of its children, it might be hard to differentiate between promising states early on, when the agent is simply too far from any source of reward. Conversely, if the agent always starts next to its goal, the problem should be trivial to solve regardless of the number of possible states. By designing planning algorithms that exploit these observations, we expect to improve the performance of online planning in robots and other agents, despite the apparent complexity of “real-world” domains.
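The effect of distance on value gaps can be illustrated with a small worked example. The setting is an assumption made only for this illustration and is not one of the experimental domains: a deterministic chain in which the only reward $R > 0$ is received on the step that reaches the goal, a discount factor $0 < \gamma < 1$ applied from the second step onwards, and an agent that walks straight to the goal from a state $d$ steps away.

```latex
\begin{align*}
V(d) &= \gamma^{\,d-1} R, \\
V(d) - V(d+1) &= \gamma^{\,d-1} (1 - \gamma)\, R .
\end{align*}
```

With $\gamma = 0.95$ and $R = 1$, the gap between neighboring states is $0.05$ at $d = 1$ but only about $0.004$ at $d = 50$. The difference that a planner must resolve from noisy rollout averages therefore shrinks geometrically with the distance to the reward, independently of how many states the domain contains.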

ACKNOWLEDGEMENTS

This work is supported by a DAAD research grant.

REFERENCES

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256.

Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303.

Eck, A., Soh, L.-K., Devlin, S., and Kudenko, D. (2016). Potential-based reward shaping for finite horizon online POMDP planning. Autonomous Agents and Multi-Agent Systems, 30(3):403–445.

Hester, T. and Stone, P. (2013). TEXPLORE: Real-time sample-efficient reinforcement learning for robots. Machine Learning, 90(3).

Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In ECML-06, number 4212 in LNCS, pages 282–293. Springer.

Konidaris, G. (2016). Constructing abstraction hierarchies using a skill-symbol loop. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 1648–1654.

Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 278–287. Morgan Kaufmann.

Ong, S. C. W., Png, S. W., Hsu, D., and Lee, W. S. (2010). Planning under uncertainty for robotic tasks with mixed observability. Int. J. Rob. Res., 29(8):1053–1068.

Pineau, J., Gordon, G., and Thrun, S. (2003). Policy-contingent abstraction for robust robot control. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, UAI’03, pages 477–484, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Pineau, J., Gordon, G. J., and Thrun, S. (2006). Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335–380.

Silver, D. and Veness, J. (2010). Monte-Carlo Planning in Large POMDPs. In Advances in Neural Information Processing Systems 23, pages 2164–2172.

Smith, T. and Simmons, R. (2004). Heuristic Search Value Iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI ’04, pages 520–527, Arlington, Virginia, United States. AUAI Press.

Somani, A., Ye, N., Hsu, D., and Lee, W. S. (2013). DESPOT: Online POMDP planning with regularization. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 26, pages 1772–1780. Curran Associates, Inc.

Sutton, R., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.

Sutton, R. S. and Barto, A. G. (2012). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 2nd edition. (to be published).
