lution to online planning onboard robots, but rather a
contribution towards the improvement of action selec-
tion in planning algorithms, when information about
the goal is available. The intended effect is more efficient planning in domains with many states and actions.
Alternative, well-known approaches for address-
ing large planning spaces include value approxima-
tion and state aggregation, but these work under the
assumption that there are large groups of states that
can be clustered together (due to similarity or other
reasons) using fixed criteria. Here, we are interested in how agents may use knowledge of their goal(s) to improve their action selection criteria, in particular by focusing on only a few good alternatives when many options are available, as is the case in domains with high variability and a large branching factor.
In the following sections we discuss related previous work and then explain our proposal. We provide two case studies and an analysis of experimental results, followed by conclusions and future work.
2 NOTATION
Unless otherwise specified, S and A are finite sets of states and actions respectively, and T(s, a, s′) is the probability of reaching state s′ when executing a in s, which yields a real-valued reward R(s, a, s′). An MDP is the tuple ⟨S, A, T, R⟩, with discount factor γ. Under partial observability, an agent receives an observation ω ∈ Ω with probability O(s, a, ω) = p(ω|s, a) and maintains an internal belief state b ∈ B, where b(s) is the probability of s being the current state. A POMDP is therefore the tuple ⟨S, A, T, R, Ω, O⟩. The sequence h_t = (a_0, ω_1, ..., a_{t−1}, ω_t) is the history at time t.
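To make this notation concrete, the tuples above can be written as plain containers. The following Python sketch is our own illustration; the field names, the dataclass layout, and the default discount value are assumptions, not part of any particular planner implementation.

from dataclasses import dataclass
from typing import Callable

@dataclass
class MDP:
    # The tuple ⟨S, A, T, R⟩ with discount factor gamma.
    S: frozenset         # finite set of states
    A: frozenset         # finite set of actions
    T: Callable          # T(s, a, s2): probability of reaching s2 when executing a in s
    R: Callable          # R(s, a, s2): real-valued reward
    gamma: float = 0.95  # discount factor (illustrative default)

@dataclass
class POMDP(MDP):
    # The tuple ⟨S, A, T, R, Ω, O⟩ adds observations and an observation model.
    Omega: frozenset = frozenset()  # observation set Ω
    O: Callable = None              # O(s, a, w) = p(w | s, a)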
The common notation for planning in POMDPs
uses belief states, or probability distributions over the
entire state space. That is, policies are given in terms
of beliefs. In this paper, however, we present our ac-
tion selection bias in terms of fully observable states
and states with mixed observability (states that con-
tain both fully and partially observable features). Our
experimental setup uses states sampled from a belief
state approximator. Future work includes extending
these definitions to more general cases, such as belief
states.
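As a rough illustration of that last point, a belief-state approximator such as an unweighted particle filter represents b as a bag of sampled states, and the planner simply draws states from it uniformly. The snippet below is a generic sketch under that assumption, not our actual experimental code, and the particle values are made up.

import random

def sample_state(particles):
    # With an unweighted particle set, b(s) is approximated by the
    # relative frequency of s among the particles, so drawing a state
    # for planning reduces to uniform sampling.
    return random.choice(particles)

# Three particles that agree on a fully observable feature but
# disagree on a hidden one (illustrative values only).
particles = [("room_a", "door_open"), ("room_a", "door_closed"), ("room_a", "door_open")]
print(sample_state(particles))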
3 RELATED WORK
Our work centers on improving the performance of
action selection for (PO)MDP planning algorithms,
relying mostly on UCT (Kocsis and Szepesvári, 2006)
and its POMDP extension, POMCP (Silver and Ve-
ness, 2010). Unlike more traditional POMDP algo-
rithms that approximate the belief space using vector
sets, POMCP uses Monte-Carlo Tree Search (MCTS)
and approximates the current belief state using an un-
weighted particle filter. Because this is a key improve-
ment of the Monte-Carlo family of algorithms, this
paper focuses on states with partially observable ele-
ments instead of belief states. The concept of mixed
observability has produced positive results even out-
side of MCTS algorithms (Ong et al., 2010). Alter-
natively, MCTS algorithms have also been used for
belief selection in POMDPs (Somani et al., 2013).
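For reference, both UCT and POMCP select actions inside the search tree with a UCB1-style rule that trades off the estimated value Q(s, a) against an exploration bonus c·sqrt(ln N(s) / N(s, a)). The sketch below is a generic rendering of that rule, not the original implementations.

import math

def ucb1_select(stats, c=1.0):
    # stats maps each action to (visit_count, mean_value); untried
    # actions are returned first, otherwise we maximise
    # Q(s, a) + c * sqrt(ln N(s) / N(s, a)).
    total = sum(n for n, _ in stats.values())
    best_action, best_score = None, float("-inf")
    for action, (n, q) in stats.items():
        if n == 0:
            return action
        score = q + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Example: an under-explored action can win despite a lower mean value.
print(ucb1_select({"left": (10, 0.4), "right": (2, 0.3)}))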
Existing approaches addressing the state-space
complexity involve clustering states and generalizing
state or belief values (Pineau et al., 2003), (Pineau
et al., 2006), function approximation (Sutton and
Barto, 2012) and random forest model learning (Hes-
ter and Stone, 2013). Our main critique of these meth-
ods for online task planning is that they have fixed
aggregation criteria that do not adapt to the con-
nection between states and goals. For instance, states
could be clustered together depending on context so
their shared values reflect the agent’s current goals.
Other approaches generate abstractions for plan-
ning and learning over hierarchies of actions (Sut-
ton et al., 1999), (Dietterich, 2000). The drawback
is that relatively detailed, prior knowledge of the do-
main is required to manually create these hierarchies,
although recent work suggests a possible way to build
them automatically (Konidaris, 2016).
Planning algorithms for (PO)MDPs overlap with
reinforcement learning (RL) methods, with the differ-
ence that in RL the agent must find an optimal pol-
icy while discovering and learning the transition dy-
namics. Reward shaping is commonly used in RL to
improve performance by simply giving additional re-
wards for certain preferred actions. This generates an
MDP with a different reward distribution and there-
fore different convergence properties, but potential-
based reward shaping (PBRS) has been shown to pre-
serve policy optimality (Ng et al., 1999). A study of
PBRS in the context of online POMDP planning can
be found in (Eck et al., 2016).
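Concretely, PBRS replaces the original reward with R′(s, a, s′) = R(s, a, s′) + γΦ(s′) − Φ(s) for some potential function Φ over states. A minimal sketch of this wrapper (our own, assuming a user-supplied Φ):

def shaped_reward(R, phi, gamma):
    # Potential-based reward shaping (Ng et al., 1999):
    # R'(s, a, s2) = R(s, a, s2) + gamma * phi(s2) - phi(s)
    def R_shaped(s, a, s2):
        return R(s, a, s2) + gamma * phi(s2) - phi(s)
    return R_shaped

# Toy usage (illustrative functions only):
R = lambda s, a, s2: 1.0 if s2 == "goal" else 0.0
phi = lambda s: 1.0 if s == "goal" else 0.0
R_prime = shaped_reward(R, phi, gamma=0.95)
print(R_prime("start", "move", "goal"))  # 1.0 + 0.95 * 1.0 - 0.0 = 1.95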
Building on these arguments, the work presented
in this paper reflects our efforts to provide a general-
purpose PBRS bias for action selection in planning
tasks of interest to robotics, in order to address the
complexity of planning in domains with large vari-
ability, where states or beliefs may not always be eas-
ily aggregated.