Dynamic Programming for One-sided Partially Observable
Pursuit-evasion Games
Karel Horák and Branislav Bošanský
Department of Computer Science, Faculty of Electrical Engineering,
Czech Technical University in Prague, Prague, Czech Republic
Keywords:
Pursuit-evasion Games, One-sided Partial Observability, Infinite Horizon, Value Iteration, Concurrent Moves.
Abstract:
Pursuit-evasion scenarios appear widely in robotics, security domains, and many other real-world situations.
We focus on two-player pursuit-evasion games with concurrent moves, infinite horizon, and discounted re-
wards. We assume that the players have partial observability, however, the evader has an advantage of know-
ing the current position of pursuer’s units. This setting is particularly interesting for security domains where
a robust strategy, maximizing the utility in the worst-case scenario, is often desirable. We provide, to the best
of our knowledge, the first algorithm that provably converges to the value of a partially observable pursuit-
evasion game with infinite horizon. Our algorithm extends the well-known value iteration algorithm by exploiting
that (1) value functions of our game depend only on the position of the pursuer and the belief he has about the
position of the evader, and (2) that these functions are piecewise linear and convex in the belief space.
1 INTRODUCTION
Pursuit-evasion games (PEGs) appear in many sce-
narios in robotics and security domains (Vidal et al.,
2002; Chung et al., 2011). A team of centrally con-
trolled pursuing units (the pursuer) aims to locate
and capture the evader, while the evader aims for the
opposite. We study these games and assume their
discrete-time variant played on a finite graph. We as-
sume that units of both players move simultaneously,
the horizon of the game is infinite, rewards are dis-
counted over time with discount factor $\gamma \in [0,1)$, and the players have only partial information about the
current state. Formally, such a game belongs to zero-
sum partially observable stochastic games (POSGs).
We aim for finding robust strategies of the pur-
suer against the worst-case evader. Specifically, we
assume that the evader knows the positions of pur-
suer’s units and her only uncertainty is the strategy of
the pursuer and the move he will perform next. Al-
though such a perfectly informed adversary is rarely encountered in reality, the pursuer usually does not know what
information is being revealed to the evader. Hence,
in order to derive robust strategies (i.e. maximizing
pursuer’s reward against any type of the evader), it is
natural to use such a perfectly informed adversary.
We design the first algorithm that provably con-
verges to the value of such one-sided partially observ-
able pursuit-evasion games. Moreover, as the value
converges, strategies of the players converge to the
optimal strategies as well. This contrasts with exist-
ing approaches in robotics and security, where heuris-
tic solutions without any optimality guarantees are
used (Vidal et al., 2002; Chung et al., 2011).
Our algorithm extends the well-known value it-
eration algorithms for concurrent-moves stochas-
tic games (Shapley, 1953) and partially observable
Markov decision processes (POMDPs) (Smallwood
and Sondik, 1973; Monahan, 1982; Pineau et al.,
2003; Smith and Simmons, 2012). We show that, sim-
ilarly to POMDPs, one-sided pursuit-evasion games
allow us to define compactly represented value func-
tions and propose a dynamic programming approach
to approximate them in an iterative manner. Specifi-
cally we show that the value functions (1) depend only
on the position of the pursuer’s units and his belief
about the possible position of the evader, but not on
the history of moves, (2) these functions are piecewise
linear and convex and thus we can represent them as
a set of α-vectors (Section 2.1), and (3) we can de-
sign a dynamic-programming operator with provable
convergence to optimal value functions (Section 3).
Our results for one-sided partially observable
pursuit-evasion games have similar implications as
those derived for POMDPs. Our paper is thus the first
step in a whole line of research. The importance of
the results is highlighted in the derivation of the full-
backup value iteration algorithm. Moreover, due to
the close similarity with POMDP models in the struc-
ture of the solution, efficient point-based versions of
the algorithm should be applicable as well.
Due to the space constraints, proofs of some re-
sults can be found in the full version of the paper.
1.1 Related Work
A similar model with one-sided partial observability
where one of the players has a perfect information
was presented in (McEneaney, 2004). This player is
assumed to know the action the opponent will play
at the current stage. Such a game is essentially turn-based, and only pure strategies are thus considered.
Disregarding randomization has severe limita-
tions. In many cases, if the evader knows the action of
the pursuer before she has to decide herself, she can
use this additional information to avoid getting caught (simply by forever avoiding the vertices the pursuer is about to move to next). Randomized strategies thus
better correspond to real-world problems occurring in
real-time. Using them, however, presents additional
challenges that we solve in this paper.
Finite horizon POSGs can be also solved by con-
verting a game to the matrix form by enumerating all
pure strategies of the players. In (Hansen et al., 2004),
the pure strategies are constructed in an incremen-
tal way using dynamic programming while pruning those that are dominated. Although this improves on the naïve enumeration approach, the number of strate-
gies is still exponential in the horizon in the worst case
and so is their size, which makes the algorithm im-
practical when focusing on long-term interactions.
2 FINITE-HORIZON GAME
We use the notion of finite-horizon POSGs, or
extensive-form games (EFGs), to reason about an
infinite-horizon pursuit-evasion game. An EFG is a
tuple $G = (\mathcal{N}, \mathcal{H}, \mathcal{Z}, \mathcal{T}, u, \mathcal{I})$. $\mathcal{N}$ is the set of players, in our case $\mathcal{N} = \{p, e\}$, where $p$ stands for the pursuer and $e$ for the evader. Set $\mathcal{H}$ denotes a finite set of histories of actions taken by all players from the beginning of the game. Every history corresponds to a node in the game tree; hence we use the terms history and node interchangeably. Each of the histories may be either (1) terminal ($h \in \mathcal{Z} \subseteq \mathcal{H}$), where the game ends and player $i$ gets utility $u_i(h)$, (2) controlled by the nature that selects the successor node according to a fixed probability distribution known to all players, or (3) a node where one of the players from $\mathcal{N}$ is to act. We consider a zero-sum scenario where $u_p(h) = -u_e(h)$. To simplify the notation we use $u(h)$ to denote the pursuer's reward. An ordered list of transitions of player $i$ from the root to node $h$ is referred to as player $i$'s sequence. Allowed transitions in the game are modeled using a transition function $\mathcal{T}$ that provides a set of successor nodes for each non-terminal history. The imperfect observation of players is modelled via information sets $\mathcal{I}_i$ that form a partition over histories $h$ where player $i \in \mathcal{N}$ takes action. We assume a perfect recall setting where the players never forget their past actions, i.e. for every $I_i \in \mathcal{I}_i$, all histories $h \in I_i$ have the same player $i$'s sequence. Each information set $I_i \in \mathcal{I}_i$ corresponds to one decision point of player $i$. A randomized behavioral strategy $\sigma_i$ of player $i$ assigns a distribution over actions to each of the information sets in $\mathcal{I}_i$. $\sigma_i$ can be represented in the form of a realization plan $r$ which assigns the probability $r(\sigma_i)$ of playing sequence $\sigma_i$ to each player $i$'s sequence $\sigma_i$. The behavioral strategy at an information set $I_i \in \mathcal{I}_i$ reached using a sequence $\sigma_i$ is then $b(I_i, a) = r(\sigma_i a)/r(\sigma_i)$. A Nash equilibrium (NE) in an EFG is a pair of behavioral strategies, in which each player best-responds to the strategy of his opponent. The expected utility of playing NE strategies is termed the value of the game.
We will now use this terminology to construct an EFG for a finite-horizon version of a pursuit-evasion game with $N$ pursuing units played on a graph $G = \langle V, E\rangle$ for $t$ rounds (we term $t$ the horizon). Part of the game tree is shown in Figure 1. At every round $\tau \leq t$, the pursuer's units occupy vertices $s^\tau_p$, where $s^\tau_p = \{s^\tau_{p,1}, \ldots, s^\tau_{p,N}\}$ is an $N$-element multiset of vertices of $G$, and the evader is located in a vertex $s^\tau_e \in V$. The goal of the pursuer is to achieve a situation where the evader is caught, i.e. $s^\tau_e \in s^\tau_p$. In every round, the players move their units to vertices adjacent to their current positions ($\mathrm{adj}(v)$ denotes the set of vertices adjacent to $v$). The position of the evader in round $\tau+1$ is thus $s^{\tau+1}_e \in \mathrm{adj}(s^\tau_e)$. We overload the operator $\mathrm{adj}$ to apply also to multisets representing the positions of the pursuer's units, i.e. $s^{\tau+1}_p \in \mathrm{adj}(s^\tau_p)$, where $\mathrm{adj}(s^\tau_p) = \times_{i=1\ldots N}\,\mathrm{adj}(s^\tau_{p,i})$.
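As a small illustration of the overloaded adjacency operator (the graph, the identifiers and the decision to allow a unit to stay in place are assumptions of this sketch, not part of the paper's model), the joint moves of the pursuer's units are the Cartesian product of the per-unit adjacencies, treated as multisets:

```python
from itertools import product

# Illustrative adjacency lists of a small undirected graph G = (V, E);
# here every vertex is also adjacent to itself, i.e. a unit may stay put
# (drop the self-loops if staying in place is not allowed).
adj_list = {
    0: [0, 1],
    1: [0, 1, 2],
    2: [1, 2],
}

def adj_evader(v):
    """Vertices the evader may occupy in the next round."""
    return adj_list[v]

def adj_pursuer(positions):
    """Joint moves of all pursuing units: the Cartesian product of the
    per-unit adjacencies; positions is a sorted tuple used as a multiset."""
    return sorted({tuple(sorted(move))
                   for move in product(*(adj_list[v] for v in positions))})

print(adj_evader(1))        # [0, 1, 2]
print(adj_pursuer((0, 2)))  # [(0, 1), (0, 2), (1, 1), (1, 2)]
```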
A horizon-$t$ game $G^t\langle s^0_p, b^0\rangle$ is parametrized by the initial position of the pursuer $s^0_p \in V^N$ and a distribution over the evader's initial positions $b^0 \in \Delta(V)$ known to both players (we term $b^0$ the initial belief). The game starts with a chance move selecting the initial position of the evader $s^0_e$ (based on $b^0$).
A history $h \in \mathcal{H}$ in a game with horizon $t$ corresponds to a list of positions $s^0_e s^1_p s^1_e \cdots s^\tau_p s^\tau_e$, where $\tau \leq t$. The utility values are assigned to terminal histories as follows: if the pursuer failed to capture the evader in time, i.e. if $\tau = t$ and $s^\tau_e \notin s^\tau_p$, he gets utility $u(h) = 0$; if he successfully captured the evader within the time limit $t$, i.e. if $\tau \leq t$ and $s^\tau_e \in s^\tau_p$, he gets the reward $u(h) = \gamma^\tau$ for capturing the evader in $\tau$ rounds (where $\gamma \in [0,1)$ is the discount factor). The transition function $\mathcal{T}$ complies with the graph (i.e., the adjacency function $\mathrm{adj}$), hence $s^\tau_p \in \mathrm{adj}(s^{\tau-1}_p)$ and $s^\tau_e \in \mathrm{adj}(s^{\tau-1}_e)$ for every $\tau \geq 1$. For notational simplicity we denote the sequence of pursuer's actions $s^1_p \cdots s^\tau_p$ in $h$ as $h|_p$ and the sequence of evader's actions $s^1_e \cdots s^\tau_e$ as $h|_e$.

[Figure 1: EFG representation of a finite-horizon PEG (game tree omitted).]
The position of the evader is unknown to the pursuer. Hence, in a perfect recall game, there is one pursuer's information set $I_p[\sigma_p]$ for each of his sequences $\sigma_p$, where $I_p[\sigma_p] = \{h' \mid h' \in \mathcal{H}\setminus\mathcal{Z} : h'|_p = \sigma_p\}$. The evader, on the other hand, knows the situation almost perfectly. She knows where the pursuer's units were located before the pursuer acted in the current round (recall that the pursuer acts first). The only information missing to the evader is the action being taken by the pursuer in the current round. Hence, for every history $h = s^0_e s^1_p s^1_e \cdots s^\tau_p s^\tau_e$ where the pursuer is to play, there is an evader's information set $I_e[h] = \{h\,s^{\tau+1}_p \mid s^{\tau+1}_p \in \mathrm{adj}(s^\tau_p)\}$ containing all possible continuations of the pursuer.
2.1 Shape of the Value Function
The sizes of the extensive-form representation and as-
sociated behavioral strategies grow exponentially as
the horizon increases. This makes it quickly impos-
sible to use standard algorithms for game trees, espe-
cially since we aim to solve infinite horizon games.
We alleviate the problem of increasing complexity
of the strategy representation by representing strate-
gies only using their values. We show that the value
of a strategy is linear in the belief, and we can thus
represent it using just |V | real numbers. Moreover,
when the horizon is finite, we need to consider only
finitely many strategies regardless of the initial belief,
which makes value functions, formed by values of
best strategies for each belief, be piecewise linear and
convex and allows us to represent them compactly.
Definition 1. A value function $v^t_{s^0_p} : \Delta(V) \to [0,1]$ is a function assigning the value $v^t_{s^0_p}(b^0)$ of the game $G^t\langle s^0_p, b^0\rangle$ to every initial belief $b^0$. By $v^t$ we mean a set of value functions $v^t_{s^0_p}$, one for each initial position $s^0_p \in V^N$ of the pursuer.

In the following text, we show that the value function $v^t_{s^0_p}$ is piecewise linear and convex (PWLC) in the belief for every finite horizon $t$. For notational simplicity, the term linear is used to refer to affine functions as well. The proof is structured as follows: (1) first we show that the expected utility of every pursuer's strategy is linear in the belief; next (2) we show that it is sufficient to consider a finite set of pursuer's strategies $\Sigma^t_{s^0_p}$ when looking for a Nash equilibrium strategy; and finally (3) we show that the PWLC nature of the value function follows from (1) and (2).
Lemma 1. Let $\sigma_p$ be a randomized behavioral strategy of the pursuer in the games $G^t\langle s^0_p, b^0\rangle$, where the pursuer starts in vertices $s^0_p$, parametrized by the initial belief $b^0$. The expected utility of playing $\sigma_p$ against a best-responding opponent is linear in $b^0$.
Theorem 1. Let $G^t\langle s^0_p, b^0\rangle$ be a horizon-$t$ game parametrized by the initial belief $b^0$ where the pursuer starts in a set of vertices $s^0_p$. There exists a finite set of pursuer's behavioral strategies $\Sigma^t_{s^0_p}$ such that for every initial belief $b^0$, there is at least one strategy $\sigma_p \in \Sigma^t_{s^0_p}$ that is in a Nash equilibrium of $G^t\langle s^0_p, b^0\rangle$.
Proof. We use the sequence-form linear program for solving EFGs (Koller et al., 1996) to reason about the set of strategies $\Sigma^t_{s^0_p}$. In this LP, the values in every information set of the evader, as well as the value $v(\mathrm{root})$ in the root node, are computed in a bottom-up fashion. Every such value $v(I_e)$ of an information set $I_e$ can be seen as a concave piecewise linear function in the space of pursuer's realization plans (a compact representation of his behavioral strategies). The pursuer then seeks a realization plan that maximizes $v(\mathrm{root})$; the maximizer can be found among the extreme points of the line segments of $v(\mathrm{root})$, i.e. the vertices of a polytope bounded by this function (Vanderbei, 2014). We show that the set of such extreme points does not depend on the initial belief $b^0$.

There is one information set $I_e[s^0_e]$ of the evader for each of her initial positions $s^0_e$. The utility of every terminal node in the subgame beneath $I_e[s^0_e]$ is multiplied by the chance probability $b^0(s^0_e)$, which allows us to factor out this probability and obtain the following constraint for the root node:

$$v(\mathrm{root}) \leq \sum_{s^0_e \in s^0_p} b^0(s^0_e) + \sum_{s^0_e \in V\setminus s^0_p} b^0(s^0_e) \cdot \hat{v}(I_e[s^0_e]) \qquad (1)$$

The value $v(\mathrm{root})$ is a convex combination of concave piecewise linear functions $\hat{v}(I_e[s^0_e])$. As the belief was factored out, these functions, as well as the finite set of their extreme points $P[s^0_e]$, no longer depend on the belief. This convex combination with arbitrary coefficients $b^0$ cannot have an extreme point where none of the functions $\hat{v}(I_e[s^0_e])$ has one. The set of extreme points is therefore a subset of $\bigcup_{s^0_e} P[s^0_e]$, a finite set that does not depend on the belief. Each of the extreme points in $\bigcup_{s^0_e} P[s^0_e]$ corresponds to one pursuer's realization plan, and thus to one of his behavioral strategies, which allows us to construct the finite set $\Sigma^t_{s^0_p}$.
Theorem 2. The value function $v^t_{s^0_p}$ is piecewise linear and convex in the belief space.

Proof. This result directly follows from Lemma 1 and Theorem 1. There is a finite set of randomized strategies $\Sigma^t_{s^0_p}$ that has to be considered by the pursuer, and the value of each such strategy is linear in the belief space. Thus the value function $v^t_{s^0_p}$ is a pointwise maximum taken over a finite set of linear functions, which is a PWLC function in the belief space.
A PWLC function can be represented as a finite set of α-vectors. Every α-vector $\alpha = (\alpha_1, \ldots, \alpha_{|V|})$ represents one of the affine functions by assigning an expected reward $\alpha_i$ to each pure belief. We will often work with the α-vector representation of a value function, hence we overload the notation and consider value functions to be sets of such α-vectors as well. Lemma 1 and Theorem 1 imply that each linear segment of the value function matches one pursuer's strategy; we thus use the terms α-vector and pursuer's strategy interchangeably. This is similar to POMDPs, where each α-vector matches one conditional plan.
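As an illustration of this representation (the names and the numbers below are made up for the example), a value function stored as a set of α-vectors is evaluated at a belief by taking the pointwise maximum of the corresponding linear functions:

```python
import numpy as np

# A PWLC value function for one pursuer position is a set of alpha-vectors;
# alpha_i is the expected reward at the pure belief "evader is in vertex i".
alphas = np.array([
    [1.0, 0.2, 0.1],   # value of one pursuer strategy
    [1.0, 0.0, 0.4],   # value of another pursuer strategy
])

def value(alphas, belief):
    """v(b) = max_alpha <alpha, b>: pointwise maximum over linear functions."""
    return float(np.max(alphas @ belief))

b = np.array([0.0, 0.5, 0.5])   # belief over |V| = 3 vertices
print(value(alphas, b))         # max(0.15, 0.2) = 0.2
```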
3 VALUE ITERATION
In the previous section, we related the concept of the value functions to the EFG representation of the game and discussed that these functions have desirable properties. We leverage their representation to design a dynamic programming approach inspired by value iteration algorithms for either POMDPs (Smallwood and Sondik, 1973; Monahan, 1982) or perfect-information stochastic games (Shapley, 1953). A sequence of value functions $\{v^t\}_{t=0}^{\infty}$ is constructed by the algorithm, starting with the values of a horizon-0 game, where the pursuer wins only when he starts in the same vertex as the evader.

We avoid using the exponentially-sized representation of the underlying EFG by computing the value function of a horizon-$t$ game using the solution of the game with horizon $t-1$. First, we state a well-defined value update formula that expresses $v^t$ in terms of $v^{t-1}$ (Theorem 3). We let the players choose their strategies for the first round of the horizon-$t$ game using the maximin principle (we term these one-step strategies) and we show that the pursuer can use these strategies to update his belief. The pursuer's one-step strategy $\pi_p$ is a distribution over the possible actions of his units, $\pi_p \in \Delta(\mathrm{adj}(s^0_p))$, from which he samples his action. The evader acts similarly; however, she conditions her decision on her true position $s^0_e$ (not just on the overall belief available to the pursuer). Her one-step strategy is thus a mapping $\pi_e : V \to \Delta(V)$ such that $\pi_e(s^0_e)$ assigns zero probability to vertices not adjacent to $s^0_e$.

The piecewise linearity and convexity of the value functions have implications for their computation. First, they allow us to find optimal one-step strategies by means of linear programming (Section 3.1). Furthermore, we need not evaluate the value update formula at every point of the belief space to construct new value functions. Instead, we can use an incremental algorithm which inspects the extreme points of the line segments of a temporary function to check whether it can terminate because the value function has been constructed, or whether further linear segments have to be added.
Theorem 3. The value of $G^t\langle s^0_p, b\rangle$ is computed from the value functions $v^{t-1}$ of horizon-$(t-1)$ games. It holds that

$$v^t_{s^0_p}(b) = \sum_{s_e \in s^0_p} b(s_e) + \gamma\Big[\sum_{s_e \in V\setminus s^0_p} b(s_e)\Big] \cdot \max_{\pi_p}\min_{\pi_e} \sum_{s^1_p \in V^N} \pi_p(s^1_p) \cdot v^{t-1}_{s^1_p}(b^{\pi_e}) \qquad (2)$$

where the transformed belief $b^{\pi_e}$ depends solely on the evader's one-step strategy $\pi_e$ and the parametrization of the game $G^t\langle s^0_p, b\rangle$:

$$b^{\pi_e}(s^0_e) = \frac{1}{\sum_{s_e \in V\setminus s^0_p} b(s_e)} \sum_{s_e \in V\setminus s^0_p} b(s_e) \cdot \pi_e(s_e, s^0_e) \qquad (3)$$

The computation of $v^t$ using Eq. (2) forms a dynamic programming operator $H$, such that $v^t = Hv^{t-1}$.
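A minimal sketch of the belief transformation in Equation (3), assuming (for illustration only) that a belief is a dict mapping vertices to probabilities and that the evader's one-step strategy $\pi_e$ maps each vertex to a dict over its neighbours:

```python
def transformed_belief(b, pi_e, s0_p):
    """Belief b^{pi_e} over the evader's next position, Eq. (3):
    renormalize b over the vertices outside the pursuer's position s0_p,
    then push it forward through the evader's one-step strategy pi_e.
    Assumes some belief mass lies outside s0_p."""
    free = {s: p for s, p in b.items() if s not in s0_p}
    mass = sum(free.values())                      # sum_{s_e not in s0_p} b(s_e)
    b_next = {}
    for s_e, p in free.items():
        for s_next, q in pi_e[s_e].items():        # pi_e(s_e, s_next)
            b_next[s_next] = b_next.get(s_next, 0.0) + p * q / mass
    return b_next

# Example on a path 0 - 1 - 2 with a single pursuer unit at vertex 1:
b = {0: 0.5, 1: 0.0, 2: 0.5}
pi_e = {0: {0: 0.5, 1: 0.5}, 2: {1: 0.5, 2: 0.5}}
print(transformed_belief(b, pi_e, s0_p={1}))       # {0: 0.25, 1: 0.5, 2: 0.25}
```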
Proof. The correctness of the value update formula will be proved by computing the value of $G^t\langle s^0_p, b\rangle$ in a bottom-up fashion. We start by considering that the one-step strategies of the players for the first round of the game are fixed, while they play optimally afterward. This determines the pursuer's expected reward at every node in the game tree, which we use to express his expected utility in the root node (Lemma 2). As the behavior in the first round of the game is fixed, parts of the game tree are independent of each other; we refer to these subgames as $G[s^1_p]$. This allows us to evaluate the expectation from Lemma 2 by solving these games separately. It turns out that the games $G[s^1_p]$ are strategically equivalent to the shorter-horizon games $G^{t-1}\langle s^1_p, b^{\pi_e}\rangle$, the solution of which is represented by the value functions $v^{t-1}$. The expectation can thus be expressed solely in terms of $v^{t-1}$. Finally, we relax the assumption of fixed strategies in the first round, which yields the desired maximin formula (Equation (2)).
Let $\pi_p \in \Delta(\mathrm{adj}(s^0_p))$ be a fixed pursuer's one-step strategy, and let $\pi_e : V \to \Delta(V)$ be a fixed one-step strategy of the evader. Assume that both players play according to $\pi_p$ and $\pi_e$ in the first round of the game, i.e. the pursuer follows $\pi_p$ in his information set $I_p[\emptyset]$ (i.e. the pursuer's information set where he has not acted yet, see Figure 1) and the evader plays according to $\pi_e(s^0_e)$ in her information set $I_e[s^0_e]$ (where she has received the information that she is located in vertex $s^0_e$). Once the first round of the game is over, the players continue with the best strategies available to them. We denote such optimal strategies, where the players are restricted to play $\pi_p$ and $\pi_e$ in the first round, as $\sigma_p$ and $\sigma_e$.
Definition 2. Let $\pi_p$, $\pi_e$ be fixed one-step strategies for the first round of $G^t\langle s^0_p, b\rangle$ and let $\sigma_p$, $\sigma_e$ be optimal strategies with the restriction to play $\pi_p$ and $\pi_e$ in the first round. The pursuer's expected reward when $(\sigma_p, \sigma_e)$ are followed and node $h$ in the game tree is reached is denoted $u(h)$ and termed the expected reward in $h$.
We proceed by expressing the pursuer's expected utility when the strategies $(\sigma_p, \sigma_e)$ are followed, by propagating expected rewards from subsequent nodes in the game tree. We use histories of the form $s^0_e s^1_p s^1_e$ where the evader started in vertex $s^0_e$ (based on the move of nature) and then, in the first round, the pursuer moved his units to vertices $s^1_p$ and the evader moved to $s^1_e$.
Lemma 2. The expected reward in the root node is:

$$u(\emptyset) = \sum_{s^0_e \in s^0_p} b(s^0_e) + \Big[\sum_{s^0_e \notin s^0_p} b(s^0_e)\Big] \cdot \sum_{s^1_p} \pi_p(s^1_p)\Bigg(\gamma \sum_{s^1_e \in s^1_p} b^{\pi_e}(s^1_e) + \Big[\sum_{s^1_e \notin s^1_p} b^{\pi_e}(s^1_e)\Big] \cdot \sum_{s^1_e \notin s^1_p}\sum_{s^0_e \notin s^0_p}\Bigg[\frac{b(s^0_e)\,\pi_e(s^0_e, s^1_e)}{\sum_{\tilde{s}^1_e \notin s^1_p}\sum_{\tilde{s}^0_e \notin s^0_p} b(\tilde{s}^0_e)\,\pi_e(\tilde{s}^0_e, \tilde{s}^1_e)} \cdot u(s^0_e s^1_p s^1_e)\Bigg]\Bigg) \qquad (4)$$
Lemma 2 expresses the value in the root node based on the expected rewards in histories $s^0_e s^1_p s^1_e$ where the pursuer is to move. The pursuer knows only $s^1_p$, hence these histories are partitioned into his information sets $I_p[s^1_p]$, one for each pursuer's move $s^1_p$ in the first round (see Figure 1). Importantly, for every subgame below $I_p[s^1_p]$, there is no information set that would involve nodes not present in this subgame; neither the pursuer nor the evader forgets that $s^1_p$ was played. The optimal behavior in these subgames hence depends only on the belief in $I_p[s^1_p]$, which is fixed due to the fixed behavior in the first round. We can thus compute the value of the subgame below $I_p[s^1_p]$ separately by making chance simulate the belief in $I_p[s^1_p]$.
Let us construct a game $G[s^1_p]$ which consists of the information set $I_p[s^1_p]$ and the subgame beneath it. In this game, the information set $I_p[s^1_p]$ is reached with probability $\beta = \sum_{s^1_e \notin s^1_p} b^{\pi_e}(s^1_e)$, while with probability $1-\beta$ the pursuer gets utility $\gamma$ without play; this accounts for the reward the pursuer gets if he catches the evader in the first round. The nature player simulates the belief $b[s^1_p]$ in the information set $I_p[s^1_p]$, so that the probability of every history in this information set, given that this set was reached, is identical to the original game. The value of the game $G[s^1_p]$ corresponds to the following part of Equation (4):

$$\gamma \overbrace{\sum_{s^1_e \in s^1_p} b^{\pi_e}(s^1_e)}^{\substack{\text{evader caught}\\ \text{in the first round}}} + \overbrace{\Big[\sum_{s^1_e \notin s^1_p} b^{\pi_e}(s^1_e)\Big]}^{\substack{\text{evader not caught}\\ \text{in the first round}}} \cdot \sum_{s^1_e \notin s^1_p}\sum_{s^0_e \notin s^0_p}\Bigg[\underbrace{\frac{b(s^0_e)\,\pi_e(s^0_e, s^1_e)}{\sum_{\tilde{s}^1_e \notin s^1_p}\sum_{\tilde{s}^0_e \notin s^0_p} b(\tilde{s}^0_e)\,\pi_e(\tilde{s}^0_e, \tilde{s}^1_e)}}_{\text{belief } b[s^1_p] \text{ of history } s^0_e s^1_p s^1_e \text{ in } I_p[s^1_p]} \cdot u(s^0_e s^1_p s^1_e)\Bigg] \qquad (5)$$
In the case of $G[s^1_p]$, there are multiple histories for every current position of the evader $s^1_e$ in the information set $I_p[s^1_p]$ (resulting from different initial locations of the evader $s^0_e$). We show that we need not account for the different initial positions of the evader, and thus all histories in $I_p[s^1_p]$ with the same current position of the evader $s^1_e$ can be merged. The resulting game contains a single history for each $s^1_e$ in $I_p[s^1_p]$, and thus this game is equivalent to a shorter-horizon game $G^{t-1}\langle s^1_p, b^{\pi_e}\rangle$ up to a multiplication of the utilities by $\gamma$ to account for the round that has already passed. This allows using the solution of $G^{t-1}\langle s^1_p, b^{\pi_e}\rangle$, represented by the value functions $v^{t-1}$, to express the value of $G[s^1_p]$.
Definition 3. Two deterministic game trees over nodes $H_1, H_2$ are isomorphic if there exists a bijection $\xi : H_1 \to H_2$ such that $v \in H_1$ is a successor of $u \in H_1$ if and only if $\xi(v)$ is a successor of $\xi(u)$, $n \in H_1$ is a pursuer's node if and only if $\xi(n)$ is a pursuer's node, it is a terminal node if and only if $\xi(n)$ is a terminal node, and the utilities satisfy $u(n) = u(\xi(n))$. Moreover, the trees have the same informational structure: two nodes $u, v \in H_1$ are in the same information set if and only if $\xi(u), \xi(v)$ are in the same information set.
We can observe that the subtrees of nodes $s^0_e s^1_p s^1_e$ and $\bar{s}^0_e s^1_p s^1_e$ (where $s^0_e$ and $\bar{s}^0_e$ stand for two different initial positions of the evader) are isomorphic, as we can establish a bijection $\xi(s^0_e s^1_p s^1_e h_{\mathrm{rest}}) = \bar{s}^0_e s^1_p s^1_e h_{\mathrm{rest}}$. The utility of terminal histories does not depend on the initial position of the evader (only on the time the evader was captured). Whenever a pursuer's node $u$ is in an information set $I_p$, node $\xi(u)$ is in $I_p$ as well (because the pursuer has no way to detect the evader's initial position). Moreover, whenever the evader cannot distinguish between two histories $s^0_e s^1_p s^1_e \cdots s^q_p$ and $s^0_e s^1_p s^1_e \cdots \hat{s}^q_p$, she cannot distinguish between the histories $\bar{s}^0_e s^1_p s^1_e \cdots s^q_p$ and $\bar{s}^0_e s^1_p s^1_e \cdots \hat{s}^q_p$ either (because her uncertainty relates to the pursuer's move at round $q$, which does not depend on the initial position of the evader). Thus the subtrees also have the same informational structure.
Lemma 3. Let $I$ be the topmost information set of $G[s^1_p]$ and let the belief $b[I]$ over nodes from $I$ be known and fixed. Let $n_1, n_2 \in I$ be two nodes whose subtrees are isomorphic. Then a game $G'$ with the same structure as $G$ and with any belief $b'[I]$ in $I$ satisfying $b[n_1] + b[n_2] = b'[n_1] + b'[n_2]$ and $b[n] = b'[n]$ for all nodes other than $n_1$ and $n_2$ has the same value as $G$.
Thanks to Lemma 3 and the isomorphism of the subtrees beneath $s^0_e s^1_p s^1_e$ and $\bar{s}^0_e s^1_p s^1_e$, the histories $s^0_e s^1_p s^1_e$ and $\bar{s}^0_e s^1_p s^1_e$ can be merged and the associated beliefs added up. By repeating this process, we end up with a single history for each current position of the evader $s^1_e$ (let $s^0_e s^1_p s^1_e$ be such a history), whose belief is

$$b'[s^1_p](s^0_e s^1_p s^1_e) := \frac{\sum_{s^0_e \notin s^0_p} b(s^0_e)\,\pi_e(s^0_e, s^1_e)}{\sum_{\tilde{s}^1_e \notin s^1_p}\sum_{\tilde{s}^0_e \notin s^0_p} b(\tilde{s}^0_e)\,\pi_e(\tilde{s}^0_e, \tilde{s}^1_e)} = \frac{b^{\pi_e}(s^1_e)}{\sum_{\tilde{s}^1_e \notin s^1_p} b^{\pi_e}(\tilde{s}^1_e)} \qquad (6)$$

and which we write as $b'[s^1_p](s^1_e)$ for short.

The updated belief $b'[s^1_p]$ in Equation (6) complies with the belief $b^{\pi_e}$ (Equation (3)) updated with the information that the evader is located in none of the vertices in $s^1_p$. The belief in $I_p[s^1_p]$ matches the belief in the topmost information set of $G^{t-1}\langle s^1_p, b^{\pi_e}\rangle$, and the resulting game is the same as $G^{t-1}\langle s^1_p, b^{\pi_e}\rangle$ up to a multiplication by $\gamma$. The value of $G[s^1_p]$ (Equation (5)), from which this game was derived, is thus $\gamma\, v^{t-1}_{s^1_p}(b^{\pi_e})$.
We substitute this value into Equation (4) to obtain

$$u(\emptyset) = \sum_{s^0_e \in s^0_p} b(s^0_e) + \Big[\sum_{s^0_e \notin s^0_p} b(s^0_e)\Big] \cdot \sum_{s^1_p} \pi_p(s^1_p) \cdot \gamma\, v^{t-1}_{s^1_p}(b^{\pi_e}) \qquad (7)$$

By allowing the players to choose their optimal one-step strategies $\pi_p$ and $\pi_e$ in Equation (7), we obtain the desired maximin formula from Equation (2).
3.1 Computing One-Step Strategies
The evaluation of Equation (2) involves the computation of the optimal strategies of the players. In this section we show that if the value functions $v^{t-1}$ are piecewise linear and convex and are represented by sets of α-vectors (which holds due to Theorem 2), the strategies can be found by means of linear programming.

Due to limited space, we provide the linear program for computing the optimal one-step strategy in $G^t\langle s^0_p, b\rangle$ for the pursuer only. At the beginning of each round, the pursuer realizes which vertices the evader is not located in, and hence updates his belief about the position of the evader. We thus restrict ourselves to the case where $b(s_e) = 0$ for all $s_e \in s^0_p$.

In the following linear program, the pursuer seeks a strategy maximizing his expected utility against the best-responding opponent. He considers strategies of the form "move to $s^1_p$ first and then follow the strategy whose value is represented by $\alpha \in v^{t-1}_{s^1_p}$". The choice of $\alpha$ uniquely defines such a strategy. The probability of playing each strategy $\alpha \in v^{t-1}_{s^1_p}$ is represented by the variable $\hat{\pi}_p(s^1_p, \alpha)$. Constraint (9) corresponds to the value of playing such a randomized strategy against the best-responding evader who starts in vertex $s_e$ ($\alpha(s^0_e)$ denotes the value of $\alpha$ evaluated at the pure belief corresponding to action $s^0_e$ of the evader). The evader starts in $s_e$ with probability $b(s_e)$, hence the objective (8) computes the expectation over the individual values $v(s_e)$. For the resulting one-step strategy of the pursuer, it holds that $\pi_p(s^1_p) = \sum_{\alpha \in v^{t-1}_{s^1_p}} \hat{\pi}_p(s^1_p, \alpha)$.

$$\max_{v,\,\hat{\pi}_p}\;\; \gamma \sum_{s_e \in V} b(s_e) \cdot v(s_e) \qquad (8)$$

$$\text{s.t.}\;\; \sum_{s^1_p \in \mathrm{adj}(s^0_p)}\;\sum_{\alpha \in v^{t-1}_{s^1_p}} \alpha(s^0_e) \cdot \hat{\pi}_p(s^1_p, \alpha) \geq v(s_e) \qquad \forall\{s_e, s^0_e\} \in E \qquad (9)$$

$$\sum_{s^1_p \in \mathrm{adj}(s^0_p)}\;\sum_{\alpha \in v^{t-1}_{s^1_p}} \hat{\pi}_p(s^1_p, \alpha) = 1 \qquad (10)$$

$$\hat{\pi}_p(s^1_p, \alpha) \geq 0 \qquad \forall s^1_p \in \mathrm{adj}(s^0_p),\; \forall\alpha \in v^{t-1}_{s^1_p} \qquad (11)$$
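A sketch of how the linear program (8)-(11) could be assembled with scipy.optimize.linprog is shown below (the solver minimizes, so the objective is negated). The data layout, the adj_list and next_alphas structures, and the vertex ordering are assumptions made for illustration, not the authors' implementation; the belief b is assumed to satisfy $b(s_e) = 0$ for $s_e \in s^0_p$ as discussed above.

```python
import numpy as np
from scipy.optimize import linprog

def one_step_lp(adj_list, s0_p, b, gamma, next_alphas):
    """Sketch of LP (8)-(11): the pursuer's optimal one-step strategy at belief b.
    next_alphas maps every pursuer move s1_p in adj(s0_p) to a list of
    alpha-vectors (numpy arrays over V) representing v^{t-1}_{s1_p}."""
    V = sorted(adj_list)                       # fixed vertex ordering
    moves = [(s1_p, k) for s1_p in next_alphas for k in range(len(next_alphas[s1_p]))]
    n_v, n_pi = len(V), len(moves)

    # (8): maximize gamma * sum_e b(s_e) v(s_e)  ->  minimize its negation
    c = np.concatenate([-gamma * np.array([b.get(s, 0.0) for s in V]), np.zeros(n_pi)])

    # (9): v(s_e) - sum_{s1_p, alpha} alpha(s0_e) * pi_hat(s1_p, alpha) <= 0
    #      for every evader position s_e and every adjacent s0_e
    A_ub, b_ub = [], []
    for i, s_e in enumerate(V):
        for s0_e in adj_list[s_e]:
            row = np.zeros(n_v + n_pi)
            row[i] = 1.0
            for j, (s1_p, k) in enumerate(moves):
                row[n_v + j] = -next_alphas[s1_p][k][V.index(s0_e)]
            A_ub.append(row)
            b_ub.append(0.0)

    # (10): pi_hat is a probability distribution; (11): pi_hat >= 0
    A_eq = [np.concatenate([np.zeros(n_v), np.ones(n_pi)])]
    bounds = [(0.0, 1.0)] * n_v + [(0.0, None)] * n_pi

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=[1.0], bounds=bounds)
    pi_hat = {m: res.x[n_v + j] for j, m in enumerate(moves)}
    return -res.fun, pi_hat    # value of the update and the randomized strategy
```

The marginal one-step strategy is then recovered from the result as $\pi_p(s^1_p) = \sum_{\alpha} \hat{\pi}_p(s^1_p, \alpha)$, as stated above.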
3.2 Computing Value Functions
In each iteration of our value iteration algorithm, the value functions $v^t$ are constructed from the solution of the previous iteration, the value functions $v^{t-1}$. By repeating this construction, a sequence of finite-horizon value functions $\{v^t\}_{t=0}^{\infty}$ approaching the values of the infinite-horizon game is constructed. The value functions $v^t$ to be constructed, as well as $v^{t-1}$, are PWLC (Theorem 2). We show that this allows us to avoid evaluating the dynamic programming operator $H$ (Equation (2)) at every point of the belief space and enables us to construct $v^t$ using only a finite subset of beliefs: the extreme points of the line segments of $v^t$. We proceed in two steps: (1) first, we compute a function $Q^t_{\pi_p}\langle s^0_p\rangle$ corresponding to the expected utility the pursuer gets if he plays $\pi_p$ in the first round of the longer-horizon game $G^t\langle s^0_p, b\rangle$; (2) then we show how to compute $v^t_{s^0_p}$ as a combination of multiple $Q^t_{\pi_p}\langle s^0_p\rangle$ for properly chosen one-step strategies $\pi_p$. We start with a formal definition of the function $Q^t_{\pi_p}\langle s^0_p\rangle$.
Definition 4. Let $\pi_p$ be a pursuer's one-step strategy for the first round of the game $G^t\langle s^0_p, b\rangle$. The value of $\pi_p$ is a function $Q^t_{\pi_p}\langle s^0_p\rangle$ assigning the expected reward the pursuer gets in the game $G^t\langle s^0_p, b\rangle$ against the best-responding opponent when he plays $\pi_p$ in the first round and continues by playing according to his optimal strategy in the rest of the game, i.e.

$$Q^t_{\pi_p}\langle s^0_p\rangle(b) := \sum_{s_e \in s^0_p} b(s_e) + \gamma\Big[\sum_{s_e \in V\setminus s^0_p} b(s_e)\Big] \cdot \min_{\pi_e} \sum_{s^1_p \in V^N} \pi_p(s^1_p) \cdot v^{t-1}_{s^1_p}(b^{\pi_e}) \qquad (12)$$

According to the previous definition, once the first round of the game is over, the pursuer continues with his optimal strategy. The following lemma shows that this optimal strategy for the rest of the game can be characterized by the α-vectors of $v^{t-1}$.
Lemma 4. Let $\pi_p$ be a pursuer's fixed one-step strategy for the first round of the game. For every belief $b$ there are strategies $\sigma_p[s^1_p]$, one for each $s^1_p \in \mathrm{adj}(s^0_p)$, represented by α-vectors $\alpha[s^1_p] \in v^{t-1}_{s^1_p}$, such that it is optimal to follow $\sigma_p[s^1_p]$ when $s^1_p$ was played in the first round of the game. The value of the strategy $\sigma_p$ prescribing the pursuer to play according to $\pi_p$ in the first round and continue by using the respective $\sigma_p[s^1_p]$ is linear, and the corresponding α-vector satisfies

$$\alpha_{\sigma_p}(s_e) = \begin{cases} 1 & s_e \in s^0_p \\ \gamma \min_{s^0_e \in \mathrm{adj}(s_e)} \sum_{s^1_p} \pi_p(s^1_p) \cdot \alpha[s^1_p](s^0_e) & \text{otherwise} \end{cases} \qquad (13)$$

Lemma 4 gives us a direct algorithm for computing $Q^t_{\pi_p}$. The PWLC functions $v^{t-1}$ correspond to a finite number of horizon-$(t-1)$ strategies, represented by a finite number of α-vectors. There is only a finite number of ways to choose the strategies $\sigma_p[s^1_p]$ from Lemma 4, which can be found by means of enumeration. The maximization over the linear functions representing the values of such strategies corresponds to the function $Q^t_{\pi_p}\langle s^0_p\rangle$, which is thus piecewise linear and convex.
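To make the construction of Lemma 4 concrete, the sketch below evaluates Equation (13) for one fixed choice of continuation α-vectors; the dict-based representation (an α-vector as a mapping from vertices to values) is an illustrative assumption, not the authors' data structure.

```python
def compose_alpha(adj_list, s0_p, pi_p, chosen_alphas, gamma, V):
    """Eq. (13): alpha-vector of the strategy that plays pi_p in the first
    round of G^t<s0_p, b> and then, after pursuer move s1_p, follows the
    strategy represented by chosen_alphas[s1_p] (an alpha-vector of v^{t-1}_{s1_p})."""
    alpha = {}
    for s_e in V:
        if s_e in s0_p:                                # evader caught immediately
            alpha[s_e] = 1.0
        else:
            # the best-responding evader moves to the adjacent vertex that
            # minimizes the expectation over the pursuer's randomized move
            alpha[s_e] = gamma * min(
                sum(pi_p[s1_p] * chosen_alphas[s1_p][s_next] for s1_p in pi_p)
                for s_next in adj_list[s_e])
    return alpha
```

Enumerating all possible choices of chosen_alphas and keeping the resulting vectors gives a set of α-vectors whose pointwise maximum is $Q^t_{\pi_p}\langle s^0_p\rangle$.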
The definition of $Q^t_{\pi_p}\langle s^0_p\rangle$ implies that we can compute the value function $v^t_{s^0_p}$ by allowing the pursuer to play an arbitrary one-step strategy $\pi_p$, when

$$v^t_{s^0_p}(b) = \max_{\pi_p} Q^t_{\pi_p}\langle s^0_p\rangle(b) \qquad (14)$$

As a consequence of Theorem 1, it is sufficient to consider a finite set $\Pi_p$ of strategies in the maximizer of Equation (14) and obtain $v^t_{s^0_p}$ as the pointwise maximum of the respective $Q^t_{\pi_p}\langle s^0_p\rangle$ functions, $v^t_{s^0_p} = \bigoplus_{\pi_p \in \Pi_p} Q^t_{\pi_p}\langle s^0_p\rangle$. The set of such strategies is, however, initially unknown. We propose Algorithm 1, which constructs both the set of strategies $\hat{\Pi}_p$ and the value function $\hat{v}^t_{s^0_p}$ incrementally by iteratively verifying whether the current set $\hat{\Pi}_p$ is sufficient for obtaining the actual value function $v^t_{s^0_p}$.

  $\hat{v}^t_{s^0_p} \leftarrow \{0^{|V|}\}$; $\hat{\Pi}_p \leftarrow \emptyset$
  while $\exists b \in \Delta(V) : v^t_{s^0_p}(b) > \hat{v}^t_{s^0_p}(b)$ do
    $\pi_p \leftarrow$ optimal strategy of the pursuer at belief $b$ for the first round (see (8))
    $\hat{\Pi}_p \leftarrow \hat{\Pi}_p \cup \{\pi_p\}$
    $\hat{v}^t_{s^0_p} \leftarrow \hat{v}^t_{s^0_p} \oplus Q^t_{\pi_p}\langle s^0_p\rangle$
  return $\hat{v}^t_{s^0_p}$
Algorithm 1: Incremental construction of $v^t_{s^0_p}$.

The algorithm constructs a set of strategies $\hat{\Pi}_p$ and a corresponding estimate of the value function $\hat{v}^t_{s^0_p}$, starting with an empty $\hat{\Pi}_p$. In each iteration, it verifies whether the strategies in $\hat{\Pi}_p$ used to form the current $\hat{v}^t_{s^0_p}$ are optimal in every belief $b \in \Delta(V)$. If a belief $b$ where the strategy can be improved is found, i.e. $Q^t_{\pi_p}\langle s^0_p\rangle(b) > \hat{v}^t_{s^0_p}(b)$ for some $\pi_p$, it updates $\hat{\Pi}_p$ and recomputes $\hat{v}^t_{s^0_p}$. If no such belief $b$ exists, all required strategies were considered and $\hat{v}^t_{s^0_p} = v^t_{s^0_p}$.

Whenever the value function $\hat{v}^t_{s^0_p}$ is not yet optimal for all beliefs, i.e. there exists a belief $b$ where $v^t_{s^0_p}(b) > \hat{v}^t_{s^0_p}(b)$, there also exists a belief $b'$ with the same property that forms an extreme point of a line segment of $\hat{v}^t_{s^0_p}$. This is characterized by Lemma 5.

Lemma 5. If there is a belief $b$ where $v^t_{s^0_p}(b) > \hat{v}^t_{s^0_p}(b)$, there must be a belief $b'$ that forms an extreme point of a line segment on the surface of $\hat{v}^t_{s^0_p}$ where $v^t_{s^0_p}(b') > \hat{v}^t_{s^0_p}(b')$.

Thanks to Lemma 5, we can consider only the finite set of beliefs that form extreme points of the line segments of the value function $\hat{v}^t_{s^0_p}$. In every iteration, a one-step strategy that is optimal at some belief point (and thus must be present in $\Pi_p$) is added to $\hat{\Pi}_p$. Due to Theorem 1, the set $\Pi_p$ required to obtain the optimal value function $v^t_{s^0_p}$ is finite. Hence, after a finite number of iterations, Algorithm 1 terminates.
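To summarize Section 3.2, a high-level sketch of how Algorithm 1 could be organized in code is given below. The helper functions passed as arguments (find_improvable_belief, optimal_one_step_strategy, q_function_alphas) are assumptions standing in for the extreme-point search of Lemma 5, the linear program (8)-(11), and the α-vector compositions of Lemma 4, respectively; they are not spelled out here.

```python
def construct_value_function(s0_p, v_prev, gamma, game,
                             find_improvable_belief,     # Lemma 5 search (assumed)
                             optimal_one_step_strategy,  # LP (8)-(11) (assumed)
                             q_function_alphas):         # Lemma 4 compositions (assumed)
    """One application of the DP operator H for a fixed pursuer position s0_p:
    Algorithm 1 collects one-step strategies incrementally until no belief
    remains at which the current estimate hat-v can be improved."""
    alphas = [{s_e: 0.0 for s_e in game.vertices}]   # hat v^t <- {0^{|V|}}
    strategies = []                                   # hat Pi_p <- empty set
    while True:
        b = find_improvable_belief(s0_p, alphas, v_prev, gamma, game)
        if b is None:              # no improvable extreme point: hat v^t = v^t
            return alphas, strategies
        pi_p = optimal_one_step_strategy(s0_p, b, v_prev, gamma, game)
        strategies.append(pi_p)                       # hat Pi_p <- hat Pi_p + {pi_p}
        alphas += q_function_alphas(s0_p, pi_p, v_prev, gamma, game)  # add Q^t_{pi_p}
```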
3.3 Convergence of the Algorithm
We demonstrate the convergence of our value iteration algorithm by showing that the dynamic programming operator $H$ (Equation (2)) has a unique fixpoint which is reached by its iterative application. We obtain this by showing that $H$ is a contraction mapping under the following max-norm and applying Banach's fixed point theorem (Ciesielski et al., 2007).

$$\|v - v'\| = \max_{s^0_p \in V^N}\;\max_{b \in \Delta(V)} |v_{s^0_p}(b) - v'_{s^0_p}(b)| \qquad (15)$$

Lemma 6. The operator $H$ is a contraction with contractivity factor $\gamma < 1$ under the max-norm.

Theorem 4. There is a unique set of value functions $v^*$ satisfying $v^* = Hv^*$, and the recursive application of $H$ converges to $v^*$. The series $\{v^t\}_{t=0}^{\infty}$ thus converges to the value functions of the infinite-horizon game.

Proof. The operator $H$ is a contraction mapping defined on a metric space of sets of bounded functions defined on the belief space. By applying Banach's fixed point theorem (Ciesielski et al., 2007) we get that $H$ has a unique fixed point $v^*$ and that the recursive application of $H$ converges to $v^*$.

Proposition 1. After $t$ iterations of the value iteration algorithm it holds that $\|v^t - v^*\| \leq \gamma^t$.
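Proposition 1 yields a direct stopping rule: to guarantee accuracy $\varepsilon$ it suffices to run $t \geq \log\varepsilon / \log\gamma$ iterations. A small illustrative computation (not from the paper):

```python
import math

def iterations_needed(gamma, eps):
    """Smallest t with gamma**t <= eps, i.e. ||v^t - v*|| <= eps (Proposition 1)."""
    return math.ceil(math.log(eps) / math.log(gamma))

print(iterations_needed(0.95, 1e-3))   # 135 iterations
```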
4 CONCLUSIONS
We present the first algorithm for solving the class
of two-player discounted pursuit-evasion games with
infinite horizon and partial observability, where the
evader is assumed to be perfectly informed about the
current state of the game (i.e. position of pursuer’s
units). This class of games has a significant relevance
in security domains where a robust strategy that pro-
vides guarantees in the worst case is often desirable.
Our algorithm is a modification of the well-known
value iteration algorithm for solving Partially Ob-
servable Markov Decision Processes (POMDPs), or
stochastic games with concurrent moves. We show
that the strategies can be compactly represented us-
ing value functions that depend on the location of the
pursuing units and the belief about the position of the
evader, but not explicitly on the history of moves.
These value functions are piecewise linear and con-
vex and allow us to design a dynamic programming
operator for the value iteration algorithm.
Our work is the first step towards many practical
algorithms for solving discounted stochastic games
with one-sided partial observability. These can be
applied in many scenarios requiring robust strategies
and thus our work opens a whole new area of research in algorithmic and computational game theory.
One natural continuation is an adaptation of point-
based approximation algorithms for POMDPs to im-
prove the scalability of the value iteration algorithm.
ACKNOWLEDGEMENTS
This research was supported by the Czech Science
Foundation (grant no. 15-23235S) and by the Grant
Agency of the Czech Technical University in Prague,
grant No. SGS16/235/OHK3/3T/13.
REFERENCES
Chung, T. H., Hollinger, G. A., and Isler, V. (2011). Search and pursuit-evasion in mobile robotics. Autonomous Robots, 31(4):299–316.

Ciesielski, K. et al. (2007). On Stefan Banach and some of his results. Banach Journal of Mathematical Analysis, 1(1):1–10.

Hansen, E. A., Bernstein, D. S., and Zilberstein, S. (2004). Dynamic programming for partially observable stochastic games. In AAAI, volume 4, pages 709–715.

Koller, D., Megiddo, N., and Von Stengel, B. (1996). Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(2):247–259.

McEneaney, W. M. (2004). Some classes of imperfect information finite state-space stochastic games with finite-dimensional solutions. Applied Mathematics and Optimization, 50(2):87–118.

Monahan, G. E. (1982). State of the art: a survey of partially observable Markov decision processes: theory, models, and algorithms. Management Science, 28(1):1–16.

Pineau, J., Gordon, G., Thrun, S., et al. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, pages 1025–1032.

Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100.

Smallwood, R. D. and Sondik, E. J. (1973). The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5):1071–1088.

Smith, T. and Simmons, R. (2012). Point-based POMDP algorithms: Improved analysis and implementation. arXiv preprint arXiv:1207.1412.

Vanderbei, R. J. (2014). Linear Programming. Springer.

Vidal, R., Shakernia, O., Kim, H. J., Shim, D. H., and Sastry, S. (2002). Probabilistic pursuit-evasion games: theory, implementation, and experimental evaluation. IEEE Transactions on Robotics and Automation, 18(5):662–669.