Dynamic Programming for One-sided Partially Observable
Pursuit-evasion Games
Karel Horák and Branislav Bošanský
Department of Computer Science, Faculty of Electrical Engineering,
Czech Technical University in Prague, Prague, Czech Republic
Keywords:
Pursuit-evasion Games, One-sided Partial Observability, Infinite Horizon, Value Iteration, Concurrent Moves.
Abstract:
Pursuit-evasion scenarios appear widely in robotics, security domains, and many other real-world situations.
We focus on two-player pursuit-evasion games with concurrent moves, infinite horizon, and discounted re-
wards. We assume that the players have partial observability, however, the evader has an advantage of know-
ing the current position of pursuer’s units. This setting is particularly interesting for security domains where
a robust strategy, maximizing the utility in the worst-case scenario, is often desirable. We provide, to the best
of our knowledge, the first algorithm that provably converges to the value of a partially observable pursuit-
evasion game with infinite horizon. Our algorithm extends the well-known value iteration algorithm by exploiting
that (1) value functions of our game depend only on the position of the pursuer and the belief he has about the
position of the evader, and (2) that these functions are piecewise linear and convex in the belief space.
1 INTRODUCTION
Pursuit-evasion games (PEGs) appear in many sce-
narios in robotics and security domains (Vidal et al.,
2002; Chung et al., 2011). A team of centrally con-
trolled pursuing units (the pursuer) aims to locate
and capture the evader, while the evader aims for the
opposite. We study these games and assume their
discrete-time variant played on a finite graph. We as-
sume that units of both players move simultaneously,
the horizon of the game is infinite, rewards are dis-
counted over time with discount factor $\gamma \in [0,1)$, and the players have only partial information about the
current state. Formally, such a game belongs to zero-
sum partially observable stochastic games (POSGs).
We aim for finding robust strategies of the pur-
suer against the worst-case evader. Specifically, we
assume that the evader knows the positions of pur-
suer’s units and her only uncertainty is the strategy of
the pursuer and the move he will perform next. Al-
though such a perfectly informed adversary is rarely encountered in reality, the pursuer usually does not know what
information is being revealed to the evader. Hence,
in order to derive robust strategies (i.e. maximizing
pursuer’s reward against any type of the evader), it is
natural to use such a perfectly informed adversary.
We design the first algorithm that provably con-
verges to the value of such one-sided partially observ-
able pursuit-evasion games. Moreover, as the value
converges, strategies of the players converge to the
optimal strategies as well. This contrasts with exist-
ing approaches in robotics and security, where heuris-
tic solutions without any optimality guarantees are
used (Vidal et al., 2002; Chung et al., 2011).
Our algorithm extends the well-known value it-
eration algorithms for concurrent-moves stochas-
tic games (Shapley, 1953) and partially observable
Markov decision processes (POMDPs) (Smallwood
and Sondik, 1973; Monahan, 1982; Pineau et al.,
2003; Smith and Simmons, 2012). We show that, sim-
ilarly to POMDPs, one-sided pursuit-evasion games
allow us to define compactly represented value func-
tions and propose a dynamic programming approach
to approximate them in an iterative manner. Specifi-
cally we show that the value functions (1) depend only
on the position of the pursuer’s units and his belief
about the possible position of the evader, but not on
the history of moves, (2) these functions are piecewise
linear and convex and thus we can represent them as
a set of α-vectors (Section 2.1), and (3) we can de-
sign a dynamic-programming operator with provable
convergence to optimal value functions (Section 3).
Our results for one-sided partially observable
pursuit-evasion games have similar implications as
those derived for POMDPs. Our paper is thus the first
step in a whole line of research. The importance of
the results is highlighted in the derivation of the full-
backup value iteration algorithm. Moreover, due to
the close similarity with POMDP models in the struc-
ture of the solution, efficient point-based versions of
the algorithm should be applicable as well.
Due to the space constraints, proofs of some re-
sults can be found in the full version of the paper.
1.1 Related Work
A similar model with one-sided partial observability
where one of the players has a perfect information
was presented in (McEneaney, 2004). This player is
assumed to know the action the opponent will play
at the current stage. Such a game is essentially turn-based, and only pure strategies are thus considered.
Disregarding randomization has severe limita-
tions. In many cases, if the evader knows the action of
the pursuer before she has to decide herself, she can
use this additional information to avoid getting caught (simply by forever avoiding the vertices the pursuer is about to move to next). Randomized strategies thus
better correspond to real-world problems occurring in
real-time. Using them, however, presents additional
challenges that we solve in this paper.
Finite horizon POSGs can be also solved by con-
verting a game to the matrix form by enumerating all
pure strategies of the players. In (Hansen et al., 2004),
the pure strategies are constructed in an incremen-
tal way using dynamic programming while pruning those that are dominated. Although this improves on the naïve enumeration approach, the number of strate-
gies is still exponential in the horizon in the worst case
and so is their size, which makes the algorithm im-
practical when focusing on long-term interactions.
2 FINITE-HORIZON GAME
We use the notion of finite-horizon POSGs, or
extensive-form games (EFGs), to reason about an
infinite-horizon pursuit-evasion game. An EFG is a
tuple $G = (\mathcal{N}, \mathcal{H}, \mathcal{Z}, \mathcal{T}, u, \mathcal{I})$. $\mathcal{N}$ is the set of players, in our case $\mathcal{N} = \{p, e\}$, where $p$ stands for the pursuer and $e$ for the evader. Set $\mathcal{H}$ denotes a finite set of histories of actions taken by all players from the beginning of the game. Every history corresponds to a node in the game tree; hence we use the terms history and node interchangeably. Each of the histories may be either (1) terminal ($h \in \mathcal{Z} \subseteq \mathcal{H}$), where the game ends and player $i$ gets utility $u_i(h)$, (2) controlled by the nature that selects the successor node according to a fixed probability distribution known to all players, or (3) a node where one of the players from $\mathcal{N}$ is to act. We consider a zero-sum scenario where $u_p(h) = -u_e(h)$. To simplify the notation we use $u(h)$ to denote the pursuer's reward. An ordered list of transitions of player $i$ from the root to node $h$ is referred to as player $i$'s sequence. Allowed transitions in the game are modeled using a transition function $\mathcal{T}$ that provides a set of successor nodes for each non-terminal history. The imperfect observation of players is modelled via information sets $\mathcal{I}_i$ that form a partition over histories $h$ where player $i \in \mathcal{N}$ takes action. We assume a perfect recall setting where the players never forget their past actions, i.e. for every $I_i \in \mathcal{I}_i$, all histories $h \in I_i$ have the same player $i$'s sequence. Each information set $I_i \in \mathcal{I}_i$ corresponds to one decision point of player $i$. A randomized behavioral strategy $\sigma_i$ of player $i$ assigns a distribution over actions to each of the information sets in $\mathcal{I}_i$. $\sigma_i$ can be represented in the form of a realization plan $r$ which assigns the probability $r(\sigma_i)$ of playing sequence $\sigma_i$ to each player $i$'s sequence $\sigma_i$. The behavioral strategy at an information set $I_i \in \mathcal{I}_i$ reached using a sequence $\sigma_i$ is then $b(I_i, a) = r(\sigma_i a)/r(\sigma_i)$. A Nash equilibrium (NE) in an EFG is a pair of behavioral strategies, in which each player best-responds to the strategy of his opponent. The expected utility of playing NE strategies is termed the value of the game.
We will now use this terminology to construct an EFG for a finite-horizon version of a pursuit-evasion game with $N$ pursuing units played on a graph $G = \langle V, E\rangle$ for $t$ rounds (we term $t$ the horizon). Part of the game tree is shown in Figure 1. At every round $\tau \leq t$, the pursuer's units occupy vertices $s^\tau_p$, where $s^\tau_p = \{s^\tau_{p,1}, \ldots, s^\tau_{p,N}\}$ is an $N$-element multiset of vertices of $G$, and the evader is located in a vertex $s^\tau_e \in V$. The goal of the pursuer is to achieve a situation where the evader is caught, i.e. $s^\tau_e \in s^\tau_p$. In every round, the players move their units to vertices adjacent to their current positions ($\mathrm{adj}(v)$ denotes the set of vertices adjacent to $v$). The position of the evader in round $\tau+1$ is thus $s^{\tau+1}_e \in \mathrm{adj}(s^\tau_e)$. We overload the operator $\mathrm{adj}$ to apply also to multisets representing the positions of the pursuer's units, i.e. $s^{\tau+1}_p \in \mathrm{adj}(s^\tau_p)$, where $\mathrm{adj}(s^\tau_p) = \times_{i=1\ldots N}\,\mathrm{adj}(s^\tau_{p,i})$.
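As a small illustration of the overloaded adjacency operator (the graph, the identifiers and the decision to allow a unit to stay in place are assumptions of this sketch, not part of the paper's model), the joint moves of the pursuer's units are the Cartesian product of the per-unit adjacencies, treated as multisets:

```python
from itertools import product

# Illustrative adjacency lists of a small undirected graph G = (V, E);
# here every vertex is also adjacent to itself, i.e. a unit may stay put
# (drop the self-loops if staying in place is not allowed).
adj_list = {
    0: [0, 1],
    1: [0, 1, 2],
    2: [1, 2],
}

def adj_evader(v):
    """Vertices the evader may occupy in the next round."""
    return adj_list[v]

def adj_pursuer(positions):
    """Joint moves of all pursuing units: the Cartesian product of the
    per-unit adjacencies; positions is a sorted tuple used as a multiset."""
    return sorted({tuple(sorted(move))
                   for move in product(*(adj_list[v] for v in positions))})

print(adj_evader(1))        # [0, 1, 2]
print(adj_pursuer((0, 2)))  # [(0, 1), (0, 2), (1, 1), (1, 2)]
```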
A horizon-$t$ game $G^t\langle s^0_p, b^0\rangle$ is parametrized by the initial position of the pursuer $s^0_p \in V^N$ and a distribution over the evader's initial positions $b^0 \in \Delta(V)$ known to both players (we term $b^0$ the initial belief). The game starts with a chance move selecting the initial position of the evader $s^0_e$ (based on $b^0$).
A history $h \in \mathcal{H}$ in a game with horizon $t$ corresponds to a list of positions $s^0_e s^1_p s^1_e \cdots s^\tau_p s^\tau_e$, where $\tau \leq t$. The utility values are assigned to terminal histories as follows: if the pursuer failed to capture the evader in time, i.e. if $\tau = t$ and $s^\tau_e \notin s^\tau_p$, he gets utility $u(h) = 0$; if he successfully captured the evader within the time limit $t$, i.e. if $\tau \leq t$ and $s^\tau_e \in s^\tau_p$, he gets the reward $u(h) = \gamma^\tau$ for capturing the evader in $\tau$ rounds (where $\gamma \in [0,1)$ is the discount factor). The transition function $\mathcal{T}$ complies with the graph (i.e., the adjacency function $\mathrm{adj}$), hence $s^\tau_p \in \mathrm{adj}(s^{\tau-1}_p)$ and $s^\tau_e \in \mathrm{adj}(s^{\tau-1}_e)$ for every $\tau \geq 1$. For notational simplicity we denote the sequence of pursuer's actions $s^1_p \cdots s^\tau_p$ in $h$ as $h|_p$ and the sequence of evader's actions $s^1_e \cdots s^\tau_e$ as $h|_e$.

[Figure 1: EFG representation of a finite-horizon PEG (game tree omitted).]
The position of the evader is unknown to the pursuer. Hence, in a perfect recall game, there is one pursuer's information set $I_p[\sigma_p]$ for each of his sequences $\sigma_p$, where $I_p[\sigma_p] = \{h' \mid h' \in \mathcal{H}\setminus\mathcal{Z} : h'|_p = \sigma_p\}$. The evader, on the other hand, knows the situation almost perfectly. She knows where the pursuer's units were located before the pursuer acted in the current round (recall that the pursuer acts first). The only information missing to the evader is the action being taken by the pursuer in the current round. Hence, for every history $h = s^0_e s^1_p s^1_e \cdots s^\tau_p s^\tau_e$ where the pursuer is to play, there is an evader's information set $I_e[h] = \{h\,s^{\tau+1}_p \mid s^{\tau+1}_p \in \mathrm{adj}(s^\tau_p)\}$ containing all possible continuations of the pursuer.
2.1 Shape of the Value Function
The sizes of the extensive-form representation and as-
sociated behavioral strategies grow exponentially as
the horizon increases. This makes it quickly impos-
sible to use standard algorithms for game trees, espe-
cially since we aim to solve infinite horizon games.
We alleviate the problem of increasing complexity
of the strategy representation by representing strate-
gies only using their values. We show that the value
of a strategy is linear in the belief, and we can thus
represent it using just |V | real numbers. Moreover,
when the horizon is finite, we need to consider only
finitely many strategies regardless of the initial belief,
which makes value functions, formed by values of
best strategies for each belief, be piecewise linear and
convex and allows us to represent them compactly.
Definition 1. A value function $v^t_{s^0_p} : \Delta(V) \to [0,1]$ is a function assigning the value $v^t_{s^0_p}(b^0)$ of the game $G^t\langle s^0_p, b^0\rangle$ to every initial belief $b^0$. By $v^t$ we mean a set of value functions $v^t_{s^0_p}$, one for each initial position $s^0_p \in V^N$ of the pursuer.

In the following text, we show that the value function $v^t_{s^0_p}$ is piecewise linear and convex (PWLC) in the belief for every finite horizon $t$. For notational simplicity, the term linear is used to refer to affine functions as well. The proof is structured as follows: (1) first we show that the expected utility of every pursuer's strategy is linear in the belief; next (2) we show that it is sufficient to consider a finite set of pursuer's strategies $\Sigma^t_{s^0_p}$ when looking for a Nash equilibrium strategy; and finally (3) we show that the PWLC nature of the value function follows from (1) and (2).
Lemma 1. Let $\sigma_p$ be a randomized behavioral strategy of the pursuer in the games $G^t\langle s^0_p, b^0\rangle$, where the pursuer starts in vertices $s^0_p$, parametrized by the initial belief $b^0$. The expected utility of playing $\sigma_p$ against a best-responding opponent is linear in $b^0$.
Theorem 1. Let $G^t\langle s^0_p, b^0\rangle$ be a horizon-$t$ game parametrized by the initial belief $b^0$ where the pursuer starts in a set of vertices $s^0_p$. There exists a finite set of pursuer's behavioral strategies $\Sigma^t_{s^0_p}$ such that for every initial belief $b^0$, there is at least one strategy $\sigma_p \in \Sigma^t_{s^0_p}$ that is in a Nash equilibrium of $G^t\langle s^0_p, b^0\rangle$.
Proof. We use the sequence-form linear program for solving EFGs (Koller et al., 1996) to reason about the set of strategies $\Sigma^t_{s^0_p}$. In this LP, the values in every information set of the evader, as well as the value $v(\mathrm{root})$ in the root node, are computed in a bottom-up fashion. Every such value $v(I_e)$ of an information set $I_e$ can be seen as a concave piecewise linear function in the space of pursuer's realization plans (a compact representation of his behavioral strategies). The pursuer then seeks a realization plan that maximizes $v(\mathrm{root})$; the maximizer can be found among the extreme points of the line segments of $v(\mathrm{root})$, i.e. the vertices of a polytope bounded by this function (Vanderbei, 2014). We show that the set of such extreme points does not depend on the initial belief $b^0$.

There is one information set $I_e[s^0_e]$ of the evader for each of her initial positions $s^0_e$. The utility of every terminal node in the subgame beneath $I_e[s^0_e]$ is multiplied by the chance probability $b^0(s^0_e)$, which allows us to factor out this probability and obtain the following constraint for the root node:

$$v(\mathrm{root}) \leq \sum_{s^0_e \in s^0_p} b^0(s^0_e) + \sum_{s^0_e \in V\setminus s^0_p} b^0(s^0_e) \cdot \hat{v}(I_e[s^0_e]) \qquad (1)$$

The value $v(\mathrm{root})$ is a convex combination of concave piecewise linear functions $\hat{v}(I_e[s^0_e])$. As the belief was factored out, these functions, as well as the finite set of their extreme points $P[s^0_e]$, no longer depend on the belief. This convex combination with arbitrary coefficients $b^0$ cannot have an extreme point where none of the functions $\hat{v}(I_e[s^0_e])$ has one. The set of extreme points is therefore a subset of $\bigcup_{s^0_e} P[s^0_e]$, a finite set that does not depend on the belief. Each of the extreme points in $\bigcup_{s^0_e} P[s^0_e]$ corresponds to one pursuer's realization plan, and thus to one of his behavioral strategies, which allows us to construct the finite set $\Sigma^t_{s^0_p}$.
Theorem 2. The value function $v^t_{s^0_p}$ is piecewise linear and convex in the belief space.

Proof. This result directly follows from Lemma 1 and Theorem 1. There is a finite set of randomized strategies $\Sigma^t_{s^0_p}$ that has to be considered by the pursuer, and the value of each such strategy is linear in the belief space. Thus the value function $v^t_{s^0_p}$ is a pointwise maximum taken over a finite set of linear functions, which is a PWLC function in the belief space.
A PWLC function can be represented as a finite set of α-vectors. Every α-vector $\alpha = (\alpha_1, \ldots, \alpha_{|V|})$ represents one of the affine functions by assigning an expected reward $\alpha_i$ to each pure belief. We will often work with the α-vector representation of a value function, hence we overload the notation and consider value functions to be sets of such α-vectors as well. Lemma 1 and Theorem 1 imply that each linear segment of the value function matches one pursuer's strategy; we thus use the terms α-vector and pursuer's strategy interchangeably. This is similar to POMDPs, where each α-vector matches one conditional plan.
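As an illustration of this representation (the names and the numbers below are made up for the example), a value function stored as a set of α-vectors is evaluated at a belief by taking the pointwise maximum of the corresponding linear functions:

```python
import numpy as np

# A PWLC value function for one pursuer position is a set of alpha-vectors;
# alpha_i is the expected reward at the pure belief "evader is in vertex i".
alphas = np.array([
    [1.0, 0.2, 0.1],   # value of one pursuer strategy
    [1.0, 0.0, 0.4],   # value of another pursuer strategy
])

def value(alphas, belief):
    """v(b) = max_alpha <alpha, b>: pointwise maximum over linear functions."""
    return float(np.max(alphas @ belief))

b = np.array([0.0, 0.5, 0.5])   # belief over |V| = 3 vertices
print(value(alphas, b))         # max(0.15, 0.2) = 0.2
```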
3 VALUE ITERATION
In the previous section, we related the concept of the value functions to the EFG representation of the game and discussed that these functions have desirable properties. We leverage their representation to design a dynamic programming approach inspired by value iteration algorithms for either POMDPs (Smallwood and Sondik, 1973; Monahan, 1982) or perfect-information stochastic games (Shapley, 1953). A sequence of value functions $\{v^t\}_{t=0}^{\infty}$ is constructed by the algorithm, starting with the values of a horizon-0 game, where the pursuer wins only when he starts in the same vertex as the evader.

We avoid using the exponentially-sized representation of the underlying EFG by computing the value function of a horizon-$t$ game using the solution of the game with horizon $t-1$. First, we state a well-defined value update formula that expresses $v^t$ in terms of $v^{t-1}$ (Theorem 3). We let the players choose their strategies for the first round of the horizon-$t$ game using the maximin principle (we term these one-step strategies) and we show that the pursuer can use these strategies to update his belief. The pursuer's one-step strategy $\pi_p$ is a distribution over the possible actions of his units, $\pi_p \in \Delta(\mathrm{adj}(s^0_p))$, from which he samples his action. The evader acts similarly; however, she conditions her decision on her true position $s^0_e$ (not just on the overall belief available to the pursuer). Her one-step strategy is thus a mapping $\pi_e : V \to \Delta(V)$ such that $\pi_e(s^0_e)$ assigns zero probability to vertices not adjacent to $s^0_e$.

The piecewise linearity and convexity of the value functions have implications for their computation. First, they allow us to find optimal one-step strategies by means of linear programming (Section 3.1). Furthermore, we need not evaluate the value update formula at every point of the belief space to construct new value functions. Instead, we can use an incremental algorithm which inspects the extreme points of the line segments of a temporary function to check whether it can terminate because the value function has been constructed, or whether further linear segments have to be added.
Theorem 3. The value of $G^t\langle s^0_p, b\rangle$ is computed from the value functions $v^{t-1}$ of horizon-$(t-1)$ games. It holds that

$$v^t_{s^0_p}(b) = \sum_{s_e \in s^0_p} b(s_e) + \gamma\Big[\sum_{s_e \in V\setminus s^0_p} b(s_e)\Big] \cdot \max_{\pi_p}\min_{\pi_e} \sum_{s^1_p \in V^N} \pi_p(s^1_p) \cdot v^{t-1}_{s^1_p}(b^{\pi_e}) \qquad (2)$$

where the transformed belief $b^{\pi_e}$ depends solely on the evader's one-step strategy $\pi_e$ and the parametrization of the game $G^t\langle s^0_p, b\rangle$:

$$b^{\pi_e}(s^0_e) = \frac{1}{\sum_{s_e \in V\setminus s^0_p} b(s_e)} \sum_{s_e \in V\setminus s^0_p} b(s_e) \cdot \pi_e(s_e, s^0_e) \qquad (3)$$

The computation of $v^t$ using Eq. (2) forms a dynamic programming operator $H$, such that $v^t = Hv^{t-1}$.
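A minimal sketch of the belief transformation in Equation (3), assuming (for illustration only) that a belief is a dict mapping vertices to probabilities and that the evader's one-step strategy $\pi_e$ maps each vertex to a dict over its neighbours:

```python
def transformed_belief(b, pi_e, s0_p):
    """Belief b^{pi_e} over the evader's next position, Eq. (3):
    renormalize b over the vertices outside the pursuer's position s0_p,
    then push it forward through the evader's one-step strategy pi_e.
    Assumes some belief mass lies outside s0_p."""
    free = {s: p for s, p in b.items() if s not in s0_p}
    mass = sum(free.values())                      # sum_{s_e not in s0_p} b(s_e)
    b_next = {}
    for s_e, p in free.items():
        for s_next, q in pi_e[s_e].items():        # pi_e(s_e, s_next)
            b_next[s_next] = b_next.get(s_next, 0.0) + p * q / mass
    return b_next

# Example on a path 0 - 1 - 2 with a single pursuer unit at vertex 1:
b = {0: 0.5, 1: 0.0, 2: 0.5}
pi_e = {0: {0: 0.5, 1: 0.5}, 2: {1: 0.5, 2: 0.5}}
print(transformed_belief(b, pi_e, s0_p={1}))       # {0: 0.25, 1: 0.5, 2: 0.25}
```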
Proof. The correctness of the value update formula will be proved by computing the value of $G^t\langle s^0_p, b\rangle$ in a bottom-up fashion. We start by considering that the one-step strategies of the players for the first round of the game are fixed, while they play optimally afterward. This determines the pursuer's expected reward at every node in the game tree, which we use to express his expected utility in the root node (Lemma 2). As the behavior in the first round of the game is fixed, parts of the game tree are independent of each other; we refer to these subgames as $G[s^1_p]$. This allows us to evaluate the expectation from Lemma 2 by solving these games separately. It turns out that the games $G[s^1_p]$ are strategically equivalent to the shorter-horizon games $G^{t-1}\langle s^1_p, b^{\pi_e}\rangle$, the solution of which is represented by the value functions $v^{t-1}$. The expectation can thus be expressed solely in terms of $v^{t-1}$. Finally, we relax the assumption of fixed strategies in the first round, which yields the desired maximin formula (Equation (2)).
Let $\pi_p \in \Delta(\mathrm{adj}(s^0_p))$ be a fixed pursuer's one-step strategy, and let $\pi_e : V \to \Delta(V)$ be a fixed one-step strategy of the evader. Assume that both players play according to $\pi_p$ and $\pi_e$ in the first round of the game, i.e. the pursuer follows $\pi_p$ in his information set $I_p[\emptyset]$ (i.e. the pursuer's information set where he has not acted yet, see Figure 1) and the evader plays according to $\pi_e(s^0_e)$ in her information set $I_e[s^0_e]$ (where she has received the information that she is located in vertex $s^0_e$). Once the first round of the game is over, the players continue with the best strategies available to them. We denote such optimal strategies, where the players are restricted to play $\pi_p$ and $\pi_e$ in the first round, as $\sigma_p$ and $\sigma_e$.
Definition 2. Let $\pi_p$, $\pi_e$ be fixed one-step strategies for the first round of $G^t\langle s^0_p, b\rangle$ and let $\sigma_p$, $\sigma_e$ be optimal strategies with the restriction to play $\pi_p$ and $\pi_e$ in the first round. The pursuer's expected reward when $(\sigma_p, \sigma_e)$ are followed and node $h$ in the game tree is reached is denoted $u(h)$ and termed the expected reward in $h$.
We proceed by expressing the pursuer's expected utility when the strategies $(\sigma_p, \sigma_e)$ are followed, by propagating expected rewards from subsequent nodes in the game tree. We use histories of the form $s^0_e s^1_p s^1_e$ where the evader started in vertex $s^0_e$ (based on the move of nature) and then, in the first round, the pursuer moved his units to vertices $s^1_p$ and the evader moved to $s^1_e$.
Lemma 2. The expected reward in the root node is:

$$u(\emptyset) = \sum_{s^0_e \in s^0_p} b(s^0_e) + \Big[\sum_{s^0_e \notin s^0_p} b(s^0_e)\Big] \cdot \sum_{s^1_p} \pi_p(s^1_p)\Bigg(\gamma \sum_{s^1_e \in s^1_p} b^{\pi_e}(s^1_e) + \Big[\sum_{s^1_e \notin s^1_p} b^{\pi_e}(s^1_e)\Big] \cdot \sum_{s^1_e \notin s^1_p}\sum_{s^0_e \notin s^0_p}\Bigg[\frac{b(s^0_e)\,\pi_e(s^0_e, s^1_e)}{\sum_{\tilde{s}^1_e \notin s^1_p}\sum_{\tilde{s}^0_e \notin s^0_p} b(\tilde{s}^0_e)\,\pi_e(\tilde{s}^0_e, \tilde{s}^1_e)} \cdot u(s^0_e s^1_p s^1_e)\Bigg]\Bigg) \qquad (4)$$
Lemma 2 expresses the value in the root node based on the expected rewards in histories $s^0_e s^1_p s^1_e$ where the pursuer is to move. The pursuer knows only $s^1_p$, hence these histories are partitioned into his information sets $I_p[s^1_p]$, one for each pursuer's move $s^1_p$ in the first round (see Figure 1). Importantly, for every subgame below $I_p[s^1_p]$, there is no information set that would involve nodes not present in this subgame; neither the pursuer nor the evader forgets that $s^1_p$ was played. The optimal behavior in these subgames hence depends only on the belief in $I_p[s^1_p]$, which is fixed due to the fixed behavior in the first round. We can thus compute the value of the subgame below $I_p[s^1_p]$ separately by making chance simulate the belief in $I_p[s^1_p]$.
Let us construct a game $G[s^1_p]$ which consists of the information set $I_p[s^1_p]$ and the subgame beneath it. In this game, the information set $I_p[s^1_p]$ is reached with probability $\beta = \sum_{s^1_e \notin s^1_p} b^{\pi_e}(s^1_e)$, while with probability $1-\beta$ the pursuer gets utility $\gamma$ without play; this accounts for the reward the pursuer gets if he catches the evader in the first round. The nature player simulates the belief $b[s^1_p]$ in the information set $I_p[s^1_p]$, so that the probability of every history in this information set, given that this set was reached, is identical to the original game. The value of the game $G[s^1_p]$ corresponds to the following part of Equation (4):

$$\gamma \overbrace{\sum_{s^1_e \in s^1_p} b^{\pi_e}(s^1_e)}^{\substack{\text{evader caught}\\ \text{in the first round}}} + \overbrace{\Big[\sum_{s^1_e \notin s^1_p} b^{\pi_e}(s^1_e)\Big]}^{\substack{\text{evader not caught}\\ \text{in the first round}}} \cdot \sum_{s^1_e \notin s^1_p}\sum_{s^0_e \notin s^0_p}\Bigg[\underbrace{\frac{b(s^0_e)\,\pi_e(s^0_e, s^1_e)}{\sum_{\tilde{s}^1_e \notin s^1_p}\sum_{\tilde{s}^0_e \notin s^0_p} b(\tilde{s}^0_e)\,\pi_e(\tilde{s}^0_e, \tilde{s}^1_e)}}_{\text{belief } b[s^1_p] \text{ of history } s^0_e s^1_p s^1_e \text{ in } I_p[s^1_p]} \cdot u(s^0_e s^1_p s^1_e)\Bigg] \qquad (5)$$
In the case of $G[s^1_p]$, there are multiple histories for every current position of the evader $s^1_e$ in the information set $I_p[s^1_p]$ (resulting from different initial locations of the evader $s^0_e$). We show that we need not account for the different initial positions of the evader, and thus all histories in $I_p[s^1_p]$ with the same current position of the evader $s^1_e$ can be merged. The resulting game contains a single history for each $s^1_e$ in $I_p[s^1_p]$, and thus this game is equivalent to a shorter-horizon game $G^{t-1}\langle s^1_p, b^{\pi_e}\rangle$ up to a multiplication of the utilities by $\gamma$ to account for the round that has already passed. This allows using the solution of $G^{t-1}\langle s^1_p, b^{\pi_e}\rangle$, represented by the value functions $v^{t-1}$, to express the value of $G[s^1_p]$.
Definition 3. Two deterministic game trees over nodes $H_1, H_2$ are isomorphic if there exists a bijection $\xi : H_1 \to H_2$ such that $v \in H_1$ is a successor of $u \in H_1$ if and only if $\xi(v)$ is a successor of $\xi(u)$, $n \in H_1$ is a pursuer's node if and only if $\xi(n)$ is a pursuer's node, it is a terminal node if and only if $\xi(n)$ is a terminal node, and the utilities satisfy $u(n) = u(\xi(n))$. Moreover, the trees have the same informational structure: two nodes $u, v \in H_1$ are in the same information set if and only if $\xi(u), \xi(v)$ are in the same information set.
We can observe that the subtrees of nodes $s^0_e s^1_p s^1_e$ and $\bar{s}^0_e s^1_p s^1_e$ (where $s^0_e$ and $\bar{s}^0_e$ stand for two different initial positions of the evader) are isomorphic, as we can establish a bijection $\xi(s^0_e s^1_p s^1_e h_{\mathrm{rest}}) = \bar{s}^0_e s^1_p s^1_e h_{\mathrm{rest}}$. The utility of terminal histories does not depend on the initial position of the evader (only on the time the evader was captured). Whenever a pursuer's node $u$ is in an information set $I_p$, node $\xi(u)$ is in $I_p$ as well (because the pursuer has no way to detect the evader's initial position). Moreover, whenever the evader cannot distinguish between two histories $s^0_e s^1_p s^1_e \cdots s^q_p$ and $s^0_e s^1_p s^1_e \cdots \hat{s}^q_p$, she cannot distinguish between the histories $\bar{s}^0_e s^1_p s^1_e \cdots s^q_p$ and $\bar{s}^0_e s^1_p s^1_e \cdots \hat{s}^q_p$ either (because her uncertainty relates to the pursuer's move at round $q$, which does not depend on the initial position of the evader). Thus the subtrees also have the same informational structure.
Lemma 3. Let $I$ be the topmost information set of $G[s^1_p]$ and let the belief $b[I]$ over nodes from $I$ be known and fixed. Let $n_1, n_2 \in I$ be two nodes whose subtrees are isomorphic. Then a game $G'$ with the same structure as $G$ and with any belief $b'[I]$ in $I$ satisfying $b[n_1] + b[n_2] = b'[n_1] + b'[n_2]$ and $b[n] = b'[n]$ for all nodes other than $n_1$ and $n_2$ has the same value as $G$.
Thanks to Lemma 3 and the isomorphism of the subtrees beneath $s^0_e s^1_p s^1_e$ and $\bar{s}^0_e s^1_p s^1_e$, the histories $s^0_e s^1_p s^1_e$ and $\bar{s}^0_e s^1_p s^1_e$ can be merged and the associated beliefs added up. By repeating this process, we end up with a single history for each current position of the evader $s^1_e$ (let $s^0_e s^1_p s^1_e$ be such a history), whose belief is

$$b'[s^1_p](s^0_e s^1_p s^1_e) := \frac{\sum_{s^0_e \notin s^0_p} b(s^0_e)\,\pi_e(s^0_e, s^1_e)}{\sum_{\tilde{s}^1_e \notin s^1_p}\sum_{\tilde{s}^0_e \notin s^0_p} b(\tilde{s}^0_e)\,\pi_e(\tilde{s}^0_e, \tilde{s}^1_e)} = \frac{b^{\pi_e}(s^1_e)}{\sum_{\tilde{s}^1_e \notin s^1_p} b^{\pi_e}(\tilde{s}^1_e)} \qquad (6)$$

and which we write as $b'[s^1_p](s^1_e)$ for short.

The updated belief $b'[s^1_p]$ in Equation (6) complies with the belief $b^{\pi_e}$ (Equation (3)) updated with the information that the evader is located in none of the vertices in $s^1_p$. The belief in $I_p[s^1_p]$ matches the belief in the topmost information set of $G^{t-1}\langle s^1_p, b^{\pi_e}\rangle$, and the resulting game is the same as $G^{t-1}\langle s^1_p, b^{\pi_e}\rangle$ up to a multiplication by $\gamma$. The value of $G[s^1_p]$ (Equation (5)), from which this game was derived, is thus $\gamma\, v^{t-1}_{s^1_p}(b^{\pi_e})$.
We substitute this value into Equation (4) to obtain

$$u(\emptyset) = \sum_{s^0_e \in s^0_p} b(s^0_e) + \Big[\sum_{s^0_e \notin s^0_p} b(s^0_e)\Big] \cdot \sum_{s^1_p} \pi_p(s^1_p) \cdot \gamma\, v^{t-1}_{s^1_p}(b^{\pi_e}) \qquad (7)$$

By allowing the players to choose their optimal one-step strategies $\pi_p$ and $\pi_e$ in Equation (7), we obtain the desired maximin formula from Equation (2).
3.1 Computing One-Step Strategies
The evaluation of Equation (2) involves the computation of the optimal strategies of the players. In this section we show that if the value functions $v^{t-1}$ are piecewise linear and convex and are represented by sets of α-vectors (which holds due to Theorem 2), the strategies can be found by means of linear programming.

Due to limited space, we provide the linear program for computing the optimal one-step strategy in $G^t\langle s^0_p, b\rangle$ for the pursuer only. At the beginning of each round, the pursuer realizes which vertices the evader is not located in, and hence updates his belief about the position of the evader. We thus restrict ourselves to the case where $b(s_e) = 0$ for all $s_e \in s^0_p$.

In the following linear program, the pursuer seeks a strategy maximizing his expected utility against the best-responding opponent. He considers strategies of the form "move to $s^1_p$ first and then follow the strategy whose value is represented by $\alpha \in v^{t-1}_{s^1_p}$". The choice of $\alpha$ uniquely defines such a strategy. The probability of playing each strategy $\alpha \in v^{t-1}_{s^1_p}$ is represented by the variable $\hat{\pi}_p(s^1_p, \alpha)$. Constraint (9) corresponds to the value of playing such a randomized strategy against the best-responding evader who starts in vertex $s_e$ ($\alpha(s^0_e)$ denotes the value of $\alpha$ evaluated at the pure belief corresponding to action $s^0_e$ of the evader). The evader starts in $s_e$ with probability $b(s_e)$, hence the objective (8) computes the expectation over the individual values $v(s_e)$. For the resulting one-step strategy of the pursuer, it holds that $\pi_p(s^1_p) = \sum_{\alpha \in v^{t-1}_{s^1_p}} \hat{\pi}_p(s^1_p, \alpha)$.

$$\max_{v,\,\hat{\pi}_p}\;\; \gamma \sum_{s_e \in V} b(s_e) \cdot v(s_e) \qquad (8)$$

$$\text{s.t.}\;\; \sum_{s^1_p \in \mathrm{adj}(s^0_p)}\;\sum_{\alpha \in v^{t-1}_{s^1_p}} \alpha(s^0_e) \cdot \hat{\pi}_p(s^1_p, \alpha) \geq v(s_e) \qquad \forall\{s_e, s^0_e\} \in E \qquad (9)$$

$$\sum_{s^1_p \in \mathrm{adj}(s^0_p)}\;\sum_{\alpha \in v^{t-1}_{s^1_p}} \hat{\pi}_p(s^1_p, \alpha) = 1 \qquad (10)$$

$$\hat{\pi}_p(s^1_p, \alpha) \geq 0 \qquad \forall s^1_p \in \mathrm{adj}(s^0_p),\; \forall\alpha \in v^{t-1}_{s^1_p} \qquad (11)$$
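A sketch of how the linear program (8)-(11) could be assembled with scipy.optimize.linprog is shown below (the solver minimizes, so the objective is negated). The data layout, the adj_list and next_alphas structures, and the vertex ordering are assumptions made for illustration, not the authors' implementation; the belief b is assumed to satisfy $b(s_e) = 0$ for $s_e \in s^0_p$ as discussed above.

```python
import numpy as np
from scipy.optimize import linprog

def one_step_lp(adj_list, s0_p, b, gamma, next_alphas):
    """Sketch of LP (8)-(11): the pursuer's optimal one-step strategy at belief b.
    next_alphas maps every pursuer move s1_p in adj(s0_p) to a list of
    alpha-vectors (numpy arrays over V) representing v^{t-1}_{s1_p}."""
    V = sorted(adj_list)                       # fixed vertex ordering
    moves = [(s1_p, k) for s1_p in next_alphas for k in range(len(next_alphas[s1_p]))]
    n_v, n_pi = len(V), len(moves)

    # (8): maximize gamma * sum_e b(s_e) v(s_e)  ->  minimize its negation
    c = np.concatenate([-gamma * np.array([b.get(s, 0.0) for s in V]), np.zeros(n_pi)])

    # (9): v(s_e) - sum_{s1_p, alpha} alpha(s0_e) * pi_hat(s1_p, alpha) <= 0
    #      for every evader position s_e and every adjacent s0_e
    A_ub, b_ub = [], []
    for i, s_e in enumerate(V):
        for s0_e in adj_list[s_e]:
            row = np.zeros(n_v + n_pi)
            row[i] = 1.0
            for j, (s1_p, k) in enumerate(moves):
                row[n_v + j] = -next_alphas[s1_p][k][V.index(s0_e)]
            A_ub.append(row)
            b_ub.append(0.0)

    # (10): pi_hat is a probability distribution; (11): pi_hat >= 0
    A_eq = [np.concatenate([np.zeros(n_v), np.ones(n_pi)])]
    bounds = [(0.0, 1.0)] * n_v + [(0.0, None)] * n_pi

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=[1.0], bounds=bounds)
    pi_hat = {m: res.x[n_v + j] for j, m in enumerate(moves)}
    return -res.fun, pi_hat    # value of the update and the randomized strategy
```

The marginal one-step strategy is then recovered from the result as $\pi_p(s^1_p) = \sum_{\alpha} \hat{\pi}_p(s^1_p, \alpha)$, as stated above.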
3.2 Computing Value Functions
In each iteration of our value iteration algorithm, the value functions $v^t$ are constructed from the solution of the previous iteration, the value functions $v^{t-1}$. By repeating this construction, a sequence of finite-horizon value functions $\{v^t\}_{t=0}^{\infty}$ approaching the values of the infinite-horizon game is constructed. The value functions $v^t$ to be constructed, as well as $v^{t-1}$, are PWLC (Theorem 2). We show that this allows us to avoid evaluating the dynamic programming operator $H$ (Equation (2)) at every point of the belief space and enables us to construct $v^t$ using only a finite subset of beliefs: the extreme points of the line segments of $v^t$. We proceed in two steps: (1) first, we compute a function $Q^t_{\pi_p}\langle s^0_p\rangle$ corresponding to the expected utility the pursuer gets if he plays $\pi_p$ in the first round of the longer-horizon game $G^t\langle s^0_p, b\rangle$; (2) then we show how to compute $v^t_{s^0_p}$ as a combination of multiple $Q^t_{\pi_p}\langle s^0_p\rangle$ for properly chosen one-step strategies $\pi_p$. We start with a formal definition of the function $Q^t_{\pi_p}\langle s^0_p\rangle$.
Definition 4. Let $\pi_p$ be a pursuer's one-step strategy for the first round of the game $G^t\langle s^0_p, b\rangle$. The value of $\pi_p$ is a function $Q^t_{\pi_p}\langle s^0_p\rangle$ assigning the expected reward the pursuer gets in the game $G^t\langle s^0_p, b\rangle$ against the best-responding opponent when he plays $\pi_p$ in the first round and continues by playing according to his optimal strategy in the rest of the game, i.e.

$$Q^t_{\pi_p}\langle s^0_p\rangle(b) := \sum_{s_e \in s^0_p} b(s_e) + \gamma\Big[\sum_{s_e \in V\setminus s^0_p} b(s_e)\Big] \cdot \min_{\pi_e} \sum_{s^1_p \in V^N} \pi_p(s^1_p) \cdot v^{t-1}_{s^1_p}(b^{\pi_e}) \qquad (12)$$

According to the previous definition, once the first round of the game is over, the pursuer continues with his optimal strategy. The following lemma shows that this optimal strategy for the rest of the game can be characterized by the α-vectors of $v^{t-1}$.
Lemma 4. Let $\pi_p$ be a pursuer's fixed one-step strategy for the first round of the game. For every belief $b$ there are strategies $\sigma_p[s^1_p]$, one for each $s^1_p \in \mathrm{adj}(s^0_p)$, represented by α-vectors $\alpha[s^1_p] \in v^{t-1}_{s^1_p}$, such that it is optimal to follow $\sigma_p[s^1_p]$ when $s^1_p$ was played in the first round of the game. The value of the strategy $\sigma_p$ prescribing the pursuer to play according to $\pi_p$ in the first round and continue by using the respective $\sigma_p[s^1_p]$ is linear, and the corresponding α-vector satisfies

$$\alpha_{\sigma_p}(s_e) = \begin{cases} 1 & s_e \in s^0_p \\ \gamma \min_{s^0_e \in \mathrm{adj}(s_e)} \sum_{s^1_p} \pi_p(s^1_p) \cdot \alpha[s^1_p](s^0_e) & \text{otherwise} \end{cases} \qquad (13)$$

Lemma 4 gives us a direct algorithm for computing $Q^t_{\pi_p}$. The PWLC functions $v^{t-1}$ correspond to a finite number of horizon-$(t-1)$ strategies, represented by a finite number of α-vectors. There is only a finite number of ways to choose the strategies $\sigma_p[s^1_p]$ from Lemma 4, which can be found by means of enumeration. The maximization over the linear functions representing the values of such strategies corresponds to the function $Q^t_{\pi_p}\langle s^0_p\rangle$, which is thus piecewise linear and convex.
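To make the construction of Lemma 4 concrete, the sketch below evaluates Equation (13) for one fixed choice of continuation α-vectors; the dict-based representation (an α-vector as a mapping from vertices to values) is an illustrative assumption, not the authors' data structure.

```python
def compose_alpha(adj_list, s0_p, pi_p, chosen_alphas, gamma, V):
    """Eq. (13): alpha-vector of the strategy that plays pi_p in the first
    round of G^t<s0_p, b> and then, after pursuer move s1_p, follows the
    strategy represented by chosen_alphas[s1_p] (an alpha-vector of v^{t-1}_{s1_p})."""
    alpha = {}
    for s_e in V:
        if s_e in s0_p:                                # evader caught immediately
            alpha[s_e] = 1.0
        else:
            # the best-responding evader moves to the adjacent vertex that
            # minimizes the expectation over the pursuer's randomized move
            alpha[s_e] = gamma * min(
                sum(pi_p[s1_p] * chosen_alphas[s1_p][s_next] for s1_p in pi_p)
                for s_next in adj_list[s_e])
    return alpha
```

Enumerating all possible choices of chosen_alphas and keeping the resulting vectors gives a set of α-vectors whose pointwise maximum is $Q^t_{\pi_p}\langle s^0_p\rangle$.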
The definition of $Q^t_{\pi_p}\langle s^0_p\rangle$ implies that we can compute the value function $v^t_{s^0_p}$ by allowing the pursuer to play an arbitrary one-step strategy $\pi_p$, when

$$v^t_{s^0_p}(b) = \max_{\pi_p} Q^t_{\pi_p}\langle s^0_p\rangle(b) \qquad (14)$$

As a consequence of Theorem 1, it is sufficient to consider a finite set $\Pi_p$ of strategies in the maximizer of Equation (14) and obtain $v^t_{s^0_p}$ as the pointwise maximum of the respective $Q^t_{\pi_p}\langle s^0_p\rangle$ functions, $v^t_{s^0_p} = \bigoplus_{\pi_p \in \Pi_p} Q^t_{\pi_p}\langle s^0_p\rangle$. The set of such strategies is, however, initially unknown. We propose Algorithm 1, which constructs both the set of strategies $\hat{\Pi}_p$ and the value function $\hat{v}^t_{s^0_p}$ incrementally by iteratively verifying whether the current set $\hat{\Pi}_p$ is sufficient for obtaining the actual value function $v^t_{s^0_p}$.

  $\hat{v}^t_{s^0_p} \leftarrow \{0^{|V|}\}$; $\hat{\Pi}_p \leftarrow \emptyset$
  while $\exists b \in \Delta(V) : v^t_{s^0_p}(b) > \hat{v}^t_{s^0_p}(b)$ do
    $\pi_p \leftarrow$ optimal strategy of the pursuer at belief $b$ for the first round (see (8))
    $\hat{\Pi}_p \leftarrow \hat{\Pi}_p \cup \{\pi_p\}$
    $\hat{v}^t_{s^0_p} \leftarrow \hat{v}^t_{s^0_p} \oplus Q^t_{\pi_p}\langle s^0_p\rangle$
  return $\hat{v}^t_{s^0_p}$
Algorithm 1: Incremental construction of $v^t_{s^0_p}$.

The algorithm constructs a set of strategies $\hat{\Pi}_p$ and a corresponding estimate of the value function $\hat{v}^t_{s^0_p}$, starting with an empty $\hat{\Pi}_p$. In each iteration, it verifies whether the strategies in $\hat{\Pi}_p$ used to form the current $\hat{v}^t_{s^0_p}$ are optimal in every belief $b \in \Delta(V)$. If a belief $b$ where the strategy can be improved is found, i.e. $Q^t_{\pi_p}\langle s^0_p\rangle(b) > \hat{v}^t_{s^0_p}(b)$ for some $\pi_p$, it updates $\hat{\Pi}_p$ and recomputes $\hat{v}^t_{s^0_p}$. If no such belief $b$ exists, all required strategies were considered and $\hat{v}^t_{s^0_p} = v^t_{s^0_p}$.

Whenever the value function $\hat{v}^t_{s^0_p}$ is not yet optimal for all beliefs, i.e. there exists a belief $b$ where $v^t_{s^0_p}(b) > \hat{v}^t_{s^0_p}(b)$, there also exists a belief $b'$ with the same property that forms an extreme point of a line segment of $\hat{v}^t_{s^0_p}$. This is characterized by Lemma 5.

Lemma 5. If there is a belief $b$ where $v^t_{s^0_p}(b) > \hat{v}^t_{s^0_p}(b)$, there must be a belief $b'$ that forms an extreme point of a line segment on the surface of $\hat{v}^t_{s^0_p}$ where $v^t_{s^0_p}(b') > \hat{v}^t_{s^0_p}(b')$.

Thanks to Lemma 5, we can consider only the finite set of beliefs that form extreme points of the line segments of the value function $\hat{v}^t_{s^0_p}$. In every iteration, a one-step strategy that is optimal at some belief point (and thus must be present in $\Pi_p$) is added to $\hat{\Pi}_p$. Due to Theorem 1, the set $\Pi_p$ required to obtain the optimal value function $v^t_{s^0_p}$ is finite. Hence, after a finite number of iterations, Algorithm 1 terminates.
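To summarize Section 3.2, a high-level sketch of how Algorithm 1 could be organized in code is given below. The helper functions passed as arguments (find_improvable_belief, optimal_one_step_strategy, q_function_alphas) are assumptions standing in for the extreme-point search of Lemma 5, the linear program (8)-(11), and the α-vector compositions of Lemma 4, respectively; they are not spelled out here.

```python
def construct_value_function(s0_p, v_prev, gamma, game,
                             find_improvable_belief,     # Lemma 5 search (assumed)
                             optimal_one_step_strategy,  # LP (8)-(11) (assumed)
                             q_function_alphas):         # Lemma 4 compositions (assumed)
    """One application of the DP operator H for a fixed pursuer position s0_p:
    Algorithm 1 collects one-step strategies incrementally until no belief
    remains at which the current estimate hat-v can be improved."""
    alphas = [{s_e: 0.0 for s_e in game.vertices}]   # hat v^t <- {0^{|V|}}
    strategies = []                                   # hat Pi_p <- empty set
    while True:
        b = find_improvable_belief(s0_p, alphas, v_prev, gamma, game)
        if b is None:              # no improvable extreme point: hat v^t = v^t
            return alphas, strategies
        pi_p = optimal_one_step_strategy(s0_p, b, v_prev, gamma, game)
        strategies.append(pi_p)                       # hat Pi_p <- hat Pi_p + {pi_p}
        alphas += q_function_alphas(s0_p, pi_p, v_prev, gamma, game)  # add Q^t_{pi_p}
```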
3.3 Convergence of the Algorithm
We demonstrate the convergence of our value iteration algorithm by showing that the dynamic programming operator $H$ (Equation (2)) has a unique fixpoint which is reached by its iterative application. We obtain this by showing that $H$ is a contraction mapping under the following max-norm and applying Banach's fixed point theorem (Ciesielski et al., 2007).

$$\|v - v'\| = \max_{s^0_p \in V^N}\;\max_{b \in \Delta(V)} |v_{s^0_p}(b) - v'_{s^0_p}(b)| \qquad (15)$$

Lemma 6. The operator $H$ is a contraction with contractivity factor $\gamma < 1$ under the max-norm.

Theorem 4. There is a unique set of value functions $v^*$ satisfying $v^* = Hv^*$, and the recursive application of $H$ converges to $v^*$. The series $\{v^t\}_{t=0}^{\infty}$ thus converges to the value functions of the infinite-horizon game.

Proof. The operator $H$ is a contraction mapping defined on a metric space of sets of bounded functions defined on the belief space. By applying Banach's fixed point theorem (Ciesielski et al., 2007) we get that $H$ has a unique fixed point $v^*$ and that the recursive application of $H$ converges to $v^*$.

Proposition 1. After $t$ iterations of the value iteration algorithm it holds that $\|v^t - v^*\| \leq \gamma^t$.
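Proposition 1 yields a direct stopping rule: to guarantee accuracy $\varepsilon$ it suffices to run $t \geq \log\varepsilon / \log\gamma$ iterations. A small illustrative computation (not from the paper):

```python
import math

def iterations_needed(gamma, eps):
    """Smallest t with gamma**t <= eps, i.e. ||v^t - v*|| <= eps (Proposition 1)."""
    return math.ceil(math.log(eps) / math.log(gamma))

print(iterations_needed(0.95, 1e-3))   # 135 iterations
```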
4 CONCLUSIONS
We present the first algorithm for solving the class
of two-player discounted pursuit-evasion games with
infinite horizon and partial observability, where the
evader is assumed to be perfectly informed about the
current state of the game (i.e. position of pursuer’s
units). This class of games has a significant relevance
in security domains where a robust strategy that pro-
vides guarantees in the worst case is often desirable.
Our algorithm is a modification of the well-known
value iteration algorithm for solving Partially Ob-
servable Markov Decision Processes (POMDPs), or
stochastic games with concurrent moves. We show
that the strategies can be compactly represented us-
ing value functions that depend on the location of the
pursuing units and the belief about the position of the
evader, but not explicitly on the history of moves.
These value functions are piecewise linear and con-
vex and allow us to design a dynamic programming
operator for the value iteration algorithm.
Our work is the first step towards many practical
algorithms for solving discounted stochastic games
with one-sided partial observability. These can be
applied in many scenarios requiring robust strategies
and thus our work opens a whole new area of research in algorithmic and computational game theory.
One natural continuation is an adaptation of point-
based approximation algorithms for POMDPs to im-
prove the scalability of the value iteration algorithm.
ACKNOWLEDGEMENTS
This research was supported by the Czech Science
Foundation (grant no. 15-23235S) and by the Grant
Agency of the Czech Technical University in Prague,
grant No. SGS16/235/OHK3/3T/13.
REFERENCES
Chung, T. H., Hollinger, G. A., and Isler, V. (2011). Search and pursuit-evasion in mobile robotics. Autonomous Robots, 31(4):299–316.

Ciesielski, K. et al. (2007). On Stefan Banach and some of his results. Banach Journal of Mathematical Analysis, 1(1):1–10.

Hansen, E. A., Bernstein, D. S., and Zilberstein, S. (2004). Dynamic programming for partially observable stochastic games. In AAAI, volume 4, pages 709–715.

Koller, D., Megiddo, N., and Von Stengel, B. (1996). Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(2):247–259.

McEneaney, W. M. (2004). Some classes of imperfect information finite state-space stochastic games with finite-dimensional solutions. Applied Mathematics and Optimization, 50(2):87–118.

Monahan, G. E. (1982). State of the art: a survey of partially observable Markov decision processes: theory, models, and algorithms. Management Science, 28(1):1–16.

Pineau, J., Gordon, G., Thrun, S., et al. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, pages 1025–1032.

Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100.

Smallwood, R. D. and Sondik, E. J. (1973). The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5):1071–1088.

Smith, T. and Simmons, R. (2012). Point-based POMDP algorithms: Improved analysis and implementation. arXiv preprint arXiv:1207.1412.

Vanderbei, R. J. (2014). Linear Programming. Springer.

Vidal, R., Shakernia, O., Kim, H. J., Shim, D. H., and Sastry, S. (2002). Probabilistic pursuit-evasion games: theory, implementation, and experimental evaluation. IEEE Transactions on Robotics and Automation, 18(5):662–669.