Another challenge is that agents can operate in a discrete action space (e.g., up, down, left, right) or a continuous action space (e.g., velocity). The complexity of the problem increases when agents have continuous actions, because large action spaces are difficult to explore efficiently and can make training intractable.
In this work we consider the pursuit-evasion, or predator-prey, problem and use a deep reinforcement learning technique to solve it. Pursuit-
evasion is a problem where a group of agents col-
lectively try to capture one or multiple evaders while
the evaders try to avoid getting caught. Our goal is
to train agents to make decentralized decisions and
display swarm-like behavior. For our approach we
use the Multi-Agent DDPG (MADDPG) algorithm
introduced by (Lowe et al., 2017). MADDPG ex-
tends DDPG (Lillicrap et al., 2015) to the multi-agent
setting during training, potentially resulting in much
richer behavior between agents. MADDPG is an actor-critic approach: this paper describes a centralized multi-agent training algorithm that leads to decentralized individual policies. Each agent has access to all other agents' state observations and actions while its critic is trained, but selects actions using only its own state observations during execution.
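To make the centralized-critic, decentralized-actor structure concrete, the sketch below (a minimal illustration of the idea, not the architecture or code used in this work) shows the input signatures of the two networks, assuming PyTorch and placeholder sizes obs_dim, act_dim, and n_agents:

```python
# Minimal sketch of MADDPG's network inputs (illustration only, not the paper's code).
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Decentralized actor: maps one agent's own observation to its action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # continuous action in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)


class CentralizedCritic(nn.Module):
    """Centralized critic: during training it scores the joint state-action,
    i.e., it sees every agent's observation and action."""
    def __init__(self, obs_dim, act_dim, n_agents, hidden=64):
        super().__init__()
        joint_dim = n_agents * (obs_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(joint_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar Q-value for the joint state-action
        )

    def forward(self, all_obs, all_actions):
        # all_obs: (batch, n_agents * obs_dim); all_actions: (batch, n_agents * act_dim)
        return self.net(torch.cat([all_obs, all_actions], dim=-1))
```

At execution time only the actor is evaluated, one copy per agent, which is what makes the resulting policies decentralized.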
2 METHODOLOGY
In this section we give an intuitive explanation of the
theory behind reinforcement learning and then intro-
duce the recent developments in deep reinforcement
learning implemented herein.
2.1 Reinforcement Learning
Reinforcement Learning (RL) is a goal-oriented, reward-based learning technique. In RL, an agent interacts with an environment in discrete time-steps: at each time-step the agent observes the environment, takes an action, and receives a numeric reward based on that action. The goal of RL is to learn a good strategy (policy) for the agent from experimental trials and the relatively simple feedback it receives (the reward signal). With the learned strategy, the agent is able to actively adapt to the environment to maximize future rewards. Figure 1 shows the RL framework.
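As a minimal sketch of this interaction loop (our illustration, assuming a Gym-style environment with the classic reset()/step() interface and an arbitrary policy function):

```python
# Minimal sketch of the agent-environment loop in Figure 1 (illustration only).
# Assumes a Gym-style environment: reset() returns s_0, step(a) returns
# (s_{t+1}, r_{t+1}, done, info).
def run_episode(env, policy, max_steps=200):
    state = env.reset()                          # observe the initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                   # choose a_t from the policy
        state, reward, done, _ = env.step(action)  # receive s_{t+1} and r_{t+1}
        total_reward += reward
        if done:                                 # episode terminated
            break
    return total_reward
```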
The RL framework can be formalized using a Markov Decision Process (MDP) defined by a set of states $S$, a set of actions $A$, an initial state distribution $p(s_0)$, a reward function $r : S \times A \mapsto \mathbb{R}$, transition probabilities $P(s_{t+1} \mid s_t, a_t)$, and a discount factor $\gamma$.
Figure 1: Reinforcement learning framework, simplified system diagram based on (Sutton and Barto, 2018).
The agents take actions according to their policy $\pi_\theta$, parameterized by $\theta$, which can be either deterministic or stochastic. A deterministic policy specifies, for every state, a single well-defined action to take, whereas a stochastic policy defines a probability distribution over the possible actions from which an action is sampled in each state. A value function measures the goodness of a state, or of a state-action pair, by predicting the expected future reward. The goal of the agent is to learn an optimal policy that tells it which actions to take in order to maximize its own total expected reward $R_i = \sum_{t=0}^{T} \gamma^t r_t^i$, where $0 < \gamma < 1$.
The discount factor penalizes rewards that lie further in the future, because those rewards carry higher uncertainty.
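As a small worked example (ours, not taken from the paper), the discounted return can be computed directly from a recorded reward sequence; the function below is a straightforward transcription of the sum above:

```python
def discounted_return(rewards, gamma=0.95):
    """Compute R = sum_{t=0}^{T} gamma^t * r_t for one agent's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Later rewards contribute less to the return: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```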
To learn an optimal policy, Richard Bellman, an American applied mathematician, derived the Bellman equations, which allow us to solve MDPs. He made use of the state-value function, denoted by

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ R_t \mid s_t = s \right] \qquad (1)$$
and the action-value function denoted by
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ R_t \mid s_t = s,\; a_t = a \right] \qquad (2)$$
to derive the Bellman equations. The state-value function specifies the expected return from a state $s_t$ when following policy $\pi$, whereas the action-value function specifies the expected return when choosing action $a_t$ in state $s_t$ and thereafter following policy $\pi$.
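To illustrate the definitions in Eqs. (1) and (2) (this example is ours, not part of the original method), the expectations can be approximated by averaging the discounted returns observed in episodes collected under a fixed policy:

```python
from collections import defaultdict


def mc_value_estimates(episodes, gamma=0.95):
    """Every-visit Monte Carlo estimates of V(s) and Q(s, a).

    `episodes` is a list of trajectories, each a list of (state, action, reward)
    tuples collected while following some fixed policy; states and actions must
    be hashable.
    """
    v_returns = defaultdict(list)
    q_returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g holds the discounted return from each time-step.
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            v_returns[state].append(g)
            q_returns[(state, action)].append(g)
    v = {s: sum(rs) / len(rs) for s, rs in v_returns.items()}
    q = {sa: sum(rs) / len(rs) for sa, rs in q_returns.items()}
    return v, q
```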
Once we have the optimal value functions, we can obtain the optimal policy, which satisfies the Bellman optimality equations:
$$V^{*}(s) = \max_{a \in A} \sum_{s', r} P(s', r \mid s, a)\left[ r + \gamma V^{*}(s') \right] \qquad (3)$$
$$Q^{*}(s, a) = \sum_{s', r} P(s', r \mid s, a)\left[ r + \gamma \max_{a' \in A} Q^{*}(s', a') \right] \qquad (4)$$
The common approaches to RL are Dynamic
Programming (DP), Monte Carlo (MC) methods,
Temporal-Difference (TD) learning, and Policy Gra-
dient (PG) methods. If we have complete knowledge of the environment, i.e., of all the MDP variables, we can follow the Bellman equations and use DP to iteratively evaluate value functions and improve the policy. DP methods are known as model-based methods