present a simple illustrative model and the associated conditions. We then show theoretically that, for this model, the effect of lossy communication on the multi-agent learning algorithm vanishes asymptotically. Therefore, the asymptotic guarantees for the algorithm with and without communication are identical. Moreover, we show that infinitely many global state-action pairs reach each agent.
We apply our algorithm to a simple but illustrative two-agent water filter flow control problem, in which two agents have to control the inflow to and outflow from a water filter. The agents have no information about the dynamics of the flow problem and no prior information about the strategy of the other agent. However, the agents are allowed to communicate over a network, which enables the use of our algorithm. Our simulations show how the communication network influences the rate of convergence of our multi-agent algorithm.
2 BACKGROUND ON RL IN
CONTINUOUS ACTION SPACES
In this section we recall preliminaries on RL in continuous action spaces. In RL, an agent interacts with an environment E that is modeled as a Markov decision process (MDP) with state space S, action space A, Markov transition kernel P, and scalar reward function r(s,a). It does so by taking actions via a policy µ : S → A. The most common objective in RL is to maximize the discounted infinite-horizon return R_n = ∑_{t=n}^{∞} α^{t−n} r(s_t, a_t) with discount factor α ∈ [0,1). The associated action-value function is defined as Q(s,a) = E_{µ,E}[R_1 | s_1 = s, a_1 = a]. The optimal action-value
function is characterized by the solution of the Bellman equation

Q∗(s,a) = E_E[ r(s,a) + α max_{a′∈A} Q∗(s′, a′) | s, a ].    (1)
Q-learning with function approximation seeks to approximate Q∗ by a parametrized function Q(s,a; θ) with parameters θ. For an observed tuple (s,a,r,s′), this can be done by performing a gradient descent step to minimize the squared Bellman loss

( Q(s,a; θ) − (r(s,a) + α max_{a′∈A} Q(s′,a′; θ)) )².    (2)
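As an illustration, the following is a minimal PyTorch-style sketch of one such gradient step; the network architecture, the function name bellman_loss_step, and the finite candidate_actions set standing in for the max over A are our own assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical Q-network Q(s, a; theta): state and action are concatenated.
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


def bellman_loss_step(q_net, optimizer, s, a, r, s_next, candidate_actions, alpha=0.99):
    """One gradient-descent step on the squared Bellman loss (2).

    `candidate_actions` is a finite set of actions standing in for the max
    over a' in A; for continuous actions, DDPG replaces this max by the actor.
    Usage (assumption): optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3).
    """
    with torch.no_grad():
        q_next = torch.stack([q_net(s_next, a_p) for a_p in candidate_actions])
        target = r + alpha * q_next.max(dim=0).values   # r + alpha * max_a' Q(s', a')
    loss = ((q_net(s, a) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```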
When the action space A is discrete, π(s) = argmax_{a∈A} Q(s,a; θ), ∀s ∈ S, defines a greedy policy with respect to Q(s,a; θ). When A is continuous this is not possible, but instead we can use Q(s,a; θ) to improve a parametrized policy µ(s; φ) by taking gradient steps in the direction of policy improvement with respect to the Q-function:

∇_φ µ(s; φ) ∇_a Q(s,a; θ)|_{a=µ(s;φ)}.    (3)
In (Silver et al., 2014) it was shown that the expectation of (3) w.r.t. the discounted state distribution induced by a behaviour policy¹ is indeed the gradient of the expected return of the policy µ. In practice, however, we usually do not sample from this discounted state distribution (Nota and Thomas, 2020), but still use (3) with samples from the history of interaction. When the parametrized functions in (2) and (3) are deep neural networks, the corresponding algorithm is then known as the deep deterministic policy gradient algorithm (DDPG) (Lillicrap et al., 2016).
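To make this step concrete, here is a hedged sketch of one actor update in the direction of (3), assuming a PyTorch policy network mu_net for µ(s; φ) and the Q-network sketched above; differentiating Q(s, µ(s; φ); θ) with respect to φ reproduces exactly the product of gradients in (3). The name actor_step and the use of an optimizer built only over the policy parameters are our assumptions.

```python
import torch

def actor_step(mu_net, q_net, actor_optimizer, s):
    """One policy-improvement step in the direction of (3) (sketch).

    Differentiating Q(s, mu(s; phi); theta) w.r.t. phi yields
    grad_phi mu(s; phi) * grad_a Q(s, a; theta) |_{a = mu(s; phi)},
    so autograd applies the chain rule in (3) for us.
    `actor_optimizer` is assumed to contain only mu_net.parameters(),
    so the critic parameters theta are not changed by this step.
    """
    actor_loss = -q_net(s, mu_net(s)).mean()   # ascend on Q  <=>  descend on -Q
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```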
3 DECENTRALIZED POLICY
GRADIENT ALGORITHM
This section describes the adaptation of the DDPG al-
gorithm to the decentralized multi-agent setting.
3.1 Two-agent DDPG with Delayed
Communication
To simplify the presentation, we consider only two agents, 1 and 2. Consider an MDP as defined in the previous section and assume that agents 1 and 2 have local action spaces A_1 and A_2, respectively, such that A = A_1 × A_2 is the global action space. Further, both agents can observe local state trajectories s_1^n ∈ S_1 and s_2^n ∈ S_2, such that S = S_1 × S_2 is the global state space. Finally, we consider a global reward signal r^n at every time step n ≥ 0 that is observable by both agents.
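For concreteness, a minimal sketch of this factorization with hypothetical names (env, mu1, mu2): each agent acts only on its own state component, and both receive the same global reward.

```python
# Hypothetical two-agent interface for the factorization above:
# S = S_1 x S_2, A = A_1 x A_2, one global reward r^n seen by both agents.
def joint_step(env, s1, s2, mu1, mu2):
    a1 = mu1(s1)                                 # agent 1 acts on its local state only
    a2 = mu2(s2)                                 # agent 2 acts on its local state only
    (s1_next, s2_next), r = env.step((a1, a2))   # global transition, shared reward
    return s1_next, s2_next, r
```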
The objective is that the agents learn local policies µ_1(s_1; φ_1) and µ_2(s_2; φ_2), parametrized by φ_1 and φ_2, to maximize the accumulated discounted reward. The global policy is therefore given by µ = (µ_1, µ_2). We propose to use the DDPG algorithm locally at every agent to train the policies µ_1 and µ_2. However, to execute the DDPG algorithm locally, each agent requires access to the parameters φ_1^n and φ_2^n at every time step n ≥ 0. These parameter vectors are inherent local information of the agents and therefore are not a priori available at the corresponding other agent.
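The following sketch (hypothetical names, PyTorch assumed, no target networks) illustrates the point: to perform a local DDPG update, agent 1 has to form the global state and action and evaluate the global policy, which requires a locally stored copy of agent 2's policy parameters.

```python
import torch

def local_ddpg_update_agent1(q1_net, mu1_net, mu2_copy, critic_opt, actor_opt,
                             s1, s2, a1, a2, r, s1_next, s2_next, alpha=0.99):
    """One local DDPG update at agent 1 (sketch, hypothetical names).

    `mu2_copy` is agent 1's locally stored copy of agent 2's policy; it is
    never trained here. Without it, neither the bootstrap target nor the
    actor step can be evaluated, which is why phi_2 must be made available
    to agent 1.
    """
    s = torch.cat([s1, s2], dim=-1)              # global state
    a = torch.cat([a1, a2], dim=-1)              # executed global action
    s_next = torch.cat([s1_next, s2_next], dim=-1)

    # Critic step: squared Bellman loss (2) with the global policy (mu_1, copy of mu_2).
    with torch.no_grad():
        a_next = torch.cat([mu1_net(s1_next), mu2_copy(s2_next)], dim=-1)
        target = r + alpha * q1_net(s_next, a_next)
    critic_loss = ((q1_net(s, a) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor step: improve mu_1 in the direction of (3), holding the copy of mu_2 fixed.
    actor_loss = -q1_net(s, torch.cat([mu1_net(s1), mu2_copy(s2)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```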
Let us therefore suppose that the agents use a communication network to exchange φ_1^n and φ_2^n. Since communication networks typically induce information delays, only φ_2^{n−τ_1(n)} and φ_1^{n−τ_2(n)} are available at agent 1 and agent 2, respectively. Here, τ_1(n) denotes the
¹The behaviour policy is typically µ distorted by a noise process to encourage exploration.