used to train the DQN by the gradient descent algorithm (LeCun and Bengio, 2015). Ideally, the training of the DQN should use all data in each iteration; however, this is very expensive when the training set is huge. An efficient way is to use a random subset of the training set, called a mini-batch, to evaluate the gradients in each iteration (M and T., 2014). The loss function of the DQN for a random mini-batch $D_t$ (randomly sampled from $D$) at time step $t$ is:
$$
L(\boldsymbol{\xi}_t) = \sum_{e \in D_t} \Big( r + \omega \max_{a'} Q\big(s', a', \hat{\boldsymbol{\xi}}\big) - Q\big(s, a, \boldsymbol{\xi}_t\big) \Big)^2 \qquad (13)
$$
where $e = (s, a, r, s')$ and $\hat{\boldsymbol{\xi}}$ represents the network parameters used to compute the target at time step $t$, which are only updated every $C$ steps; see details in (Mnih and Kavukcuoglu, 2013). Finally, the stochastic gradient descent algorithm is used to update $\boldsymbol{\xi}$ over the mini-batch $D_t$.
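As an illustration of (13) and the periodic target-network update, the following is a minimal sketch assuming a PyTorch implementation; the class, function, and hyperparameter names (QNetwork, dqn_loss, C_STEPS) are ours for exposition and do not come from the referenced works.

```python
import torch
import torch.nn as nn

# Hypothetical Q-network: maps a state vector to Q-values of discrete actions.
class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions))

    def forward(self, s):
        return self.net(s)

def dqn_loss(policy_net, target_net, batch, gamma):
    """Mini-batch loss of (13): sum over e = (s, a, r, s') of the squared TD
    error, with the target computed from the separate target parameters."""
    s, a, r, s_next = batch                                   # mini-batch D_t
    q = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a, xi_t)
    with torch.no_grad():                                     # target uses hat{xi}
        q_next = target_net(s_next).max(dim=1).values         # max_a' Q(s', a', hat{xi})
    target = r + gamma * q_next
    return ((target - q) ** 2).sum()

# Every C_STEPS gradient updates, copy the trained parameters into the target network:
# if step % C_STEPS == 0:
#     target_net.load_state_dict(policy_net.state_dict())
```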
3.2 DQN Power Allocation
Typically, DQNs are well suited to solving problems that can be modeled as Markov decision processes, where the goal of the agent is to maximize the cumulative reward (Mnih and Kavukcuoglu, 2013). The power allocation in CF mmWave massive MIMO, according to the policy formulated in (5), can be seen as such a Markov decision process: the large-scale fading changes according to the mobility of UEs over time, which, in our case, can be modeled as a Markov process (see Section 4). For a non-Markov process, due to the high correlation between the current state and the previous several states, the updates of the DQN have large variances, leading to inefficient training (Mnih and Kavukcuoglu, 2013). How to reduce the variance of the DQN updates remains an open issue for non-Markovian problems.
To use the DQN to solve the power allocation problem in CF mmWave massive MIMO, we define the duration of each time step $t$ as the large-scale time in Fig. 2. In scenarios with fixed-position UEs, the discount factor $\omega$ was suggested to be zero (Meng.F and P., 2019). However, in our case, since we consider mobile UEs, we determine $\omega$ by trial-and-error; see Section 4.2.
As in (Nasir and Guo, 2019) and (Meng and Chen, 2020), we define an agent for each AP-UE link; thus the power allocation is performed by a multi-agent system, where each agent contains a DQN. The agents interact with the environment to collect data $(s_t, a_t, r_t, s_{t+1})$ and store it in a dataset at the CC; then, by mini-batch sampling, the DQN is trained using the gradient descent algorithm, as shown in Fig. 3. Since the learning is done off-line, the overhead cost of the data collection does not affect the operational phase.
It is unnecessary to have 'real-time' training in the off-line learning mode. The training time, however, could be considerable. We should also clarify that in the off-line learning mode the DQN is trained on a sufficiently large dataset, whose size depends on the convergence of the sum-SE. Once the training is finished, the DQN is used to perform the power allocation; no further training is needed. However, when there are significant changes in the network configuration, e.g., the number of active APs, or in the temporal and spatial traffic characteristics, the DQN should be retrained. When that should happen and what impact it would have on the operation of the system has not been addressed yet and requires further study.
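As a rough sketch of the off-line procedure described above (and depicted in Fig. 3), the agents' transitions could be gathered in a central dataset at the CC and then used for mini-batch gradient descent; the names (ReplayDataset, BATCH_SIZE, NUM_STEPS) are illustrative assumptions, not the exact setup of the paper.

```python
import random
import torch

class ReplayDataset:
    """Central dataset at the CC storing experiences (s, a, r, s') of all agents."""
    def __init__(self):
        self.data = []

    def add(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Random mini-batch D_t drawn over the stored experiences.
        batch = random.sample(self.data, batch_size)
        s, a, r, s_next = zip(*batch)
        return (torch.stack(s), torch.tensor(a),
                torch.tensor(r, dtype=torch.float32), torch.stack(s_next))

# Off-line training loop (hypothetical hyperparameters):
# optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-3)
# for step in range(NUM_STEPS):
#     batch = dataset.sample(BATCH_SIZE)
#     loss = dqn_loss(policy_net, target_net, batch, gamma=GAMMA)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
#     if step % C_STEPS == 0:
#         target_net.load_state_dict(policy_net.state_dict())
```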
There are a total of $NK$ agents in the whole system. At time step $t$, each agent $(n,k)$ allocates power from AP $n$ to UE $k$. One should note that all agents use the same DQN parameters, i.e., after the DQN is trained on the experience of all agents, it shares its parameters with all agents to allocate power.
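To illustrate the parameter sharing, a minimal sketch follows, assuming (as is usual for DQNs) that the transmission power is quantized into a discrete set of levels; the function and variable names are hypothetical.

```python
import torch

def allocate_power(policy_net, states, power_levels):
    """All N*K agents share the same trained DQN: each agent (n, k) feeds its
    own state s^t_{n,k} into the shared network and picks the power level with
    the largest Q-value. `power_levels` is an assumed discrete power grid."""
    with torch.no_grad():
        q_values = policy_net(states)        # shape (N*K, num_actions)
        actions = q_values.argmax(dim=1)     # greedy action per agent
    return power_levels[actions]             # p^t_{n,k} for every AP-UE link
```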
We define $e^t_{n,k} = (s^t_{n,k}, a^t_{n,k}, r^t_{n,k}, s^{t+1}_{n,k})$ as the experience sequence of agent $(n,k)$ at time step $t$. The DQN is trained by the dataset $D = \{e^1_{1,1}, e^1_{1,2}, \ldots, e^t_{n,k}, \ldots\}$, which describes the agents' relation with their environment. The key to using the DQN for solving (5) is to model the decision variables as the actions of the agents. Obviously, the normalized downlink transmission power $p_{n,k}$ is the decision variable for the SE; therefore the action of agent $(n,k)$ is $p_{n,k}$. We define $p^t_{n,k}$ as the action of agent $(n,k)$ at time step $t$. The agent $(n,k)$ takes its action according to the current state $s^t_{n,k}$, which features the independent variables. From (4) we find that the large-scale fading is the independent variable for the SE; therefore the large-scale fading matrix at time step $t$,
$$
\boldsymbol{\beta}^t =
\begin{pmatrix}
\beta^t_{1,1} & \beta^t_{1,2} & \cdots & \beta^t_{1,K} \\
\beta^t_{2,1} & \beta^t_{2,2} & \cdots & \beta^t_{2,K} \\
\vdots & \vdots & \ddots & \vdots \\
\beta^t_{N,1} & \beta^t_{N,2} & \cdots & \beta^t_{N,K}
\end{pmatrix} \qquad (14)
$$
is a key element of $s^t_{n,k}$. The objective function, which describes the target of the agents, is defined as the reward, i.e., the downlink sum-SE achieved in each time step. Based on the above analysis, the elements of the experience $e^t_{n,k}$ for CF mmWave massive MIMO power allocation are defined as follows:
1) State $s^t_{n,k}$: The signal-to-interference-plus-noise ratio (SINR) is the key element of the SE. The signal in the SINR of UE $k$ comes from the agent set $\{(1,k), (2,k), \ldots, (N,k)\}$, while the interference in the SINR for agent $(n,k)$ mainly comes from the agent set $\{(n,1), (n,2), \ldots, (n,K)\}$. Therefore, for agent (n,