istic, serially correlated sequences. One way to in-
crease the efficiency of exploration is simply to add
reverberation to the noise, thus making it serially cor-
related. White Gaussian noise with reverberation is
equivalent to the Ornstein-Uhlenbeck process (Uhlen-
beck and Ornstein, 1930), which is the velocity of a
Brownian particle with momentum and drag. In dis-
crete time and without the need to incorporate physi-
cal quantities it can be written in its simplest form as
Gaussian noise with a decay factor κ ∈ [0, 1] for the
reverberation:
z_t = κ z_{t−1} + g_t    (3)
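As a minimal illustration (not part of the original paper), serially correlated exploration noise of this form can be generated as follows; the decay factor kappa and the Gaussian standard deviation sigma are assumed hyperparameters:

import numpy as np

def ou_noise(n_steps, kappa=0.9, sigma=0.1, rng=None):
    # Discrete Ornstein-Uhlenbeck-style noise: z_t = kappa * z_{t-1} + g_t,
    # where g_t is a fresh zero-mean Gaussian sample with standard deviation sigma.
    rng = np.random.default_rng() if rng is None else rng
    z = np.zeros(n_steps)
    for t in range(1, n_steps):
        z[t] = kappa * z[t - 1] + rng.normal(0.0, sigma)
    return z

# Setting kappa = 0 recovers plain white Gaussian exploration noise.
exploration = ou_noise(1000, kappa=0.9, sigma=0.1)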
The Ornstein-Uhlenbeck process (OUP) has re-
cently been applied in RL algorithms other than CA-
CLA (Lillicrap et al., 2015). In section 4.1 of this
paper the performances resulting from the different
noise types are compared for CACLA.
2.3 Monte Carlo CACLA
A further possibility to improve CACLA’s perfor-
mance, unrelated to the described exploration noise,
is to let it record all of its observations, rewards, and
actions within an episode. At the end of an episode,
in our case lasting less than 1000 time steps, this
recorded data is used to compute the exact returns R_t for each time step within the episode. The recorded observations can then be used as the input objects in a training set, where the Critic is trained with the target R_t, and if R_t − V_t(s_t) > 0, then the Actor is trained as well with the target a_t. We call this novel algorithm
Monte Carlo (MC) CACLA, and it is based on Monte
Carlo (MC) learning as an alternative to temporal dif-
ference (TD) learning (Sutton and Barto, 1998).
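A minimal sketch of this end-of-episode update (our illustration; the discount factor γ and the Critic/Actor training interfaces are assumptions, not taken from the paper):

import numpy as np

def monte_carlo_returns(rewards, gamma=0.99):
    # Exact discounted returns R_t = r_{t+1} + gamma * R_{t+1}, computed
    # backwards over one recorded episode; rewards[t] holds r_{t+1}.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mc_cacla_episode_update(states, actions, rewards, critic, actor, gamma=0.99):
    # The Critic is trained towards R_t; the Actor is trained towards a_t
    # only where R_t - V(s_t) > 0. critic.predict/critic.train and
    # actor.train are assumed MLP interfaces.
    returns = monte_carlo_returns(rewards, gamma)
    for s_t, a_t, R_t in zip(states, actions, returns):
        if R_t - critic.predict(s_t) > 0:   # action was better than expected
            actor.train(s_t, a_t)           # target: the executed action
        critic.train(s_t, R_t)              # target: the exact return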
As shown later in section 4.2, the performance of MC-CACLA is worse than that of the original CACLA. We theorize about the possible cause and find the same problem, to a lesser degree, in the original CACLA. These new insights led us to a corrected version of CACLA that is described in the following subsection.
2.4 Corrected CACLA
Consider a reward r_{t+1} that is given as a spike in time, i.e., the reward is an outlier compared to the rewards of its adjacent/ambient time steps. In CACLA the Critic takes the current sensory input and outputs V_t(s_t), which is an approximation of r_{t+1} + γ V_t(s_{t+1}). If
the Critic is unable to reach sufficient precision to dis-
criminate this time step with the spike from its ambi-
ent time steps then its approximation will be blurred
in time. This imprecision can cause the TD-error δ_t (equation 2) to be negative in all ambient time steps within the range of the blur around a positive reward spike, or, vice versa, to be positive in all ambient time steps within the range of the blur around a negative reward spike, as illustrated in Figure 1. In such a case
the positive TD-error is not indicative of the previous
actions having been better than expected. The original CACLA does not make this distinction and learns the previous actions regardless. These might be the
very actions that led to the negative reward spike by
crashing the airplane. This is a weakness of CACLA
that can be corrected to some extent by the following
algorithm.
Besides the Actor and the Critic, our Corrected CACLA uses a third MLP, D_t, with the same inputs and the same number of hidden neurons. Its only output neuron uses a linear activation function. Like the Critic, it is trained at every time step, but its target to be approximated is log(|δ_t| + ε), with the state vector s_t as input. The output of this MLP can be interpreted
as a prediction and thus as an expected value:
D_t(s) = E[ log(|δ_t| + ε) | s_t = s ]
where ε is a small positive constant that is added to avoid the computation of log(0). We have set the parameter ε to 10^−5. With Jensen's inequality (Jensen, 1906) a lower bound for the expected absolute TD-error can be obtained:
E[ |δ_t| | s_t = s ] ≥ exp(D_t(s)) − ε

This follows from the convexity of the exponential function: E[|δ_t| + ε | s_t = s] = E[exp(log(|δ_t| + ε)) | s_t = s] ≥ exp(E[log(|δ_t| + ε) | s_t = s]) = exp(D_t(s)).
D_t estimates the logarithm of |δ_t| instead of |δ_t| itself. This allows for increased accuracy across a wider range of orders of magnitude and also lowers the impact of the spike on the training of D_t. The
advantage was confirmed by an increased flight per-
formance during preliminary tests.
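As an illustrative sketch (with an assumed MLP interface D.train/D.predict and ε = 10^−5 as above), training D_t and recovering the lower bound on the expected absolute TD-error could look like this:

import numpy as np

EPSILON = 1e-5  # small constant to avoid log(0)

def train_delta_estimator(D, s_t, delta_t):
    # Train the third MLP towards the target log(|delta_t| + epsilon),
    # with the state vector s_t as input.
    D.train(s_t, np.log(abs(delta_t) + EPSILON))

def expected_abs_td_error(D, s_t):
    # Lower bound on E[|delta_t| | s_t] from Jensen's inequality:
    # E[|delta_t|] >= exp(D(s_t)) - epsilon.
    return np.exp(D.predict(s_t)) - EPSILON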
If D_t learns to predict a high absolute TD-error for an area of the state-space, this indicates that in this area the absolute TD-error has been repeated on a regular basis and is thus not due to an improved action but due to an inaccuracy of the Critic. Hence we modify CACLA's learning rule in the following way. In the original CACLA algorithm the Actor is trained on the last action only if δ_t > 0. In the Corrected CACLA algorithm the same rule is used, except that this condition is substituted by δ_t > E[|δ_t|], where the latter value is estimated from the output of the third MLP via the lower bound exp(D_t(s_t)) − ε. Note that this rule only improves the performance around negative reward spikes, thus potentially improving the flight safety. The experiments that were conducted to assess the performance of this novel algorithm are described in the following section.
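To summarize the modified learning rule, a minimal sketch of one Corrected CACLA time step, reusing the helper functions from the previous listing and the same assumed MLP interfaces, could look as follows:

def corrected_cacla_step(actor, critic, D, s_t, a_t, r_next, s_next, gamma=0.99):
    # Standard CACLA TD-error (equation 2): delta_t = r_{t+1} + gamma*V(s_{t+1}) - V(s_t).
    td_target = r_next + gamma * critic.predict(s_next)
    delta_t = td_target - critic.predict(s_t)
    critic.train(s_t, td_target)                 # Critic update as in CACLA
    train_delta_estimator(D, s_t, delta_t)       # see the previous listing
    if delta_t > expected_abs_td_error(D, s_t):  # corrected condition (original: delta_t > 0)
        actor.train(s_t, a_t)                    # train the Actor towards the executed action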