CONTINUOUS ACTION REINFORCEMENT LEARNING AUTOMATA
Performance and Convergence

Abdel Rodríguez (1,2), Ricardo Grau (1)
(1) Bioinformatics Lab, Center of Studies on Informatics, Central University of Las Villas, Santa Clara, Cuba
(2) Computational Modeling Lab, Vrije Universiteit Brussel, Brussels, Belgium

Ann Nowé
Computational Modeling Lab, Vrije Universiteit Brussel, Brussels, Belgium
Keywords:
CARLA, Convergence, Performance.
Abstract:
Reinforcement Learning is a powerful technique for agents to solve unknown Markovian Decision Processes from the possibly delayed signals that they receive. Most RL work, in particular for multi-agent settings, assumes a discrete action set. Learning automata are reinforcement learners, belonging to the category of policy iterators, that exhibit nice convergence properties in discrete action settings. Unfortunately, many applications require continuous actions. A formulation for a continuous action reinforcement learning automaton already exists, but it comes with no guarantee of convergence to optimal decisions. This paper proposes an improvement of the performance of the method, as well as a proof of local convergence.
1 INTRODUCTION
Since Artificial Intelligence emerged, it has been trying to emulate human behavior with the hope that some day computers will learn how to act as perfect humans. Reinforcement Learning is the way animals learn how to maximize their profit in certain situations. It is based on random but not uniform exploration. The basis of Reinforcement Learning is to explore actions and reinforce positively those that resulted in a good outcome for the learner, and reinforce negatively the ones that produced bad results.
The mathematical abstraction of this learning has already been formulated for discrete actions, but in many engineering applications it is necessary to control continuous parameters. Continuous formulations of Reinforcement Learning are not as well developed as discrete action learners. For single agents there is already quite a lot of work on continuous action learning, but not much work has been done in multi-agent settings.
This paper analyzes the performance of the Continuous Action Reinforcement Learning Automaton (CARLA) (Howell et al., 1997) with respect to its usefulness for future exploration in Multi-Agent Systems (MAS). The classical definition of a random variable (RV) will be used, as well as the probability integral transformation for the generation of random numbers following a given probability distribution (Parzen, 1960). The next section introduces the LA and, as a first contribution of this paper, subsection 2.2 shows how the numerical calculations can be reduced by some mathematical derivations. Subsection 2.3 then introduces the local convergence proof, as well as a way to manage the λ parameter to improve this convergence, as a second contribution. To support the theoretical results, some experiments are presented in section 3. Finally, conclusions and future work are stated in section 4.
2 LEARNING AUTOMATA
The learning automaton is a simple model for adap-
tive decision making in unknown random environ-
ments. The concept of a Learning Automaton (LA)
originated in the domain of mathematical psychology
(Bush and Mosteller, 1955) where it was used to an-
alyze the behavior of human beings from the view-
point of psychologists and biologists (Hilgard, 1948;
Hilgard and Bower, 1966).
The engineering research on LA started in the
early 1960’s (Tsetlin, 1961; Tsetlin, 1962). Tsetlin
and his colleagues formulated the objective of learn-
ing as an optimization of some mathematical perfor-
mance index (Thathachar and Sastry, 2004; Tsypkin,
1971; Tsypkin, 1973):
J(A) = \int_A R(a, A)\, dP    (1)
where R(a,A) is a function of an observation vector
a with A the space of all possible actions. The perfor-
mance index J is the expectation of R with respect to
the distribution P. This distribution includes random-
ness in a and randomness in R.
The model is well presented by the follow-
ing example introduced by Thathachar and Sastry
(Thathachar and Sastry, 2004). Consider a student
and a teacher. The student is posed a question and
is given several alternative answers. The student
can pick one of the alternatives, following which the
teacher responds yes or no. This response is probabilistic: the teacher may say yes for wrong alternatives and vice versa. The student is expected to learn
the correct alternative through such repeated interac-
tions. While the problem is ill-posed with this gener-
ality, it becomes solvable with an added condition. It
is assumed that the probability of the teacher saying
‘yes’ is maximum for the correct alternative.
All in all, LA are useful in applications that in-
volve optimization of a function which is not com-
pletely known, in the sense that only noise-corrupted values of the function for any specific values of the arguments are observable (Thathachar and Sastry, 2004). Some standard implementations are introduced below.
2.1 Learning Automata Implementations
The first implementation we would like to refer to is the Continuous Action Learning Automata (CALA) introduced by Thathachar and Sastry (Thathachar and Sastry, 2004). The authors implemented P as the normal probability distribution with mean \mu_t and standard deviation \sigma_t. At every time step t an action is selected according to a normal distribution N(\mu_t, \sigma_t). Then, after exploring action a_t and observing signal \beta_t(a_t), the update rules (2) and (3) are applied, resulting in new values \mu_{t+1} and \sigma_{t+1}.
\mu_{t+1} = \mu_t + \lambda \, \frac{\beta_t(a_t) - \beta_t(\mu_t)}{\max(\sigma_t, \sigma_L)} \, \frac{a_t - \mu_t}{\max(\sigma_t, \sigma_L)}    (2)

\sigma_{t+1} = \sigma_t + \lambda \, \frac{\beta_t(a_t) - \beta_t(\mu_t)}{\max(\sigma_t, \sigma_L)} \left[ \left( \frac{a_t - \mu_t}{\max(\sigma_t, \sigma_L)} \right)^2 - 1 \right] - \lambda K (\sigma_t - \sigma_L)    (3)

where \lambda is the learning parameter controlling the step size (0 < \lambda < 1), K is a large positive constant and \sigma_L is a lower bound on \sigma. The authors also provided a convergence proof for this automaton and tested it in games with multiple learners.
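As a minimal sketch of how update rules (2) and (3) could be implemented, the snippet below performs one CALA iteration. The function name, the reward callable beta and the default parameter values are illustrative assumptions, not taken from the paper.

import math
import random

def cala_step(mu, sigma, beta, lam=0.1, K=1.0, sigma_L=0.01):
    """One CALA iteration following update rules (2) and (3).

    beta is assumed to be a (possibly noisy) reward function that can be
    queried for both the explored action and the current mean."""
    a = random.gauss(mu, sigma)           # sample action from N(mu, sigma)
    s = max(sigma, sigma_L)               # lower-bounded standard deviation
    scaled = (a - mu) / s
    diff = (beta(a) - beta(mu)) / s       # needs feedback for a_t AND mu_t
    mu_new = mu + lam * diff * scaled                      # rule (2)
    sigma_new = sigma + lam * diff * (scaled ** 2 - 1) \
                - lam * K * (sigma - sigma_L)              # rule (3)
    return mu_new, max(sigma_new, sigma_L)  # clipping is an added safeguard

Note that, exactly as discussed below, the sketch has to query beta twice per step, once for a_t and once for \mu_t.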
Notice that this first formulation, presented in expressions (2) and (3), works with a parametric Probability Density Function (PDF), so it is simple and fast to incorporate the signal into the knowledge of the automaton. Thathachar and Sastry (Thathachar and Sastry, 2004) introduced several examples of how to manage a game of multiple automata, meaning that these automata can be used for controlling multiple variables in a MAS. Notice that the update rule needs information about the response of the environment for the selected action a_t, but it also needs the feedback for the action corresponding to the mean of the probability distribution, being \mu_t. In most practical engineering problems it is impossible to explore both actions. Additionally, the convergence proof requires the function to optimize to be integrable, and the minimal achievable standard deviation \sigma_L is very sensitive to noise: the stronger the noise, the higher the lower bound \sigma_L. These constraints are really restrictive for practical applications.
The second implementation we would like to recall is the CARLA (Howell et al., 1997). The authors also implemented P as a PDF, but a nonparametric one this time. Starting with the uniform distribution over the whole action space A, after exploring action a_t \in A at time step t the PDF is updated as (4) shows.
f_{t+1}(a) = \begin{cases} \gamma_t \left[ f_t(a) + \beta_t(a_t)\, \alpha\, e^{-\frac{1}{2} \left( \frac{a - a_t}{\lambda} \right)^2} \right] & a \in A \\ 0 & a \notin A \end{cases}    (4)
This second formulation (4) avoids the unnecessary extra exploration, and the function to optimize is not required to be integrable, just not chaotic. The problem is that it controls the action selection strategy of the automaton with a nonparametric PDF, so it becomes computationally very expensive. The solution is to numerically approximate the function but still, some heavy numerical calculations are necessary for \gamma_t. No convergence proof is given either.
If the computational cost of this method could be decreased and a convergence proof shown, then the CALA introduced by Thathachar and Sastry could be substituted by the CARLA, providing a better way for solving practical problems with a MAS approach.
2.2 CARLA Update Rule
Let us restart the analysis of expression (4) in order to look for possible reductions of the computational cost. Let a^- = \min(A) and a^+ = \max(A) be the minimum and maximum possible actions. The normalization factor \gamma_t can be computed by expression (5).
\gamma_t = \frac{1}{\int_{a^-}^{a^+} \left[ f_t(a) + \beta_t(a_t)\, \alpha\, e^{-\frac{1}{2} \left( \frac{a - a_t}{\lambda} \right)^2} \right] da}    (5)
The original formulation is for the bounded continuous action space A, which makes an analytical calculation of the normalisation factor \gamma unlikely. Let us relax this constraint and work over the unbounded continuous action space \mathbb{R}. Then the PDF update rule introduced in (4) should be redefined as (6), where f_{N(a_t, \lambda)} is the normal PDF with mean a_t and standard deviation \lambda. Analogously, (5) is transformed into (7). Notice that numerical integration is no longer needed for calculating \gamma_t.
f_{t+1}(a) = \gamma_t \left[ f_t(a) + \beta_t(a_t)\, \alpha\, e^{-\frac{1}{2} \left( \frac{a - a_t}{\lambda} \right)^2} \right] = \gamma_t \left[ f_t(a) + \beta_t(a_t)\, \alpha \lambda \sqrt{2\pi}\, f_{N(a_t, \lambda)}(a) \right]    (6)

\gamma_t = \frac{1}{\int_{-\infty}^{+\infty} \left[ f_t(a) + \beta_t(a_t)\, \alpha \lambda \sqrt{2\pi}\, f_{N(a_t, \lambda)}(a) \right] da} = \frac{1}{1 + \beta_t(a_t)\, \alpha \lambda \sqrt{2\pi}}    (7)
Let \delta_t, introduced in (8), be the extra area added to the PDF by stretching the curve beyond the interval [a^-, a^+]; then (7) can be written as (9) and (6) as (10).

\delta_t = \beta_t(a_t)\, \alpha \lambda \sqrt{2\pi}    (8)

\gamma_t = \frac{1}{1 + \delta_t}    (9)

f_{t+1}(a) = \gamma_t \left[ f_t(a) + \delta_t\, f_{N(a_t, \lambda)}(a) \right]    (10)
In order to generate actions following the policy f_t, the cumulative distribution function (CDF) is needed (Parzen, 1960), which is introduced in (11).
F_{t+1}(a) = \int_{-\infty}^{a} f_{t+1}(z)\, dz = \int_{-\infty}^{a} \gamma_t \left[ f_t(z) + \delta_t\, f_{N(a_t, \lambda)}(z) \right] dz = \gamma_t \left[ F_t(a) + \delta_t\, F_{N(a_t, \lambda)}(a) \right] = \gamma_t \left[ F_t(a) + \delta_t\, F_{N(0,1)}\!\left( \frac{a - a_t}{\lambda} \right) \right]    (11)
Although there is no analytical expression for the normal CDF F_{N(\mu, \sigma)}, it can be approximated by means of numerical integration. So numerical integration is still needed, but only a single CDF is required, namely F_{N(0,1)}, which can be calculated once at the beginning of learning; there is no further need for integration during the learning process.
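The action generation itself is the probability integral transformation mentioned in the introduction: draw u uniformly in [0, 1] and invert the CDF. The helper below is a minimal sketch under the assumption that the automaton exposes its current CDF as a monotone callable F; the bisection tolerance and the names are illustrative, not taken from the paper.

import random

def sample_from_cdf(F, a_min, a_max, tol=1e-6):
    """Draw u ~ U(0, 1) and invert the monotone CDF F on [a_min, a_max]
    by bisection (probability integral transformation)."""
    u = random.random()
    lo, hi = a_min, a_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)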
Finally, the original constraint has to be met for practical solutions, that is, \forall t : a_t \in A, see (4). So \gamma_t and F_{t+1}, defined in (9) and (11), should be transformed as shown in (12) and (13), where F^{diff}_t(x, y) = F_{N(0,1)}\!\left( \frac{y - a_t}{\lambda} \right) - F_{N(0,1)}\!\left( \frac{x - a_t}{\lambda} \right).
\gamma_t = \frac{1}{1 + \delta_t\, F^{diff}_t(a^-, a^+)}    (12)

F_{t+1}(a) = \begin{cases} 0 & a < a^- \\ \gamma_t \left[ F_t(a) + \delta_t\, F^{diff}_t(a^-, a) \right] & a \in A \\ 1 & a > a^+ \end{cases}    (13)
For a practical implementation of this method, equations (8), (12) and (13) are sufficient to avoid numerical integration, saving a lot of calculation time during the learning process. We would like to stress that without this reformulation the method was computationally too heavy to be applied in practice, but with this change it becomes computationally feasible. In the next subsection we will perform an analysis of the \lambda used in expression (8), which will result in better convergence properties.
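To make the reformulation concrete, the sketch below maintains F_t exactly in the form implied by (13): the initial uniform CDF plus one weighted, truncated-Gaussian term per past update, each rescaled by \gamma_t at every step. This is a minimal sketch, not the authors' implementation; the class name, the use of math.erf in place of a precomputed F_{N(0,1)} table, and the growing list of kernels are all assumptions made for illustration.

import math

def phi(x):
    # standard normal CDF F_{N(0,1)}; math.erf stands in for the
    # precomputed table suggested in the text
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

class CarlaPolicy:
    """Sketch of the reformulated CARLA update, equations (8), (12), (13).
    F_t is kept as the initial uniform CDF plus weighted Gaussian terms."""

    def __init__(self, a_min, a_max, alpha=0.1, lam=0.2):
        self.a_min, self.a_max = a_min, a_max
        self.alpha, self.lam = alpha, lam
        self.w_uniform = 1.0      # weight of the initial uniform CDF
        self.kernels = []         # (weight, a_t, lambda_t) per past update

    def _fdiff(self, x, y, a_t, lam):
        # F^diff_t(x, y) = Phi((y - a_t)/lam) - Phi((x - a_t)/lam)
        return phi((y - a_t) / lam) - phi((x - a_t) / lam)

    def cdf(self, a):
        # equation (13): 0 below a^-, 1 above a^+, mixture in between
        if a <= self.a_min:
            return 0.0
        if a >= self.a_max:
            return 1.0
        value = self.w_uniform * (a - self.a_min) / (self.a_max - self.a_min)
        for w, a_t, lam in self.kernels:
            value += w * self._fdiff(self.a_min, a, a_t, lam)
        return value

    def update(self, a_t, beta):
        # equation (8): extra area added by the Gaussian bell
        delta = beta * self.alpha * self.lam * math.sqrt(2.0 * math.pi)
        # equation (12): normalization restricted to [a^-, a^+]
        gamma = 1.0 / (1.0 + delta * self._fdiff(self.a_min, self.a_max,
                                                 a_t, self.lam))
        # equation (13): rescale all existing terms and add the new bell
        self.w_uniform *= gamma
        self.kernels = [(w * gamma, c, l) for (w, c, l) in self.kernels]
        self.kernels.append((gamma * delta, a_t, self.lam))

Actions can then be drawn by inverting cdf with the bisection sampler sketched earlier, and the observed reward fed back through update. In practice the growing kernel list would be replaced by a numerical approximation of F_t on a grid; the mixture form is kept here only to mirror (13) literally.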
2.3 CARLA Convergence
The analysis will be performed for normalized reward signals \beta taking values in [0, 1]; no generality is lost, because any closed interval can be mapped to this interval by a linear transformation. The final goal of this analysis is to find the restrictions necessary to guarantee convergence to local optima.
The sequence of PDF updates is a Markovian process where, at each time-step t, an action a_t \in A is selected and a new f_t is returned. At each time-step t, f_t is updated as shown in expression (10). The expected value \bar{f}_{t+1} of f_{t+1} can be computed following equation (14).

\bar{f}_{t+1}(a) = \int_{-\infty}^{+\infty} f_t(z)\, f_{t+1}(a \mid a_t = z)\, dz    (14)
Let \gamma_t^z = \gamma_t \mid a_t = z be the value of \gamma_t if a_t = z, and \bar{\gamma}_t the expected value of \gamma_t; then (14) can be rewritten as (15).
\bar{f}_{t+1}(a) = \int_{-\infty}^{+\infty} f_t(z)\, \gamma_t^z \left[ f_t(a) + \alpha \beta_t(z)\, e^{-\frac{1}{2} \left( \frac{a - z}{\lambda} \right)^2} \right] dz = f_t(a)\, \bar{\gamma}_t + \alpha \int_{-\infty}^{+\infty} f_t(z)\, \gamma_t^z\, \beta_t(z)\, e^{-\frac{1}{2} \left( \frac{a - z}{\lambda} \right)^2} dz    (15)
Let us have a look at the right member of the integral. f_t(z) is multiplied by a factor composed of the normalization factor given that a_t = z, the feedback signal \beta_t(z), and the distance measure e^{-\frac{1}{2} \left( \frac{a - z}{\lambda} \right)^2}, which can be interpreted as the strength of the relation between actions a and z: the higher the value of this product, the stronger the relation between these actions. Let us call this composed factor G_t(a, z) and \bar{G}_t(a) its expected value at time-step t with respect to z. Then equation (15) can finally be formulated as (16).
\bar{f}_{t+1}(a) = f_t(a)\, \bar{\gamma}_t + \alpha\, \bar{G}_t(a) = f_t(a) \left[ \bar{\gamma}_t + \frac{\alpha\, \bar{G}_t(a)}{f_t(a)} \right]    (16)
The sign of the first derivative of f_t depends on the factor \bar{\gamma}_t + \frac{\alpha \bar{G}_t(a)}{f_t(a)} of expression (16), so it behaves as shown in (17).
\frac{\partial f_t}{\partial t} \begin{cases} < 0 & \bar{\gamma}_t + \frac{\alpha \bar{G}_t(a)}{f_t(a)} < 1 \\ = 0 & \bar{\gamma}_t + \frac{\alpha \bar{G}_t(a)}{f_t(a)} = 1 \\ > 0 & \bar{\gamma}_t + \frac{\alpha \bar{G}_t(a)}{f_t(a)} > 1 \end{cases}    (17)
Notice that \bar{\gamma}_t is a constant for all a \in A and \int_{-\infty}^{+\infty} f_t(z)\, dz = 1, so:

\exists b_1, b_2 \in A : \frac{\bar{G}_t(b_1)}{f_t(b_1)} \neq \frac{\bar{G}_t(b_2)}{f_t(b_2)} \implies \exists A^+, A^- \subset A,\ A^+ \cap A^- = \emptyset,\ \exists a^+ \in A^+, a^- \in A^- : \frac{\partial f_t(a^+)}{\partial t} > 0 \wedge \frac{\partial f_t(a^-)}{\partial t} < 0    (18)
From logical implication (18) it can be assured that the sign of \frac{\partial f_t(a)}{\partial t} is determined by the ratio \frac{\bar{G}_t(a)}{f_t(a)}. Notice that the subsets A^+ and A^- are composed of the elements of A that have not reached the value of the probability density function that is in equilibrium with \bar{G}_t(a). That is, the subset A^+ is composed of all a \in A having a probability density value which is too small with respect to \bar{G}_t(a), and vice versa for A^-.
Let a^* \in A be the action that yields the highest value of \int_{-\infty}^{+\infty} \beta_t(z)\, e^{-\frac{1}{2} \left( \frac{a - z}{\lambda} \right)^2} dz for all time-steps, as shown in (19). It is important to stress that a^* is not the optimum of \beta_t but the point yielding the optimal vicinity around it, defined by e^{-\frac{1}{2} \left( \frac{a^* - z}{\lambda} \right)^2}, which depends on \lambda.
\forall t, \forall a \in A : \int_{-\infty}^{+\infty} \beta_t(z)\, e^{-\frac{1}{2} \left( \frac{a^* - z}{\lambda} \right)^2} dz \geq \int_{-\infty}^{+\infty} \beta_t(z)\, e^{-\frac{1}{2} \left( \frac{a - z}{\lambda} \right)^2} dz    (19)
It is a fact that \forall a \in A : \bar{G}_t(a) \leq \bar{G}_t(a^*), and since the first derivative depends on \frac{\bar{G}_t(a)}{f_t(a)}, the value of f_t(a^*) necessary for keeping \frac{\partial f_t(a^*)}{\partial t} = 0 is also higher than any other f_t(a):

\forall a \in A : f_t(a) \geq f_t(a^*) \implies \frac{\partial f_t(a)}{\partial t} < \frac{\partial f_t(a^*)}{\partial t}    (20)
Notice that the maximum update that f_t(a^*) may receive is obtained when a_t = a^* (centering the bell at a^*); then, if f_t(a^*) reaches the value \frac{1}{\lambda \sqrt{2\pi}}, its first derivative will not be higher than 0, as (21) shows, since \beta takes values in [0, 1].
f_{t+1}(a^*) = \gamma_t \left[ f_t(a^*) + \beta_t(a_t)\, \alpha\, e^{-\frac{1}{2} \left( \frac{a^* - a_t}{\lambda} \right)^2} \right] \leq \frac{1}{1 + \alpha \lambda \sqrt{2\pi}} \left[ \frac{1}{\lambda \sqrt{2\pi}} + \alpha \right] = \frac{1}{1 + \alpha \lambda \sqrt{2\pi}} \cdot \frac{1 + \alpha \lambda \sqrt{2\pi}}{\lambda \sqrt{2\pi}} = \frac{1}{\lambda \sqrt{2\pi}}    (21)
Then the equilibrium point of f_t(a^*) has the upper bound \frac{1}{\lambda \sqrt{2\pi}}. Notice that the closer \beta_t(a^*) is to 1, the closer the equilibrium point of f_t(a^*) is to its upper bound.
\frac{\partial f_t(a^*)}{\partial t} \begin{cases} < 0 & f_t(a^*) > \frac{1}{\lambda \sqrt{2\pi}} \\ = 0 & f_t(a^*) = \frac{1}{\lambda \sqrt{2\pi}} \\ > 0 & f_t(a^*) < \frac{1}{\lambda \sqrt{2\pi}} \end{cases}    (22)
We can conclude from (20) and (22) that the highest value of f will be achieved at a^*, as shown in (23), which has the upper bound \frac{1}{\lambda \sqrt{2\pi}}.
\forall a \in A \setminus \{a^*\} : \lim_{t \to \infty} f_t(a) < f_t(a^*) \ \wedge\ \lim_{t \to \infty} f_t(a^*) \leq \frac{1}{\lambda \sqrt{2\pi}}    (23)
Finally,

\lim_{\lambda \to 0} \lim_{t \to \infty} f_t(a^*) = \infty \ \wedge\ \forall a \neq a^* : \lim_{\lambda \to 0} \lim_{t \to \infty} f_t(a) = 0    (24)
This analysis has been developed under really restrictive assumptions, such as t \to \infty, \lambda \to 0, \alpha small enough (the bigger \alpha, the bigger the first derivative of the probability law through time, allowing fast convergence, but also the bigger the difference between the actual probability law and its expected value), and a reward function that is noiseless enough to assure (19).
The best solution for the problem stated above about the constraints on \lambda is to start with a wide enough bell, allowing enough exploration, and to make it thinner as the learner approaches the optimum, in order to meet (24). A good measure of convergence is the standard deviation of the recently selected actions. When the standard deviation of the actions is close to the \lambda being used to update the probability density function, then the maximum value of f_t(a^*) has been reached, as stated in (23).
Since f_0 is the uniform density function, the standard deviation of the actions should start at \sigma_0 = \frac{\max(A) - \min(A)}{\sqrt{12}}. We propose to use expression (25) for the convergence value (conv_t) of the method given the standard deviation \sigma_t of the actions. Then, (26) can be used as the \lambda needed in equation (8) at each time-step t, improving the learning process of the automaton.

conv_t = 1 - \frac{\sqrt{12}\, \sigma_t}{\max(A) - \min(A)}    (25)

\lambda_t = \lambda (1 - conv_t)    (26)
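A small helper showing how (25) and (26) could drive \lambda during learning; this is a sketch only, where the sliding window of recent actions, the clamping of conv_t to [0, 1] and the function name are assumptions not taken from the paper.

import math
import statistics

def adapted_lambda(recent_actions, lam0, a_min, a_max):
    # sigma_t: standard deviation of the recently selected actions
    sigma_t = statistics.pstdev(recent_actions)
    # equation (25); clamped to [0, 1] as an added safeguard
    conv_t = 1.0 - math.sqrt(12.0) * sigma_t / (a_max - a_min)
    conv_t = min(max(conv_t, 0.0), 1.0)
    # equation (26): shrink the bell as convergence increases
    return lam0 * (1.0 - conv_t)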
3 EXPERIMENTAL RESULTS
In order to validate these ideas, the standard method will be tested against the new proposal in two kinds of scenarios: noiseless and noisy. All examples will be introduced in characteristic function form (von Neumann and Morgenstern, 1944). Formally, a characteristic function form game is given as a pair (N, v), where N denotes a set of players and v : S^N \to \mathbb{R} is the characteristic function, with S being the action space.
3.1 Noiseless Scenarios
Three examples will be introduced in this subsection: \langle \{la\}, \beta_i \rangle, \langle \{la\}, \beta_{ii} \rangle and \langle \{la\}, \beta_{iii} \rangle, where la is a learning automaton. Their analytical expressions are presented in (27), (28) and (29) respectively. Figure 1 shows them graphically. The union operator is defined as a \cup b = a + b - ab and the bell function as Bell(a, a_0, \sigma) = e^{-\frac{1}{2} \left( \frac{a - a_0}{\sigma} \right)^2}.

\beta_i(a_t) = Bell(a_t, 0.5, 0.2)    (27)

\beta_{ii}(a_t) = 0.8\, Bell(a_t, 0.2, 1)    (28)

\beta_{iii}(a_t) = \left( 0.9\, Bell(a_t, 0.2, 0.4) \right) \cup Bell(a_t, 0.9, 0.3)    (29)
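As an aid to reproducing the noiseless scenarios, (27)-(29) can be transcribed directly using the Bell and union definitions above; this listing is only an illustrative transcription, not the authors' code.

import math

def bell(a, a0, sigma):
    # Bell(a, a0, sigma) = exp(-0.5 * ((a - a0) / sigma)^2)
    return math.exp(-0.5 * ((a - a0) / sigma) ** 2)

def union(x, y):
    # union operator from the text: x u y = x + y - x*y
    return x + y - x * y

def beta_i(a):    # equation (27)
    return bell(a, 0.5, 0.2)

def beta_ii(a):   # equation (28)
    return 0.8 * bell(a, 0.2, 1.0)

def beta_iii(a):  # equation (29)
    return union(0.9 * bell(a, 0.2, 0.4), bell(a, 0.9, 0.3))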
Figure 1: Characteristic functions for scenarios i, ii and iii.
Figure 2 shows the average reward obtained over time. The selected learning rate was 0.1 and \lambda_0 = 0.2. The gray curve shows the rewards collected with the standard method and the black one shows the same but using the convergence measure to tune the bell through time. It is clear that the results obtained with the improvement show better convergence properties. These differences are most marked for the first scenario, which has a function that is very easy to learn. The differences for the other two scenarios are not as big.
Figure 2: Average rewards for the noiseless scenarios (average reward vs. iteration).
3.2 Noisy Scenarios
We can add some random noise to the previous formulations, as (30), (31) and (32) show. Figure 3 plots these functions.

\beta_{i'}(a_t) = 0.8\, \beta_i(a_t) + rand(0.2)    (30)

\beta_{ii'}(a_t) = 0.875\, \beta_{ii}(a_t) + rand(0.2)    (31)

\beta_{iii'}(a_t) = 0.8\, \beta_{iii}(a_t) + rand(0.2)    (32)
Figure 3: Reward functions for scenarios i', ii' and iii'.
Figure 4 shows the average rewards collected over time. The same parameter setting was used here. The results obtained here are similar to the ones of the previous subsection.
Table 1 sums up the results of 100 runs of the algorithms for the above-mentioned examples. Better results are observed for the new learner, since the improved method reduces \lambda through the learning process.
Figure 4: Average rewards for the noisy scenarios (average reward vs. iteration).
In the case of environments with a high noise level, \lambda cannot be reduced that much, and both methods give similar results.
Table 1: Average long-run reward.

                     standard   improved
  noiseless   i        0.62       0.96
              ii       0.85       0.86
              iii      0.80       0.88
  noisy       i'       0.60       0.87
              ii'      0.74       0.76
              iii'     0.74       0.75
A final remark. Notice that the standard automaton did not show good convergence properties for any of the scenarios introduced in this paper. Despite the new derivation of the method, which reduces \lambda as the learner reaches better convergence levels, for more difficult functions such as \langle \{la\}, \beta_{ii} \rangle and \langle \{la\}, \beta_{iii} \rangle (where the signals received for actions around the optimum are quite similar, or there are multiple optima) the learner does not converge to the optimum as fast as necessary. Future work should be focussed on amplifying the learner's perception of the signals to allow a more accurate convergence to the optimum.
4 CONCLUSIONS
Learning automata are reinforcement learners, be-
longing to the category of policy iterators, that ex-
hibit nice convergence properties in discrete action
settings. In this paper an improvement of the performance of the method was proposed in order to avoid unnecessary numerical integration, speeding up the calculations, as well as a proof of local convergence and a way to adjust the \lambda parameter during learning to speed up the learning itself.
In future work we want to investigate the conver-
gence of these LA in multi-agent settings. It has been
shown that a set of agents, each independently applying an LA update scheme, can converge to a Nash equilibrium in discrete action games. We
will study if this convergence result can be extended
to continuous action games.
ACKNOWLEDGEMENTS
This paper was developed under the Cuba-Flanders
collaboration within the VLIR project.
REFERENCES
Bush, R. and Mosteller, F. (1955). Stochastic Models for
Learning. Wiley.
Hilgard, E. (1948). Theories of Learning. New York:
Appleton-Century-Crofts.
Hilgard, E. and Bower, G. (1966). Theories of Learning.
New Jersey: Prentice Hall.
Howell, M. N., Frost, G. P., Gordon, T. J., and Wu, Q. H.
(1997). Continuous action reinforcement learning ap-
plied to vehicle suspension control. Mechatronics,
7(3):263 – 276.
Parzen, E. (1960). Modern Probability Theory and Its Applications. Wiley-Interscience, Wiley Classics edition.
Thathachar, M. A. L. and Sastry, P. S. (2004). Networks of
Learning Automata: Techniques for Online Stochastic
Optimization. Kluwer Academic Publishers.
Tsetlin, M. (1961). The behavior of finite automata in ran-
dom media. Avtomatika i Telemekhanika, pages 1345–
1354.
Tsetlin, M. (1962). The behavior of finite automata in ran-
dom media. Avtomatika i Telemekhanika, pages 1210–
1219.
Tsypkin, Y. Z. (1971). Adaptation and learning in auto-
matic systems. New York: Academic Press.
Tsypkin, Y. Z. (1973). Foundations of the theory of learning
systems. New York: Academic Press.
von Neumann, J. and Morgenstern, O. (1944). Theory of
games and economic behavior.