CONTINUOUS ACTION REINFORCEMENT LEARNING AUTOMATA
Performance and Convergence

Abdel Rodríguez (1,2), Ricardo Grau (1)
(1) Bioinformatics Lab, Center of Studies on Informatics, Central University of Las Villas, Santa Clara, Cuba
(2) Computational Modeling Lab, Vrije Universiteit Brussel, Brussels, Belgium

Ann Nowé
Computational Modeling Lab, Vrije Universiteit Brussel, Brussels, Belgium
Keywords:
CARLA, Convergence, Performance.
Abstract:
Reinforcement Learning is a powerful technique for agents to solve unknown Markovian Decision Processes from the possibly delayed signals that they receive. Most RL work, in particular for multi-agent settings, assumes a discrete action set. Learning automata are reinforcement learners, belonging to the category of policy iterators, that exhibit nice convergence properties in discrete action settings. Unfortunately, many applications require continuous actions. A formulation for a continuous action reinforcement learning automaton already exists, but it comes with no guarantee of convergence to optimal decisions. This paper proposes an improvement of the performance of the method, as well as a proof of local convergence.
1 INTRODUCTION
Since Artificial Intelligence emerged, it has been trying to emulate human behavior with the hope that some day computers will learn how to act as perfect humans. Reinforcement Learning is the way animals learn how to maximize their profit in certain situations. It is based on random but not uniform exploration. The basis of Reinforcement Learning is to explore actions and reinforce positively those that resulted in a good outcome for the learner, and reinforce negatively the ones that produced bad results.
The mathematical abstraction of this learning has already been formulated for discrete actions, but in many engineering applications it is necessary to control continuous parameters. Continuous formulations of Reinforcement Learning are not as well developed as discrete action learners. For single agents there is already quite a lot of work on continuous action learning, but not much work has been done in multi-agent settings.
This paper analyzes the performance of the Continuous Action Reinforcement Learning Automaton (CARLA) (Howell et al., 1997) with respect to its usefulness for future exploration in Multi-Agent Systems (MAS). The classical definition of a random variable (RV) will be used, as well as the probability integral transformation for the generation of random numbers following a given probability distribution (Parzen, 1960). The next section introduces the LA and, as a first contribution of this paper, subsection 2.2 shows how the numerical calculations can be reduced by some mathematical derivations. Subsection 2.3 then introduces the local convergence proof, as well as a way to manage the λ parameter to improve this convergence, as a second contribution. To support the theoretical results, some experiments are presented in section 3. Finally, conclusions and future work are stated in section 4.
2 LEARNING AUTOMATA
The learning automaton is a simple model for adap-
tive decision making in unknown random environ-
ments. The concept of a Learning Automaton (LA)
originated in the domain of mathematical psychology
(Bush and Mosteller, 1955) where it was used to an-
alyze the behavior of human beings from the view-
point of psychologists and biologists (Hilgard, 1948;
Hilgard and Bower, 1966).
The engineering research on LA started in the
early 1960’s (Tsetlin, 1961; Tsetlin, 1962). Tsetlin
and his colleagues formulated the objective of learn-
ing as an optimization of some mathematical perfor-
mance index (Thathachar and Sastry, 2004; Tsypkin,
1971; Tsypkin, 1973):
J(A) = \int_A R(a, A)\, dP    (1)
where R(a,A) is a function of an observation vector
a with A the space of all possible actions. The perfor-
mance index J is the expectation of R with respect to
the distribution P. This distribution includes random-
ness in a and randomness in R.
The model is well presented by the follow-
ing example introduced by Thathachar and Sastry
(Thathachar and Sastry, 2004). Consider a student
and a teacher. The student is posed a question and
is given several alternative answers. The student
can pick one of the alternatives, following which the
teacher responds yes or no. This response is probabilistic: the teacher may say yes for wrong alternatives and vice versa. The student is expected to learn
the correct alternative through such repeated interac-
tions. While the problem is ill-posed with this gener-
ality, it becomes solvable with an added condition. It
is assumed that the probability of the teacher saying
‘yes’ is maximum for the correct alternative.
All in all, LA are useful in applications that in-
volve optimization of a function which is not com-
pletely known, in the sense that only noise-corrupted values of the function for any specific values of the arguments are observable (Thathachar and Sastry, 2004). Some standard implementations are introduced below.
2.1 Learning Automata Implementations
The first implementation we would like to refer to is the Continuous Action Learning Automata (CALA) introduced by Thathachar and Sastry (Thathachar and Sastry, 2004). The authors implemented P as the normal probability distribution with mean \mu_t and standard deviation \sigma_t. At every time step t an action is selected according to a normal distribution N(\mu_t, \sigma_t). Then, after exploring action a_t and observing signal \beta_t(a_t), the update rules (2) and (3) are applied, resulting in new values \mu_{t+1} and \sigma_{t+1}.
\mu_{t+1} = \mu_t + \lambda \, \frac{\beta_t(a_t) - \beta_t(\mu_t)}{\max(\sigma_t, \sigma_L)} \, \frac{a_t - \mu_t}{\max(\sigma_t, \sigma_L)}    (2)

\sigma_{t+1} = \sigma_t + \lambda \, \frac{\beta_t(a_t) - \beta_t(\mu_t)}{\max(\sigma_t, \sigma_L)} \left[ \left( \frac{a_t - \mu_t}{\max(\sigma_t, \sigma_L)} \right)^2 - 1 \right] - \lambda K (\sigma_t - \sigma_L)    (3)

where \lambda is the learning parameter controlling the step size (0 < \lambda < 1), K is a large positive constant and \sigma_L is a lower bound on \sigma. The authors also provided a convergence proof for this automaton and tested it in games with multiple learners.
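As a minimal sketch of how update rules (2) and (3) could be implemented, the snippet below performs one CALA iteration. The function name, the reward callable beta and the default parameter values are illustrative assumptions, not taken from the paper.

import math
import random

def cala_step(mu, sigma, beta, lam=0.1, K=1.0, sigma_L=0.01):
    """One CALA iteration following update rules (2) and (3).

    beta is assumed to be a (possibly noisy) reward function that can be
    queried for both the explored action and the current mean."""
    a = random.gauss(mu, sigma)           # sample action from N(mu, sigma)
    s = max(sigma, sigma_L)               # lower-bounded standard deviation
    scaled = (a - mu) / s
    diff = (beta(a) - beta(mu)) / s       # needs feedback for a_t AND mu_t
    mu_new = mu + lam * diff * scaled                      # rule (2)
    sigma_new = sigma + lam * diff * (scaled ** 2 - 1) \
                - lam * K * (sigma - sigma_L)              # rule (3)
    return mu_new, max(sigma_new, sigma_L)  # clipping is an added safeguard

Note that, exactly as discussed below, the sketch has to query beta twice per step, once for a_t and once for \mu_t.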
Notice that this first formulation, presented in expressions (2) and (3), works with a parametric Probability Density Function (PDF), so it is simple and fast to incorporate the signal into the knowledge of the automaton. Thathachar and Sastry (Thathachar and Sastry, 2004) introduced several examples of how to manage a game of multiple automata, meaning that these automata can be used for controlling multiple variables in a MAS. Notice that the update rule needs information about the response of the environment for the selected action a_t, but it also needs the feedback for the action corresponding to the mean of the probability distribution, being \mu_t. In most practical engineering problems it is impossible to explore both actions. Additionally, the convergence proof requires the function to optimize to be integrable, and the minimal achievable standard deviation \sigma_L is very sensitive to noise: the stronger the noise, the higher the lower bound \sigma_L. These constraints are really restrictive for practical applications.
The second implementation we would like to recall is the CARLA (Howell et al., 1997). The authors also implemented P as a PDF, but a nonparametric one this time. Starting with the uniform distribution over the whole action space A, after exploring action a_t \in A at time step t the PDF is updated as (4) shows.
f_{t+1}(a) = \begin{cases} \gamma_t \left[ f_t(a) + \beta_t(a_t)\, \alpha\, e^{-\frac{1}{2} \left( \frac{a - a_t}{\lambda} \right)^2} \right] & a \in A \\ 0 & a \notin A \end{cases}    (4)
This second formulation (4) avoids the unnecessary extra exploration, and the function to optimize is not required to be integrable, just not chaotic. The problem is that it controls the action selection strategy of the automaton with a nonparametric PDF, so it becomes computationally very expensive. The solution is to numerically approximate the function but still, some heavy numerical calculations are necessary for \gamma_t. No convergence proof is given either.
If the computational cost of this method could be decreased and a convergence proof shown, then the CALA introduced by Thathachar and Sastry could be substituted by the CARLA, providing a better way for solving practical problems with a MAS approach.
2.2 CARLA Update Rule
Let us restart the analysis of expression (4) in order to look for possible reductions of the computational cost. Let a^- = \min(A) and a^+ = \max(A) be the minimum and maximum possible actions. The normalization factor \gamma_t can be computed by expression (5).
\gamma_t = \frac{1}{\int_{a^-}^{a^+} \left[ f_t(a) + \beta_t(a_t)\, \alpha\, e^{-\frac{1}{2} \left( \frac{a - a_t}{\lambda} \right)^2} \right] da}    (5)
The original formulation is for the bounded continuous action space A, which makes an analytical calculation of the normalisation factor \gamma unlikely. Let us relax this constraint and work over the unbounded continuous action space \mathbb{R}. Then the PDF update rule introduced in (4) should be redefined as (6), where f_{N(a_t, \lambda)} is the normal PDF with mean a_t and standard deviation \lambda. Analogously, (5) is transformed into (7). Notice that numerical integration is no longer needed for calculating \gamma_t.
f_{t+1}(a) = \gamma_t \left[ f_t(a) + \beta_t(a_t)\, \alpha\, e^{-\frac{1}{2} \left( \frac{a - a_t}{\lambda} \right)^2} \right] = \gamma_t \left[ f_t(a) + \beta_t(a_t)\, \alpha \lambda \sqrt{2\pi}\, f_{N(a_t, \lambda)}(a) \right]    (6)

\gamma_t = \frac{1}{\int_{-\infty}^{+\infty} \left[ f_t(a) + \beta_t(a_t)\, \alpha \lambda \sqrt{2\pi}\, f_{N(a_t, \lambda)}(a) \right] da} = \frac{1}{1 + \beta_t(a_t)\, \alpha \lambda \sqrt{2\pi}}    (7)
Let \delta_t, introduced in (8), be the extra area added to the PDF by stretching the curve beyond the interval [a^-, a^+]; then (7) can be written as (9) and (6) as (10).

\delta_t = \beta_t(a_t)\, \alpha \lambda \sqrt{2\pi}    (8)

\gamma_t = \frac{1}{1 + \delta_t}    (9)

f_{t+1}(a) = \gamma_t \left[ f_t(a) + \delta_t\, f_{N(a_t, \lambda)}(a) \right]    (10)
In order to generate actions following the policy f_t, the cumulative distribution function (CDF) is needed (Parzen, 1960), which is introduced in (11).
F_{t+1}(a) = \int_{-\infty}^{a} f_{t+1}(z)\, dz = \int_{-\infty}^{a} \gamma_t \left[ f_t(z) + \delta_t\, f_{N(a_t, \lambda)}(z) \right] dz = \gamma_t \left[ F_t(a) + \delta_t\, F_{N(a_t, \lambda)}(a) \right] = \gamma_t \left[ F_t(a) + \delta_t\, F_{N(0,1)}\!\left( \frac{a - a_t}{\lambda} \right) \right]    (11)
Although there is no analytical expression for the normal CDF F_{N(\mu, \sigma)}, it can be approximated by means of numerical integration. So numerical integration is still needed, but only a single CDF is required, namely F_{N(0,1)}, which can be calculated once at the beginning of learning; there is no further need for integration during the learning process.
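The action generation itself is the probability integral transformation mentioned in the introduction: draw u uniformly in [0, 1] and invert the CDF. The helper below is a minimal sketch under the assumption that the automaton exposes its current CDF as a monotone callable F; the bisection tolerance and the names are illustrative, not taken from the paper.

import random

def sample_from_cdf(F, a_min, a_max, tol=1e-6):
    """Draw u ~ U(0, 1) and invert the monotone CDF F on [a_min, a_max]
    by bisection (probability integral transformation)."""
    u = random.random()
    lo, hi = a_min, a_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)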
Finally, the original constraint has to be met for practical solutions, that is, \forall t : a_t \in A, see (4). So \gamma_t and F_{t+1}, defined in (9) and (11), should be transformed as shown in (12) and (13), where F^{diff}_t(x, y) = F_{N(0,1)}\!\left( \frac{y - a_t}{\lambda} \right) - F_{N(0,1)}\!\left( \frac{x - a_t}{\lambda} \right).
\gamma_t = \frac{1}{1 + \delta_t\, F^{diff}_t(a^-, a^+)}    (12)

F_{t+1}(a) = \begin{cases} 0 & a < a^- \\ \gamma_t \left[ F_t(a) + \delta_t\, F^{diff}_t(a^-, a) \right] & a \in A \\ 1 & a > a^+ \end{cases}    (13)
For a practical implementation of this method, equations (8), (12) and (13) are sufficient to avoid numerical integration, saving a lot of calculation time during the learning process. We would like to stress that without this reformulation the method was computationally too heavy to be applied in practice, but with this change it becomes computationally feasible. In the next subsection we will perform an analysis of the \lambda used in expression (8), which will result in better convergence properties.
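To make the reformulation concrete, the sketch below maintains F_t exactly in the form implied by (13): the initial uniform CDF plus one weighted, truncated-Gaussian term per past update, each rescaled by \gamma_t at every step. This is a minimal sketch, not the authors' implementation; the class name, the use of math.erf in place of a precomputed F_{N(0,1)} table, and the growing list of kernels are all assumptions made for illustration.

import math

def phi(x):
    # standard normal CDF F_{N(0,1)}; math.erf stands in for the
    # precomputed table suggested in the text
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

class CarlaPolicy:
    """Sketch of the reformulated CARLA update, equations (8), (12), (13).
    F_t is kept as the initial uniform CDF plus weighted Gaussian terms."""

    def __init__(self, a_min, a_max, alpha=0.1, lam=0.2):
        self.a_min, self.a_max = a_min, a_max
        self.alpha, self.lam = alpha, lam
        self.w_uniform = 1.0      # weight of the initial uniform CDF
        self.kernels = []         # (weight, a_t, lambda_t) per past update

    def _fdiff(self, x, y, a_t, lam):
        # F^diff_t(x, y) = Phi((y - a_t)/lam) - Phi((x - a_t)/lam)
        return phi((y - a_t) / lam) - phi((x - a_t) / lam)

    def cdf(self, a):
        # equation (13): 0 below a^-, 1 above a^+, mixture in between
        if a <= self.a_min:
            return 0.0
        if a >= self.a_max:
            return 1.0
        value = self.w_uniform * (a - self.a_min) / (self.a_max - self.a_min)
        for w, a_t, lam in self.kernels:
            value += w * self._fdiff(self.a_min, a, a_t, lam)
        return value

    def update(self, a_t, beta):
        # equation (8): extra area added by the Gaussian bell
        delta = beta * self.alpha * self.lam * math.sqrt(2.0 * math.pi)
        # equation (12): normalization restricted to [a^-, a^+]
        gamma = 1.0 / (1.0 + delta * self._fdiff(self.a_min, self.a_max,
                                                 a_t, self.lam))
        # equation (13): rescale all existing terms and add the new bell
        self.w_uniform *= gamma
        self.kernels = [(w * gamma, c, l) for (w, c, l) in self.kernels]
        self.kernels.append((gamma * delta, a_t, self.lam))

Actions can then be drawn by inverting cdf with the bisection sampler sketched earlier, and the observed reward fed back through update. In practice the growing kernel list would be replaced by a numerical approximation of F_t on a grid; the mixture form is kept here only to mirror (13) literally.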
2.3 CARLA Convergence
The analysis will be performed for normalized reward signals \beta taking values in [0, 1]; no generality is lost, because any closed interval can be mapped to this interval by a linear transformation. The final goal of this analysis is to find the restrictions necessary to guarantee convergence to local optima.
The sequence of PDF updates is a Markovian process where, at each time-step t, an action a_t \in A is selected and a new f_t is returned. At each time-step t, f_t is updated as shown in expression (10). The expected value \bar{f}_{t+1} of f_{t+1} can be computed following equation (14).

\bar{f}_{t+1}(a) = \int_{-\infty}^{+\infty} f_t(z)\, f_{t+1}(a \mid a_t = z)\, dz    (14)
Let \gamma_t^z = \gamma_t \mid a_t = z be the value of \gamma_t if a_t = z, and \bar{\gamma}_t the expected value of \gamma_t; then (14) can be rewritten as (15).
\bar{f}_{t+1}(a) = \int_{-\infty}^{+\infty} f_t(z)\, \gamma_t^z \left[ f_t(a) + \alpha \beta_t(z)\, e^{-\frac{1}{2} \left( \frac{a - z}{\lambda} \right)^2} \right] dz = f_t(a)\, \bar{\gamma}_t + \alpha \int_{-\infty}^{+\infty} f_t(z)\, \gamma_t^z\, \beta_t(z)\, e^{-\frac{1}{2} \left( \frac{a - z}{\lambda} \right)^2} dz    (15)
Let us have a look at the right member of the integral. f_t(z) is multiplied by a factor composed of the normalization factor given that a_t = z, the feedback signal \beta_t(z), and the distance measure e^{-\frac{1}{2} \left( \frac{a - z}{\lambda} \right)^2}, which can be interpreted as the strength of the relation between actions a and z: the higher the value of this product, the stronger the relation between these actions. Let us call this composed factor G_t(a, z) and \bar{G}_t(a) its expected value at time-step t with respect to z. Then equation (15) can finally be formulated as (16).
\bar{f}_{t+1}(a) = f_t(a)\, \bar{\gamma}_t + \alpha\, \bar{G}_t(a) = f_t(a) \left[ \bar{\gamma}_t + \frac{\alpha\, \bar{G}_t(a)}{f_t(a)} \right]    (16)
The sign of the first derivative of f_t depends on the factor \bar{\gamma}_t + \frac{\alpha \bar{G}_t(a)}{f_t(a)} of expression (16), so it behaves as shown in (17).
\frac{\partial f_t}{\partial t} \begin{cases} < 0 & \bar{\gamma}_t + \frac{\alpha \bar{G}_t(a)}{f_t(a)} < 1 \\ = 0 & \bar{\gamma}_t + \frac{\alpha \bar{G}_t(a)}{f_t(a)} = 1 \\ > 0 & \bar{\gamma}_t + \frac{\alpha \bar{G}_t(a)}{f_t(a)} > 1 \end{cases}    (17)
Notice that \bar{\gamma}_t is a constant for all a \in A and \int_{-\infty}^{+\infty} f_t(z)\, dz = 1, so:

\exists b_1, b_2 \in A : \frac{\bar{G}_t(b_1)}{f_t(b_1)} \neq \frac{\bar{G}_t(b_2)}{f_t(b_2)} \implies \exists A^+, A^- \subset A,\ A^+ \cap A^- = \emptyset,\ \exists a^+ \in A^+, a^- \in A^- : \frac{\partial f_t(a^+)}{\partial t} > 0 \wedge \frac{\partial f_t(a^-)}{\partial t} < 0    (18)
From logical implication (18) it can be assured that the sign of \frac{\partial f_t(a)}{\partial t} is determined by the ratio \frac{\bar{G}_t(a)}{f_t(a)}. Notice that the subsets A^+ and A^- are composed of the elements of A that have not reached the value of the probability density function that is in equilibrium with \bar{G}_t(a). That is, the subset A^+ is composed of all a \in A having a probability density value which is too small with respect to \bar{G}_t(a), and vice versa for A^-.
Let a^* \in A be the action that yields the highest value of \int_{-\infty}^{+\infty} \beta_t(z)\, e^{-\frac{1}{2} \left( \frac{a - z}{\lambda} \right)^2} dz for all time-steps, as shown in (19). It is important to stress that a^* is not the optimum of \beta_t but the point yielding the optimal vicinity around it, defined by e^{-\frac{1}{2} \left( \frac{a^* - z}{\lambda} \right)^2}, which depends on \lambda.
\forall t, \forall a \in A : \int_{-\infty}^{+\infty} \beta_t(z)\, e^{-\frac{1}{2} \left( \frac{a^* - z}{\lambda} \right)^2} dz \geq \int_{-\infty}^{+\infty} \beta_t(z)\, e^{-\frac{1}{2} \left( \frac{a - z}{\lambda} \right)^2} dz    (19)
It is a fact that \forall a \in A : \bar{G}_t(a) \leq \bar{G}_t(a^*), and since the first derivative depends on \frac{\bar{G}_t(a)}{f_t(a)}, the value of f_t(a^*) necessary for keeping \frac{\partial f_t(a^*)}{\partial t} = 0 is also higher than any other f_t(a):

\forall a \in A : f_t(a) \geq f_t(a^*) \implies \frac{\partial f_t(a)}{\partial t} < \frac{\partial f_t(a^*)}{\partial t}    (20)
Notice that the maximum update that f_t(a^*) may receive is obtained when a_t = a^* (centering the bell at a^*); then, if f_t(a^*) reaches the value \frac{1}{\lambda \sqrt{2\pi}}, its first derivative will not be higher than 0, as (21) shows, since \beta takes values in [0, 1].
f_{t+1}(a^*) = \gamma_t \left[ f_t(a^*) + \beta_t(a_t)\, \alpha\, e^{-\frac{1}{2} \left( \frac{a^* - a_t}{\lambda} \right)^2} \right] \leq \frac{1}{1 + \alpha \lambda \sqrt{2\pi}} \left[ \frac{1}{\lambda \sqrt{2\pi}} + \alpha \right] = \frac{1}{1 + \alpha \lambda \sqrt{2\pi}} \cdot \frac{1 + \alpha \lambda \sqrt{2\pi}}{\lambda \sqrt{2\pi}} = \frac{1}{\lambda \sqrt{2\pi}}    (21)
Then the equilibrium point of f_t(a^*) has the upper bound \frac{1}{\lambda \sqrt{2\pi}}. Notice that the closer \beta_t(a^*) is to 1, the closer the equilibrium point of f_t(a^*) is to its upper bound.
\frac{\partial f_t(a^*)}{\partial t} \begin{cases} < 0 & f_t(a^*) > \frac{1}{\lambda \sqrt{2\pi}} \\ = 0 & f_t(a^*) = \frac{1}{\lambda \sqrt{2\pi}} \\ > 0 & f_t(a^*) < \frac{1}{\lambda \sqrt{2\pi}} \end{cases}    (22)
We can conclude from (20) and (22) that the highest value of f will be achieved at a^*, as shown in (23), which has the upper bound \frac{1}{\lambda \sqrt{2\pi}}.
\forall a \in A \setminus \{a^*\} : \lim_{t \to \infty} f_t(a) < f_t(a^*) \ \wedge\ \lim_{t \to \infty} f_t(a^*) \leq \frac{1}{\lambda \sqrt{2\pi}}    (23)
Finally,

\lim_{\lambda \to 0} \lim_{t \to \infty} f_t(a^*) = \infty \ \wedge\ \forall a \neq a^* : \lim_{\lambda \to 0} \lim_{t \to \infty} f_t(a) = 0    (24)
This analysis has been developed under really restrictive assumptions, such as t \to \infty, \lambda \to 0, \alpha small enough (the bigger \alpha, the bigger the first derivative of the probability law through time, allowing fast convergence, but also the bigger the difference between the actual probability law and its expected value), and a reward function that is noiseless enough to assure (19).
The best solution for the problem stated above about the constraints on \lambda is to start with a wide enough bell, allowing enough exploration, and to make it thinner as the learner approaches the optimum, in order to meet (24). A good measure of convergence is the standard deviation of the recently selected actions. When the standard deviation of the actions is close to the \lambda being used to update the probability density function, then the maximum value of f_t(a^*) has been reached, as stated in (23).
Since f_0 is the uniform density function, the standard deviation of the actions should start at \sigma_0 = \frac{\max(A) - \min(A)}{\sqrt{12}}. We propose to use expression (25) for the convergence value (conv_t) of the method given the standard deviation \sigma_t of the actions. Then, (26) can be used as the \lambda needed in equation (8) at each time-step t, improving the learning process of the automaton.

conv_t = 1 - \frac{\sqrt{12}\, \sigma_t}{\max(A) - \min(A)}    (25)

\lambda_t = \lambda (1 - conv_t)    (26)
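A small helper showing how (25) and (26) could drive \lambda during learning; this is a sketch only, where the sliding window of recent actions, the clamping of conv_t to [0, 1] and the function name are assumptions not taken from the paper.

import math
import statistics

def adapted_lambda(recent_actions, lam0, a_min, a_max):
    # sigma_t: standard deviation of the recently selected actions
    sigma_t = statistics.pstdev(recent_actions)
    # equation (25); clamped to [0, 1] as an added safeguard
    conv_t = 1.0 - math.sqrt(12.0) * sigma_t / (a_max - a_min)
    conv_t = min(max(conv_t, 0.0), 1.0)
    # equation (26): shrink the bell as convergence increases
    return lam0 * (1.0 - conv_t)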
3 EXPERIMENTAL RESULTS
In order to validate these ideas, the standard method will be tested against the new proposal in two kinds of scenarios: noiseless and noisy. All examples will be introduced in characteristic function form (von Neumann and Morgenstern, 1944). Formally, a characteristic function form game is given as a pair (N, v), where N denotes a set of players and v : S^N \to \mathbb{R} is the characteristic function, with S being the action space.
3.1 Noiseless Scenarios
Three examples will be introduced in this subsection: \langle \{la\}, \beta_i \rangle, \langle \{la\}, \beta_{ii} \rangle and \langle \{la\}, \beta_{iii} \rangle, where la is a learning automaton. Their analytical expressions are presented in (27), (28) and (29) respectively. Figure 1 shows them graphically. The union operator is defined as a \cup b = a + b - ab and the bell function as Bell(a, a_0, \sigma) = e^{-\frac{1}{2} \left( \frac{a - a_0}{\sigma} \right)^2}.

\beta_i(a_t) = Bell(a_t, 0.5, 0.2)    (27)

\beta_{ii}(a_t) = 0.8\, Bell(a_t, 0.2, 1)    (28)

\beta_{iii}(a_t) = \left( 0.9\, Bell(a_t, 0.2, 0.4) \right) \cup Bell(a_t, 0.9, 0.3)    (29)
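As an aid to reproducing the noiseless scenarios, (27)-(29) can be transcribed directly using the Bell and union definitions above; this listing is only an illustrative transcription, not the authors' code.

import math

def bell(a, a0, sigma):
    # Bell(a, a0, sigma) = exp(-0.5 * ((a - a0) / sigma)^2)
    return math.exp(-0.5 * ((a - a0) / sigma) ** 2)

def union(x, y):
    # union operator from the text: x u y = x + y - x*y
    return x + y - x * y

def beta_i(a):    # equation (27)
    return bell(a, 0.5, 0.2)

def beta_ii(a):   # equation (28)
    return 0.8 * bell(a, 0.2, 1.0)

def beta_iii(a):  # equation (29)
    return union(0.9 * bell(a, 0.2, 0.4), bell(a, 0.9, 0.3))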
Figure 1: Characteristic functions for scenarios i, ii and iii.
Figure 2 shows the average reward obtained over time. The selected learning rate was 0.1 and \lambda_0 = 0.2. The gray curve shows the rewards collected with the standard method and the black one shows the same but using the convergence measure to tune the bell through time. It is clear that the results obtained with the improvement show better convergence properties. These differences are most marked for the first scenario, which has a function that is very easy to learn. The differences for the other two scenarios are not as big.
Figure 2: Average rewards for the noiseless scenarios (average reward vs. iteration).
3.2 Noisy Scenarios
We can add some random noise to the previous formulations, as (30), (31) and (32) show. Figure 3 plots these functions.

\beta_{i'}(a_t) = 0.8\, \beta_i(a_t) + rand(0.2)    (30)

\beta_{ii'}(a_t) = 0.875\, \beta_{ii}(a_t) + rand(0.2)    (31)

\beta_{iii'}(a_t) = 0.8\, \beta_{iii}(a_t) + rand(0.2)    (32)
Figure 3: Reward functions for scenarios i', ii' and iii'.
Figure 4 shows the average rewards collected over time. The same parameter setting was used here. The results obtained here are similar to the ones of the previous subsection.
Table 1 sums up the results of 100 runs of the algorithms for the above-mentioned examples. Better results are observed for the new learner, since the improved method reduces \lambda through the learning process.
Figure 4: Average rewards for the noisy scenarios (average reward vs. iteration).
In the case of environments with a high noise level, \lambda cannot be reduced that much, and both methods give similar results.
Table 1: Average long-run reward.

                     standard   improved
  noiseless   i        0.62       0.96
              ii       0.85       0.86
              iii      0.80       0.88
  noisy       i'       0.60       0.87
              ii'      0.74       0.76
              iii'     0.74       0.75
A final remark. Notice that the standard automaton did not show good convergence properties for any of the scenarios introduced in this paper. Despite the new derivation of the method, which reduces \lambda as the learner reaches better convergence levels, for more difficult functions such as \langle \{la\}, \beta_{ii} \rangle and \langle \{la\}, \beta_{iii} \rangle (where the signals received for actions around the optimum are quite similar, or there are multiple optima) the learner does not converge to the optimum as fast as necessary. Future work should be focussed on amplifying the learner's perception of the signals to allow a more accurate convergence to the optimum.
4 CONCLUSIONS
Learning automata are reinforcement learners, be-
longing to the category of policy iterators, that ex-
hibit nice convergence properties in discrete action
settings. In this paper an improvement of the performance of the method was proposed in order to avoid unnecessary numerical integration, speeding up the calculations, as well as a proof of local convergence and a way to adjust the \lambda parameter during learning to speed up the learning itself.
In future work we want to investigate the conver-
gence of these LA in multi-agent settings. It has been
shown that a set of agents, each independently applying an LA update scheme, can converge to a Nash equilibrium in discrete action games. We
will study if this convergence result can be extended
to continuous action games.
ACKNOWLEDGEMENTS
This paper was developed under the Cuba-Flanders
collaboration within the VLIR project.
REFERENCES
Bush, R. and Mosteller, F. (1955). Stochastic Models for
Learning. Wiley.
Hilgard, E. (1948). Theories of Learning. New York:
Appleton-Century-Crofts.
Hilgard, E. and Bower, G. (1966). Theories of Learning.
New Jersey: Prentice Hall.
Howell, M. N., Frost, G. P., Gordon, T. J., and Wu, Q. H.
(1997). Continuous action reinforcement learning ap-
plied to vehicle suspension control. Mechatronics,
7(3):263 – 276.
Parzen, E. (1960). Modern Probability Theory and Its Applications. Wiley-Interscience, Wiley Classics edition.
Thathachar, M. A. L. and Sastry, P. S. (2004). Networks of
Learning Automata: Techniques for Online Stochastic
Optimization. Kluwer Academic Publishers.
Tsetlin, M. (1961). The behavior of finite automata in ran-
dom media. Avtomatika i Telemekhanika, pages 1345–
1354.
Tsetlin, M. (1962). The behavior of finite automata in ran-
dom media. Avtomatika i Telemekhanika, pages 1210–
1219.
Tsypkin, Y. Z. (1971). Adaptation and learning in auto-
matic systems. New York: Academic Press.
Tsypkin, Y. Z. (1973). Foundations of the theory of learning
systems. New York: Academic Press.
von Neumann, J. and Morgenstern, O. (1944). Theory of
games and economic behavior.