point of psychologists and biologists (Hilgard, 1948;
Hilgard and Bower, 1966).
The engineering research on LA started in the
early 1960’s (Tsetlin, 1961; Tsetlin, 1962). Tsetlin
and his colleagues formulated the objective of learn-
ing as an optimization of some mathematical perfor-
mance index (Thathachar and Sastry, 2004; Tsypkin,
1971; Tsypkin, 1973):
J(A) = \int_{A} R(a, A) \, dP    (1)
where R(a,A) is a function of an observation vector
a with A the space of all possible actions. The perfor-
mance index J is the expectation of R with respect to
the distribution P. This distribution includes random-
ness in a and randomness in R.
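Since only noise-corrupted values of R are observable, the performance index J in (1) is typically approximated by averaging many such observations. A minimal Monte Carlo sketch follows; the function names, the sampled distribution and the sample size are purely illustrative and not taken from the cited texts.

```python
import random

def estimate_J(R, sample_a, n=10000, seed=0):
    """Monte Carlo estimate of the performance index (1):
    J is the expectation of R with respect to the distribution P,
    approximated by averaging noisy values R(a) over samples a ~ P.
    (estimate_J, sample_a and n are illustrative names.)
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = sample_a(rng)   # draw an observation a according to P
        total += R(a)       # noise-corrupted performance value
    return total / n
```

For example, with R(a) = a^2 and a drawn from a standard normal distribution, the estimate approaches E[a^2] = 1.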
The model is well presented by the follow-
ing example introduced by Thathachar and Sastry
(Thathachar and Sastry, 2004). Consider a student
and a teacher. The student is posed a question and
is given several alternative answers. The student
can pick one of the alternatives, following which the
teacher responds yes or no. This response is proba-
bilistic – the teacher may say yes for wrong alterna-
tives and vice versa. The student is expected to learn
the correct alternative through such repeated interac-
tions. While the problem is ill-posed with this gener-
ality, it becomes solvable with an added condition. It
is assumed that the probability of the teacher saying
‘yes’ is maximum for the correct alternative.
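The student-teacher interaction can be sketched as a finite-action learning automaton. The update below uses the classic linear reward-inaction (L_RI) rule, an illustrative choice on our part rather than a scheme taken from the cited texts.

```python
import random

def learn_answer(success_prob, steps=20000, lr=0.01, seed=0):
    """Sketch of the student-teacher example as a finite-action
    learning automaton with the linear reward-inaction (L_RI) rule
    (illustrative choice, not from the cited texts).

    success_prob[i] is the probability that the teacher says 'yes'
    for alternative i; the correct alternative is the one for which
    this probability is maximal.
    """
    rng = random.Random(seed)
    n = len(success_prob)
    p = [1.0 / n] * n                # all alternatives equally likely at first
    for _ in range(steps):
        # Pick an alternative according to the current probabilities.
        i = rng.choices(range(n), weights=p)[0]
        # Noisy feedback: the teacher may say 'yes' even for a wrong answer.
        if rng.random() < success_prob[i]:
            # Reward: move probability mass toward the chosen alternative.
            p = [(1 - lr) * q for q in p]
            p[i] += lr
        # On 'no' (penalty), L_RI leaves p unchanged: reward-inaction.
    return p
```

With repeated interactions, the probability vector concentrates on the alternative whose 'yes' probability is maximal, which is exactly the added condition that makes the problem solvable.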
All in all, LA are useful in applications that in-
volve optimization of a function which is not com-
pletely known in the sense that only noise corrupted
values of the function for any specific values of argu-
ments are observable (Thathachar and Sastry, 2004).
Some standard implementations are introduced below.
2.1 Learning Automata Implementations
The first implementation we would like to refer to is the Continuous Action Learning Automata (CALA) introduced by Thathachar and Sastry (Thathachar and Sastry, 2004). The authors implemented P as the normal probability distribution with mean \mu_t and standard deviation \sigma_t. At every time step t an action is selected according to a normal distribution N(\mu_t, \sigma_t). Then, after exploring action a_t and observing signal \beta_t(a_t), the update rules (2) and (3) are applied, resulting in new values \mu_{t+1} and \sigma_{t+1}.
\mu_{t+1} = \mu_t + \lambda \frac{\beta_t(a_t) - \beta_t(\mu_t)}{\max(\sigma_t, \sigma_L)} \cdot \frac{a_t - \mu_t}{\max(\sigma_t, \sigma_L)}    (2)

\sigma_{t+1} = \sigma_t + \lambda \frac{\beta_t(a_t) - \beta_t(\mu_t)}{\max(\sigma_t, \sigma_L)} \left[ \left( \frac{a_t - \mu_t}{\max(\sigma_t, \sigma_L)} \right)^2 - 1 \right] - \lambda K (\sigma_t - \sigma_L)    (3)
where \lambda is the learning parameter controlling the step size (0 < \lambda < 1), K is a large positive constant and \sigma_L is a lower bound on \sigma. The authors also provided a convergence proof for this automaton and tested it in games with multiple learners.
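Assuming a scalar action and a reward signal \beta that can be evaluated at both a_t and \mu_t, rules (2) and (3) can be sketched as follows. The parameter values and the projection of \sigma back to its lower bound are illustrative practical choices, not prescribed by the cited text.

```python
import random

def cala_step(mu, sigma, beta, rng, lam=0.1, K=1.0, sigma_L=0.05):
    """One step of a Continuous Action Learning Automaton (CALA)
    following update rules (2) and (3); lam, K and sigma_L are the
    learning parameter, the large positive constant and the lower
    bound on sigma (values here are illustrative).
    """
    phi = max(sigma, sigma_L)            # max(sigma_t, sigma_L)
    a = rng.gauss(mu, phi)               # select action a_t ~ N(mu_t, .)
    # Note: the rule needs the environment's response for BOTH the
    # explored action a_t and the current mean mu_t.
    delta = (beta(a) - beta(mu)) / phi   # (beta_t(a_t) - beta_t(mu_t)) / max(...)
    z = (a - mu) / phi                   # (a_t - mu_t) / max(sigma_t, sigma_L)
    mu_new = mu + lam * delta * z                                        # rule (2)
    sigma_new = sigma + lam * delta * (z * z - 1.0) \
                - lam * K * (sigma - sigma_L)                            # rule (3)
    # Practical safeguard (not part of the rules): keep sigma >= sigma_L.
    return mu_new, max(sigma_new, sigma_L)
```

Iterating this step on a reward with a single maximum drives \mu_t toward the maximizing action while \sigma_t shrinks toward \sigma_L.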
Notice that this first formulation, presented in expressions (2) and (3), works with a parametric Probability Density Function (PDF), so it is simple and fast to incorporate the signal into the knowledge of the automaton. Thathachar and Sastry (Thathachar and Sastry, 2004) introduced several examples of how to manage a game of multiple automata, meaning that these automata can be used for controlling multiple variables in a MAS. Notice, however, that the update rule needs information about the response of the environment not only for the selected action a_t but also for the action corresponding to the mean of the probability distribution, \mu_t. In most practical engineering problems it is impossible to explore both actions. Additionally, the convergence proof assumes that the function to optimize is integrable, and the minimal achievable standard deviation \sigma_L is very sensitive to noise: the stronger the noise, the higher the lower bound \sigma_L. These constraints are really restrictive for practical applications.
The second implementation we would like to recall is the Continuous Action Reinforcement Learning Automata (CARLA) (Howell et al., 1997). The authors implemented P as a PDF as well, but a nonparametric one this time. Starting with the uniform distribution over the whole action space A, after exploring action a_t \in A in time step t the PDF is updated as (4) shows.

f_{t+1}(a) = \begin{cases} \gamma_t \left( f_t(a) + \beta_t(a_t) \, \alpha \, e^{-\frac{1}{2} \left( \frac{a - a_t}{\lambda} \right)^2} \right) & a \in A \\ 0 & a \notin A \end{cases}    (4)

Here \alpha and \lambda control the height and width of the Gaussian reinforcement, and \gamma_t is the normalization factor that keeps f_{t+1} a valid PDF.
This second formulation (4) avoids the unnecessary exploration, and the function to optimize is not required to be integrable, just not chaotic. The problem is that it controls the action-selection strategy of the automaton with a nonparametric PDF, so it becomes computationally very expensive. The solution is to approximate the function numerically, but even so, some heavy numerical calculations are necessary for \gamma_t. No convergence proof is given either.
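A minimal sketch of update (4) on a discretized action space makes the cost of computing \gamma_t explicit. The uniform grid, and the values chosen for \alpha and \lambda, are illustrative assumptions.

```python
import math

def carla_update(x, f, a_t, beta_t, alpha=0.3, lam=0.05):
    """One CARLA update of the action PDF, following expression (4),
    on a uniform grid x discretizing the action space A.
    alpha and lam (height and width of the Gaussian reinforcement)
    are illustrative values.
    """
    dx = x[1] - x[0]
    # Add a Gaussian-shaped reinforcement centred on the explored action a_t.
    f_new = [fi + beta_t * alpha * math.exp(-0.5 * ((xi - a_t) / lam) ** 2)
             for xi, fi in zip(x, f)]
    # gamma_t renormalizes f so it remains a valid PDF; this numerical
    # integration over the whole grid is the heavy step noted above.
    gamma_t = 1.0 / (sum(f_new) * dx)
    return [gamma_t * fi for fi in f_new]
```

Every update touches (and re-integrates over) the entire grid, which is why the nonparametric representation is expensive compared with updating the two parameters of the CALA.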
If the computational cost of this method could be decreased and a convergence proof provided, then the CALA introduced by Thathachar and Sastry could be replaced by the CARLA, providing a better way of solving practical problems with a MAS approach.
ICAART 2011 - 3rd International Conference on Agents and Artificial Intelligence