2.2 LCC Synthesis
Once the control scheme has been stated, the next
step is to define a methodology to compute the
functions (f_i, D_i). Here a policy gradient reinforcement
learning (PGRL) algorithm is employed. PGRL
methods (see (Williams, 1992), (Sutton et al., 2000))
are based on measuring the performance of
a parameterized policy π_w applied as the control function
during a delimited amount of time. In order to measure
the performance of the controller, the following
function is defined,

    V(w) := J(x, π_w(x))    (3)

where the performance V of the parameters w is
measured through the cost function J.
By restricting the scope of the policy to a certain
class of parameterized functions u = π_w(x), the performance
measure (3) becomes a surface whose maximum
corresponds to the optimal set of parameters
w ∈ R^d. The search for the maximum can be
performed by standard gradient ascent techniques,

    w_{k+1} = w_k + η∇_w V(w)    (4)
where η is the step size and ∇_w V(w) is the gradient
of V(w) with respect to w. The analytical formulation
of this gradient is not possible without a mathematical
model of the robot, and its numerical computation is
not straightforward either; therefore, a stochastic
approximation algorithm is employed: the 'weight
perturbation' method (Jabri and Flower, 1992), which
estimates the unknown gradient using a Gaussian random
vector to orient the change in the parameter vector.
This algorithm is selected for its good performance,
easy derivation and fast implementation;
note that the focus of this research is not the choice
of a specific algorithm, nor the development of one,
but rather the cognitive architecture that provides robots
with the coordination learning ability.
This algorithm uses the fact that, by adding to w
a small Gaussian random term z with E{z_i} = 0 and
E{z_i z_j} = σ²δ_ij, the following expression is a sample
of the desired gradient,

    α(w) = [J(w + z) − J(w)]·z    (5)
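As a sketch of how (4) and (5) combine, the following minimal Python example runs weight-perturbation gradient ascent on a hypothetical quadratic performance surface; the cost function, step sizes and seed are illustrative assumptions, not values from this work:

```python
import numpy as np

def weight_perturbation_step(J, w, sigma=0.1, eta=0.5, rng=np.random.default_rng(0)):
    """One gradient-ascent update (4) using the gradient sample (5)."""
    z = rng.normal(0.0, sigma, size=w.shape)   # E{z_i} = 0, E{z_i z_j} = sigma^2 delta_ij
    alpha = (J(w + z) - J(w)) * z              # sample of the gradient, eq. (5)
    return w + eta * alpha                     # update strategy, eq. (4)

# Hypothetical performance surface with its maximum at w* = (1, -2)
J = lambda w: -np.sum((w - np.array([1.0, -2.0])) ** 2)

w = np.zeros(2)
for _ in range(3000):
    w = weight_perturbation_step(J, w)
# w is now close to (1, -2)
```

Note that, in expectation, α(w) is proportional to ∇_w V(w) with a factor σ², which in practice is absorbed into the step size η.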
Then, both layers’ controllers can be found using this
PGRL algorithm.

    u_i = D_{k_i}(e_i)    (6)
    v_i = f_{w_i}(u_1, ..., u_n)
In order to be consistent with the proposed definition
of each layer, the training of the parameter vectors
must be processed independently: first the dynamical
layer and then the coordination one. It is assumed
that the movement has to take place within a limited
amount of time T, during which signals are collected to
compute the cost function (3); then the gradient is estimated
using (5) and a new set of parameters is obtained by
applying the update strategy of (4). Notice that, in the
case of the first layer, each function D_{k_i} is trained and
updated separately from the others in the same layer,
but when learning the coordination layer, all the functions
f_{w_i} must be updated at the same time, because
in this case the performance being measured is that of
the whole layer.
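The two-phase training just described can be sketched as follows; the per-joint and layer-wide cost functions are hypothetical stand-ins for the episode cost (3), and the parameter dimensions are arbitrary:

```python
import numpy as np

def train(J, w, steps, sigma=0.1, eta=0.5, seed=0):
    """Weight-perturbation training of one parameter vector, eqs. (3)-(5)."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        z = rng.normal(0.0, sigma, size=w.shape)
        w = w + eta * (J(w + z) - J(w)) * z    # gradient sample (5), update (4)
    return w

# Phase 1: dynamical layer -- each D_{k_i} is trained separately
# (hypothetical per-joint costs, one 3-dimensional gain vector per joint).
J_joints = [lambda k: -np.sum((k - 1.0) ** 2),
            lambda k: -np.sum((k + 1.0) ** 2)]
k_params = [train(J, np.zeros(3), steps=2000) for J in J_joints]

# Phase 2: coordination layer -- all f_{w_i} share one layer-wide cost
# and are updated at the same time, as a single parameter vector.
J_layer = lambda W: -np.sum((W.reshape(2, 2) - 0.5) ** 2)
W_flat = train(J_layer, np.zeros(4), steps=2000)
```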
3 LINEAR IMPLEMENTATION
The previous description of the LCC is too general
to be of practical use; it is therefore necessary to make
some assumptions about the type of functions to be
used as controllers. A well-known structure to control
the dynamic position of one link is the Proportional
Integral Derivative (PID) error compensator. It has
the following form,

    u_i = K_{P_i}·e_i + K_{D_i}·(de_i/dt) + K_{I_i}·∫e_i dt    (7)
The functionality of its three terms (K_P, K_D, K_I)
offers management of both the transient and steady-state
responses; therefore, it is a generic and efficient
solution to real-world control problems. Through the use
of this structure, a link between optimal control and
PID compensation is revealed for robotic applications.
Other examples of optimization-based techniques for
tuning a PID are (Daley and Liu, 1999; Koszalka
et al., 2006).
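A discrete-time version of the compensator (7) can be sketched as follows; the sampling period and gains are illustrative assumptions, and a practical implementation would also need anti-windup and derivative filtering:

```python
class PID:
    """Discrete-time PID error compensator, eq. (7)."""

    def __init__(self, kp, kd, ki, dt):
        self.kp, self.kd, self.ki, self.dt = kp, kd, ki, dt
        self.prev_e = 0.0   # last error, for the derivative term
        self.integ = 0.0    # accumulated integral of the error

    def __call__(self, e):
        self.integ += e * self.dt         # K_I term: running integral of e_i
        de = (e - self.prev_e) / self.dt  # K_D term: finite-difference de_i/dt
        self.prev_e = e
        return self.kp * e + self.kd * de + self.ki * self.integ

pid = PID(kp=2.0, kd=0.1, ki=0.5, dt=0.01)   # hypothetical gains
u = pid(1.0)                                 # control output for an error of 1
```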
The purpose of the second layer is to compute the
actual velocity to be applied to each motor by gathering
information about the state of the robot while
processing a DT. The PID output signals are collected
and filtered by this function in order to coordinate them.
Perhaps the simplest structure to manage this coordination
is a gain row vector W_i, letting the functions in
the second layer be the following,

    f_{w_i}(u) = W_i·u

Then, a linear combination of u commands the coordination.
The matrix W_{DT_m} encapsulates all the information
about this layer for the mth DT.
    W_{DT_m} = [ w_11 ... w_1n ]
               [   :  ...   :  ]
               [ w_n1 ... w_nn ]    (8)

where the term w_ij is the specific weight of the jth
PID in the velocity computation of joint i.
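With this choice, evaluating the whole second layer for one DT reduces to a single matrix-vector product, as in the following sketch (the 2-joint gain values are illustrative assumptions):

```python
import numpy as np

# Hypothetical coordination matrix W_{DT_m} for n = 2 joints: row i holds
# the weights w_ij of every PID output j in joint i's velocity command.
W_DT = np.array([[ 1.0, 0.2],
                 [-0.3, 1.0]])

u = np.array([0.5, -1.0])   # first-layer PID outputs u_1, u_2
v = W_DT @ u                # v_i = f_{w_i}(u) = W_i · u for all joints at once
```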
ICINCO 2007 - International Conference on Informatics in Control, Automation and Robotics