BEHAVIOUR NAVIGATION LEARNING USING FACL
ALGORITHM
Abdelkarim Souissi and Hacene Rezine
EMP, Bordj El Bahri, Algiers, Algeria
Keywords: Mobile Robot Navigation, Reactive Navigation, Fuzzy Control, Reinforcement learning, Fuzzy Actor Critic
Learning.
Abstract: In this article, we are interested in learning reactive navigation behaviours for a mobile robot in an
unknown environment. The method we suggest ensures navigation in unknown environments containing
obstacles of different shapes; it consists in bringing the robot to a goal position while avoiding obstacles and
releasing it from tight corners and deadlock-shaped obstacles. In this framework, we use the reinforcement
learning algorithm called Fuzzy Actor-Critic Learning, based on the temporal-difference prediction method.
The application was tested on our experimental Pioneer II platform.
1 INTRODUCTION
In this article, we propose a reinforcement learning
method in which the apprentice actively explores its
environment. It applies various actions in order to
discover the states that cause the emission of rewards
and punishments. The agent must find the action to
carry out when it is in a given situation; it must learn
how to choose the optimal actions to achieve the fixed
goal. The environment can punish or reward the
system according to the applied actions. Each time
the agent applies an action, a critic gives it a reward
or a penalty to indicate whether the resulting state is
desirable or not (Sutton, 1998), (Glorennec, 2000).
The task of the agent is to learn, using these rewards,
the sequence of actions that obtains the greatest
cumulative reward.
Mobile robotics is a privileged application field
of reinforcement learning (Fujii, 1998), (Smart, 2002),
(Babvey, 2003). This is related to the growing
importance, over the last few years, of autonomous
robotics without prior knowledge of the environment.
The goal is then to regard a behaviour as a mapping
from sensors to effectors. Learning in robotics
consists in automatically modifying the behaviour of
the robot to improve its performance in its
environment. The behaviours are synthesized starting
from a simple definition of the objectives through a
reinforcement function.
The considered approach to robot navigation,
which uses a fuzzy inference system as the
apprentice, is able to tolerate certain errors in the
information about the system. For example, with
fuzzy logic we can process vague data. The
perception of the environment by ultrasonic sensors
and reinforcement learning thus prove to be
particularly well adapted to each other (Beom, 1995),
(Fukuda, 1995), (Jouffe, 1997), (Faria, 2000).
It is very difficult to determine correct
conclusions manually in a FIS with a large rule base
so as to ensure the release from tight corners and
deadlock-shaped obstacles, even when a gradient
descent method or a potential-field technique is used,
because of the local-minimum problem. In such
situations the robot will be blocked.
Behaviours made up of a fusion of «goal
seeking» and «obstacle avoidance» are presented.
The method we suggest ensures navigation in
unknown environments containing obstacles of
different shapes; the behaviours are realised with a
FIS whose conclusions are determined by
reinforcement learning. The algorithms are written
using the Matlab software after having integrated, in
a Simulink block, the perception, localization and
motricity functions of the robot.
The application was tested on our experimental
Pioneer II platform.
2 FACL ALGORITHM
We have selected a zero-order Takagi-Sugeno FIS
as the apprentice because of its simplicity, its
universal approximator property, its generalization
capacity and its suitability for real-time applications.
The input variables have triangular and trapezoidal
membership functions.
The FIS thus consists of N rules of the following
form (Glorennec, 2000):

If situation then Y1 = v[i] and Y2 = u[i,1] with q[i,1],
                                Y2 = u[i,2] with q[i,2],
                                ...,
                                Y2 = u[i,J] with q[i,J]
In the FACL algorithm (Jouffe, 1997), each rule R_i
of the apprentice has:
- a conclusion v_i, used for the approximation of the
evaluation function V^π of the current policy,
initialized to zero (critic);
- a set U^i of discrete actions, identical for all rules;
- a vector of parameters q_i indicating the quality of
the various discrete actions available and intervening
in the definition of the current policy (actor).
The characteristics of the input-variable membership
functions (number, position) are fixed. The number
of rules is also fixed. Thus, the only modifiable
characteristics of the apprentice are the conclusions
v_i (critic) and the election of an action u[i,j] among
the J available actions (actor) in each rule R_i.
The FACL algorithm uses two types of learning:
temporal differences for training the critic's
predictions of reinforcement, and a competition
process between the available actions for the actor
(Jouffe, 1997).
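To make this structure concrete, a minimal Python sketch of such an apprentice is given below; the class and attribute names are ours, and the membership functions are left abstract since their exact parameters are problem-dependent.

import numpy as np

class FACLRuleBase:
    """Sketch of the apprentice: a zero-order Takagi-Sugeno FIS whose rules
    each hold a critic conclusion v[i] and action qualities q[i, j]."""

    def __init__(self, n_rules, actions):
        self.actions = np.asarray(actions, dtype=float)  # discrete action set U^i, shared by all rules
        self.v = np.zeros(n_rules)                       # critic conclusions v[i], initialized to zero
        self.q = np.zeros((n_rules, len(actions)))       # action qualities q[i, j] (actor)

    def truth_values(self, state):
        """Return the vector Phi_t of rule truth values for the current state.
        The membership functions are fixed and problem-dependent; only the
        interface is declared here."""
        raise NotImplementedError

    def critic_value(self, phi):
        # V(S_t) = Phi_t^T . v_t, see equation (1) below
        return float(phi @ self.v)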
2.1 Critic
The role of the critic is to approximate the
evaluation function, which constitutes a better
criterion for the choice of the actions than that
represented by the primary reinforcements. The
critic value in the state S_t is inferred from the
conclusions vector v_t:

V_t(S_t) = \sum_{R_i \in A_t} \alpha_{R_i}(S_t) \, v_t^i = \Phi_t^T \cdot v_t        (1)

where Φ_t represents the apprentice perception at
time step t (i.e. it contains the truth values of the
activated rules A_t for the state S_t).
The approximation error of the critic is given by
the temporal-difference (TD) error:

\tilde{\varepsilon}_{t+1} = r_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)        (2)
The critic uses this error to update the
conclusions vector v by a traditional stochastic
gradient descent:

v_{t+1} = v_t + \beta \, \tilde{\varepsilon}_{t+1} \, \nabla_v V_t(S_t) = v_t + \beta \, \tilde{\varepsilon}_{t+1} \, \Phi_t        (3)
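A compact sketch of one critic step, assuming the truth-value vectors are already computed; the parameter values shown are placeholders rather than the tuned values of the paper.

import numpy as np

def critic_step(v, phi_t, phi_t1, r_t1, gamma=0.9, beta=0.1):
    """One critic update following equations (1)-(3).
    v      : vector of rule conclusions v_t
    phi_t  : truth values Phi_t of the rules for state S_t
    phi_t1 : truth values Phi_{t+1} for the next state S_{t+1}
    r_t1   : primary reinforcement r_{t+1}"""
    V_t = phi_t @ v                          # V_t(S_t), eq. (1)
    V_t1 = phi_t1 @ v                        # V_t(S_{t+1})
    td_error = r_t1 + gamma * V_t1 - V_t     # TD error, eq. (2)
    v_new = v + beta * td_error * phi_t      # gradient step, eq. (3)
    return v_new, td_error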
2.2 Actor
Concerning the actor, local actions are elected
for each activated rule on the basis of the quality of
these actions, and also according to an exploration
policy which we will describe further on.
The global action for the state S_t is then
determined by the inference of these locally elected
actions:

U_t(S_t) = \sum_{R_i \in A_t} Election_{U^i}(q_t^i) \, \alpha_{R_i}(S_t) = Election(q_t) \cdot \Phi_t^T        (4)
where Election is a function returning the action
elected for each activated rule.
The training of the optimal policy thus consists
in adjusting the vector of parameters q so that the
induced policy is improved.
Again, the TD error provides a measurement of
this quality improvement. We then obtain the
following rule for training the actor:
q_{t+1}^i(U_t^i) = q_t^i(U_t^i) + \tilde{\varepsilon}_{t+1} \, \alpha_{R_i}(S_t), \quad \forall R_i \in A_t        (5)
The above expression shows that a positive TD
error implies that the action that has just been applied
is preferable to the t-optimal action; it is then
necessary to increase the quality of the applied action.
Reciprocally, if the TD error is negative, it is
necessary to decrease the quality of the applied
action, because it led the system to a state whose
evaluation is lower than expected.
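The corresponding actor update of equation (5) can be sketched as follows; representing the elected actions as one index per rule is our own convention.

import numpy as np

def actor_update(q, elected, phi_t, td_error):
    """Quality update of equation (5): the quality of the action elected in
    each activated rule is moved in the direction of the TD error, weighted by
    the rule truth value. 'elected[i]' is the index of the action chosen in
    rule i (only meaningful for activated rules)."""
    q = q.copy()
    for i, alpha in enumerate(phi_t):
        if alpha > 0.0:                            # rule R_i is activated (R_i in A_t)
            q[i, elected[i]] += td_error * alpha   # eq. (5)
    return q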
2.3 Eligibility Traces
The traces implementation for the critic and the actor
follows directly from the incremental version of TD
learning. Let Φ̄_t be the trace of the critic at step t;
Φ̄_t is a short-term memory of the states visited by
the apprentice. This memory is based on the
apprentice perception, i.e. the truth values of the
rules:

\bar{\Phi}_t = \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \, \Phi_k = \gamma\lambda \, \bar{\Phi}_{t-1} + \Phi_t        (6)

where λ is the proximity factor of the critic.
An equivalent trace for the actor consists in
memorizing the actions applied in the states. We use
a short-term memory of the truth values of each rule
associated with each action available in these rules.
Let e_t^i(U^i) be the trace value of the action U^i in the
rule R_i at step t (Jouffe, 1997):

e_t^i(U^i) = \gamma\lambda' \, e_{t-1}^i(U^i) + \phi_t^i    if U^i = U_t^i
e_t^i(U^i) = \gamma\lambda' \, e_{t-1}^i(U^i)               else        (7)

where λ' is the proximity factor of the actor and φ_t^i
the truth value of the rule R_i at step t.
The updates of the parameters given by (3) and
(5) for all the rules and actions then become:

v_{t+1} = v_t + \beta \, \tilde{\varepsilon}_{t+1} \, \bar{\Phi}_t        (8)

q_{t+1} = q_t + \tilde{\varepsilon}_{t+1} \, e_t        (9)
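A possible implementation of the trace recursions (6) and (7), with the decay factors passed as placeholder arguments:

import numpy as np

def update_traces(phi_bar, e, phi_t, elected, gamma=0.9, lam_critic=0.9, lam_actor=0.9):
    """Eligibility-trace recursions of equations (6) and (7).
    phi_bar : critic trace (one value per rule)
    e       : actor trace of shape (n_rules, n_actions)
    elected : index of the action elected in each rule at step t"""
    phi_bar = gamma * lam_critic * phi_bar + phi_t   # eq. (6)
    e = gamma * lam_actor * e                        # decay all action traces, eq. (7)
    for i, alpha in enumerate(phi_t):
        if alpha > 0.0:
            e[i, elected[i]] += alpha                # reinforce the trace of the elected action
    return phi_bar, e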
2.4 Description of the FACL Execution
Procedure
The execution of one time step can be divided into
six principal stages. Let t+1 be the current time step;
the apprentice has applied the action U_t elected at
the previous time step and has received the primary
reinforcement r_{t+1} for the transition from the state
S_t to S_{t+1}. After the calculation of the truth values
of the rules, the six stages are as follows (Jouffe,
1997):
1- Calculation of the t-optimal evaluation
function of the current state by the critic:

V_t(S_{t+1}) = \Phi_{t+1}^T \cdot v_t        (10)
As proposed by Baird (Baird, 1995), to eliminate
the source of instability due to the approximation by
the FIS, we do not perform a gradient descent on the
Bellman residual defined by the TD error but on the
mean quadratic Bellman residual. The modification
to be made to the eligibility trace is then given by the
following relation:

\bar{\Phi}_t \leftarrow \bar{\Phi}_t - \gamma\rho \, \Phi_{t+1}, \quad \rho \in [0,1]        (11)
2- Calculation of the TD error:

\tilde{\varepsilon}_{t+1} = r_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)        (12)
3- Update of the learning rates corresponding to
the critic parameters for all rules, in three stages. In
order to accelerate learning and to avoid instability,
we adopt a heuristic adaptive learning rate (Jouffe,
1997). The heuristic is implemented by means of the
Delta-Bar-Delta rule:

- \delta_t^i = \tilde{\varepsilon}_{t+1} \, \bar{\phi}_t^i

- \beta_{t+1}^i = \beta_t^i + k              if \bar{\delta}_{t-1}^i \, \delta_t^i > 0
  \beta_{t+1}^i = (1 - \psi) \, \beta_t^i    if \bar{\delta}_{t-1}^i \, \delta_t^i < 0
  \beta_{t+1}^i = \beta_t^i                  else        (13)

- \bar{\delta}_t^i = (1 - \psi) \, \delta_t^i + \psi \, \bar{\delta}_{t-1}^i

where, in our case:
- β_t^i is the learning rate of the critic for the rule R_i
at time step t;
- δ_t^i represents the variation brought to the critic
parameter, in which we propose to integrate the
eligibility trace directly rather than simply the partial
derivative;
- δ̄_t^i represents the exponential average of these
variations.
This rule increases the learning rates linearly to
prevent them from becoming too large, and decreases
them exponentially so that they diminish quickly.
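A minimal sketch of this Delta-Bar-Delta adaptation, with k and ψ as illustrative values (the paper does not report the ones it used):

import numpy as np

def delta_bar_delta(beta, delta_bar, td_error, phi_bar, k=0.01, psi=0.1):
    """Adaptation of the per-rule critic learning rates following eq. (13)."""
    delta = td_error * phi_bar                        # current variation delta_t^i
    same_sign = delta_bar * delta
    beta = np.where(same_sign > 0, beta + k,          # linear increase
           np.where(same_sign < 0, (1 - psi) * beta,  # exponential decrease
                    beta))
    delta_bar = (1 - psi) * delta + psi * delta_bar   # exponential average of the variations
    return beta, delta_bar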
4- Training of the critic and the actor by updating
the vector v and the matrix q:

v_{t+1} = v_t + \beta_{t+1} \, \tilde{\varepsilon}_{t+1} \, \bar{\Phi}_t        (14)

q_{t+1} = q_t + \tilde{\varepsilon}_{t+1} \, e_t        (15)
5- It is again necessary to calculate the t-optimal
evaluation function of the current state by the critic,
but this time with the newly updated parameters:

V_{t+1}(S_{t+1}) = \Phi_{t+1}^T \cdot v_{t+1}        (16)

This value will be used for the TD error
calculation at the next time step.
6- There remains the choice of the action to be
applied in the state S_{t+1}. We consider the case of
continuous actions, where the global action is
inferred from the various actions elected in each rule:

U_{t+1}(S_{t+1}) = \sum_{R_i \in A_{t+1}} Election_{U^i}(q_{t+1}^i) \, \alpha_{R_i}(S_{t+1}), \quad U_{t+1}^i \in U^i

where Election is defined by:

Election_U(q_{t+1}^i) = \mathrm{ArgMax}_{U \in U^i} \left( q_{t+1}^i(U) + \eta(U) + \rho(U) \right)        (17)
The eligibility trace updates for the critic and
the actor are given by the two following formulas:

\bar{\Phi}_{t+1} = \gamma\lambda \, \bar{\Phi}_t + \Phi_{t+1}        (18)

e_{t+1}^i(U^i) = \gamma\lambda' \, e_t^i(U^i) + \phi_{t+1}^i    if U^i = U_{t+1}^i
e_{t+1}^i(U^i) = \gamma\lambda' \, e_t^i(U^i)                   else        (19)
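Putting the six stages together, one time step of the procedure might be organised as in the following self-contained sketch; the 'model' object and its attribute names (v, q, beta, delta_bar, phi_bar, e, lam, lam_actor, elect) are our own conventions, and the default γ and ρ simply reuse the values reported in section 4.1.

import numpy as np

def facl_step(model, phi_t, phi_t1, r_t1, gamma=0.9, rho=0.5, k=0.01, psi=0.1):
    """One FACL time step covering stages 1-6; k and psi are placeholders."""
    # 1- t-optimal evaluation of the new state with the old parameters, eq. (10)
    V_t1 = phi_t1 @ model.v
    V_t = phi_t @ model.v
    model.phi_bar -= gamma * rho * phi_t1          # Baird-style trace correction, eq. (11)
    # 2- TD error, eq. (12)
    td = r_t1 + gamma * V_t1 - V_t
    # 3- Delta-Bar-Delta adaptation of the per-rule critic rates, eq. (13)
    delta = td * model.phi_bar
    sign = model.delta_bar * delta
    model.beta = np.where(sign > 0, model.beta + k,
                 np.where(sign < 0, (1 - psi) * model.beta, model.beta))
    model.delta_bar = (1 - psi) * delta + psi * model.delta_bar
    # 4- Critic and actor updates, eqs. (14)-(15)
    model.v = model.v + model.beta * td * model.phi_bar
    model.q = model.q + td * model.e
    # 5- Re-evaluation of the new state with the updated parameters, eq. (16)
    V_next = phi_t1 @ model.v
    # 6- Election of the next local actions (eq. (17)) and trace updates, eqs. (18)-(19)
    elected = model.elect(phi_t1, model.q)
    model.phi_bar = gamma * model.lam * model.phi_bar + phi_t1
    model.e = gamma * model.lam_actor * model.e
    for i, alpha in enumerate(phi_t1):
        model.e[i, elected[i]] += alpha
    return elected, V_next, td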
2.5 Proposed Exploration /
Exploitation
The action election strategy that we used is a
combination of directed and random exploration
(Jouffe, 1997). The global election function, applied
to each rule, is then defined by:

Election_U(q) = \mathrm{ArgMax}_U \left( q(U) + \eta(U) + \rho(U) \right)        (20)

where U represents the set of discrete actions
available in each rule, q(U) the associated quality
vector, η(U) the random exploration term, and ρ(U)
the directed exploration term.
The term η is in fact a vector of random values.
It corresponds to a vector ψ of values sampled
according to an exponential law, scaled in order to
take into account the size of the qualities q:

s_f(U) = 1                                                                     if \mathrm{Max}(q(U)) = \mathrm{Min}(q(U))
s_f(U) = s_p \left( \mathrm{Max}(q(U)) - \mathrm{Min}(q(U)) \right) / \mathrm{Max}(\psi)    else        (21)

\eta(U) = s_f(U) \cdot \psi        (22)

where s_p represents the maximum size of the noise
relative to the amplitude of the qualities and s_f is the
corresponding normalisation factor. Used alone, this
exploration term allows a random choice of actions
when all qualities are identical and, otherwise, tests
only the actions whose quality satisfies:

q(U) \geq \mathrm{Max}(q(U)) - s_p \left( \mathrm{Max}(q(U)) - \mathrm{Min}(q(U)) \right)        (23)

This term thus allows us to avoid the choice of
bad actions (Jouffe, 1997).
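A sketch of this random exploration term for a single rule, assuming the qualities are held in a NumPy vector; s_p = 0.1 is the goal-seeking value reported later in section 4.1.

import numpy as np

def random_exploration(q_row, s_p=0.1, rng=np.random.default_rng()):
    """Random exploration term eta of equations (21)-(22) for one rule:
    exponential noise scaled to the spread of the action qualities."""
    psi = rng.exponential(scale=1.0, size=q_row.shape)   # vector of exponential samples
    spread = q_row.max() - q_row.min()
    if spread == 0.0:
        s_f = 1.0                                        # identical qualities: purely random choice
    else:
        s_f = s_p * spread / psi.max()                   # eq. (21)
    return s_f * psi                                     # eta(U) = s_f(U) * psi, eq. (22)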
The directed exploration term ρ allows testing
actions that have not been applied often. It is thus
necessary to memorize the number of times each
action has been elected. This term is defined in the
following way:

\rho(U) = \theta / e^{n_t(U)}        (24)

where θ represents a positive factor which balances
the directed exploration and n_t(U) is the number of
applications of the action U up to time step t. In the
case of continuous actions, n_t(U) corresponds to the
number of applications of the discrete action U in the
considered rule. Let U_t^i be the discrete action elected
at time step t in the rule R_i; the update of this variable
is determined by the following equation:

n_{t+1}^i(U_t^i) = n_t^i(U_t^i) + 1, \quad \forall R_i \in A_t        (25)
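Combining the two terms, the local election of equation (20) for one rule could be sketched as follows; the function name and the in-place counter update are our own conventions, and θ = 50, s_p = 0.1 are the goal-seeking values of section 4.1.

import numpy as np

def elect_action(q_row, n_row, theta=50.0, s_p=0.1, rng=np.random.default_rng()):
    """Local election combining quality q, random term eta (eqs. (21)-(22))
    and directed term rho = theta / exp(n) (eq. (24))."""
    psi = rng.exponential(size=q_row.shape)
    spread = q_row.max() - q_row.min()
    s_f = 1.0 if spread == 0.0 else s_p * spread / psi.max()
    eta = s_f * psi                                  # random exploration
    rho = theta / np.exp(n_row)                      # directed exploration, eq. (24)
    j = int(np.argmax(q_row + eta + rho))            # eq. (20)
    n_row[j] += 1                                    # eq. (25): count the election
    return j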
3 EXPERIMENTAL PLATFORM
The Pioneer P2-DX, with its modest size, lends itself
well to navigation in tight corners and encumbered
spaces such as the laboratories and small offices. It
has two driving wheels and a castor wheel.
Figure 1: Pioneer II Robot Photo.
To detect obstacles, the robot is equipped with a set
of ultrasonic sensors: eight sensors placed at its front
(Figure 2). The measurement range of these sensors
lies between 10 cm as a minimum and 5 m as a
maximum.
Figure 2: Position of the ultrasonic sensors used in the
robot Pioneer II.
The Saphira/Aria software (Konolige, 2002a),
(Konolige, 2002b) allows the control of the robot
(C/C++ programming). We have integrated the
Saphira functions (perception, localization and
motricity) into Simulink using the API (S-Function).
Thus we can benefit from the computing power of
Matlab and the simplicity of Simulink to test our
algorithms and to control the Pioneer II robot.
4 EXPERIMENTATION
The task of the robot consists in starting from an
initial point and reaching a fixed goal while avoiding
static obstacles of convex or concave type. It is
realised by the fusion of two elementary behaviours,
«goal seeking» and «obstacle avoidance».
4.1 The «goal seeking» Behaviour
For the «goal seeking» behaviour, we consider three
membership functions for the input θ_Rb (robot-goal
angle) and two membership functions for the input
ρ_b (robot-goal distance) (Figure 3). The knowledge
base consists of six fuzzy rules. The FIS controller
has two outputs for the actor (the rotation speed Vrot
and the translation speed Vit) and one output for the
critic, which is related to the evaluation function of
the action pair (Vrot, Vit).
Figure 3: Membership functions for θ_Rb (a) and ρ_b (b).
Because of the heuristic nature of the FACL
algorithm, we carried out several tests to determine
the parameter values that accelerate learning and give
good performances for the apprentice. After a series
of experiments, we found the following values:
θ = 50, λ = λ' = 0.9, ρ = 0.5, s_p = 0.1 and γ = 0.9.
The actions available in all rules are {-20°/s,
-10°/s, -5°/s, 0°/s, +5°/s, +10°/s, +20°/s} for the
rotation speed and {0 mm/s, 150 mm/s, 350 mm/s}
for the translation speed, which gives 21 possible
action pairs in each fuzzy rule.
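For illustration, these 21 action pairs are simply the Cartesian product of the two discrete sets (the constant names below are ours):

from itertools import product

# Discrete action sets used for the goal-seeking behaviour (section 4.1):
# 7 rotation speeds x 3 translation speeds = 21 (Vrot, Vit) pairs per fuzzy rule.
ROTATION_DEG_S = [-20, -10, -5, 0, 5, 10, 20]
TRANSLATION_MM_S = [0, 150, 350]

ACTIONS = list(product(ROTATION_DEG_S, TRANSLATION_MM_S))
assert len(ACTIONS) == 21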
The reinforcement function is defined as
follows, where Δθ_Rb denotes the variation of the
robot-goal angle between two time steps:
- If the robot is far from the goal, the
reinforcement is equal to:
  +1 if (θ_Rb · Δθ_Rb < 0) and Vit = 0
  +1 if (-1° < θ_Rb < +1°) and Vit ≠ 0
  0 if (θ_Rb · Δθ_Rb = 0) and Vit = 0
  -1 otherwise
- If the robot is close to the goal, the
reinforcement is equal to:
  +1 if (θ_Rb · Δθ_Rb < 0) and Vit = 0
  -1 otherwise
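Under our reading of this definition, the reinforcement could be coded as follows; the function and argument names are hypothetical.

def goal_seeking_reward(theta, d_theta, vit, near_goal, angle_tol_deg=1.0):
    """Reinforcement for the goal-seeking behaviour as reconstructed from
    section 4.1: theta is the robot-goal angle, d_theta its variation and vit
    the translation speed."""
    if near_goal:
        return 1 if (theta * d_theta < 0 and vit == 0) else -1
    if theta * d_theta < 0 and vit == 0:
        return 1
    if -angle_tol_deg < theta < angle_tol_deg and vit != 0:
        return 1
    if theta * d_theta == 0 and vit == 0:
        return 0
    return -1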
Figure 4 shows the trajectory of the robot
during the training and validation phases.
Figure 4: Trajectory of the robot during and after training.
4.2 The «obstacle avoidance»
Behaviour
For this behaviour, we have made the
translation speed of the robot proportional to the
distance to the frontal obstacles, with a maximum
value of 350 mm/s (Figure 5).
Figure 5: Translation speed Vit.
The inputs of the fuzzy controller for this behaviour
are the minimal distances provided by the four sets
of sonars {min(d90, d50), min(d30, d10),
min(g10, g30), min(g90, g50)}, with three
membership functions for each one (Figure 6).
Figure 6: Input membership functions for the fuzzy
controller Vrot; (a): {min(d90, d50), min(g90, g50)},
(b): {min(d10, d30), min(g10, g30)}.
The set of actions U, common to all rules,
consists of five actions {-20°/s, -10°/s, 0°/s, +10°/s,
+20°/s}.
The reinforcement function is defined as follows:
  +1 if min{min(d90, d50), min(d30, d10)} < min{min(g10, g30), min(g90, g50)} and Vrot > 0
  +1 if min{min(d90, d50), min(d30, d10)} > min{min(g10, g30), min(g90, g50)} and Vrot < 0
  -1 otherwise
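A direct transcription of this reinforcement, assuming d and g denote the right-hand and left-hand sonar groups respectively; the function name is ours.

def obstacle_avoidance_reward(d90, d50, d30, d10, g10, g30, g90, g50, vrot):
    """Reinforcement for the obstacle-avoidance behaviour (section 4.2):
    reward turning away from the side with the closer obstacle."""
    right = min(min(d90, d50), min(d30, d10))   # closest obstacle on the d side
    left = min(min(g10, g30), min(g90, g50))    # closest obstacle on the g side
    if right < left and vrot > 0:
        return 1
    if right > left and vrot < 0:
        return 1
    return -1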
Figure 7 shows the evolution of the robot
during the training phase.
Figure 8 shows the time evolution of the
robot; it shows that the robot is able to release itself
from tight corners and deadlock-shaped obstacles.
Figure 7: Trajectories of the robot during and after
training.
Figure 8: Time evolution of the robot after training.
4.3 Fusion of the Two Behaviours
The goal of the fusion of the two elementary
behaviours is to allow the robot to navigate in
environments composed of fixed obstacles of convex
or concave type and to reach a fixed goal while
ensuring its safety, which is a fundamental point in
reactive navigation.
The suggested solution consists in considering
the whole set of input variables of the two
behaviours, «goal seeking» and «obstacle
avoidance», associated with a distributed
reinforcement function obtained by means of a
weighting coefficient between the two behaviours
(0.7 for obstacle avoidance and 0.3 for goal seeking).
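The paper does not spell out the exact combination rule; a weighted sum of the two elementary reinforcements is one plausible reading of these coefficients, sketched below.

def combined_reward(r_avoid, r_goal, w_avoid=0.7, w_goal=0.3):
    """Distributed reinforcement for the fused behaviour (section 4.3):
    a weighted combination of the obstacle-avoidance and goal-seeking rewards,
    using the 0.7 / 0.3 weighting coefficients given in the text."""
    return w_avoid * r_avoid + w_goal * r_goal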
The fuzzy controller has six inputs: the
minimal distances provided by the four sets of sonars
{min(d90, d50), min(d30, d10), min(g10, g30),
min(g90, g50)}, with three membership functions for
the lateral sets and two membership functions for the
frontal sets, to which we add the input θ_Rb with three
membership functions and the input ρ_b with two
membership functions. The rule base
thus consists of 216 fuzzy rules. The translation
speed of the robot is proportional to the distance to
the frontal obstacles. The FACL algorithm
parameters are slightly modified as follows: θ = 20
and s_p = 0.9.
Figure 9 represents the type of trajectory
obtained during the training phase. The robot
manages to avoid the obstacles and to reach the
assigned goal in cluttered environments while
maintaining a reasonable lateral distance from the
obstacles. Figure 10 illustrates the satisfactory
behaviour of the robot after training. The various
trajectories obtained for the same environment and
the same starting and arrival points are due primarily
to problems in the perception of the environment by
the robot, related to the sonar reading phenomena;
these differences also confirm the effectiveness of
the training algorithm.
Figure 9: Trajectories of the robot in training phase.
Figure 10: Various trajectories of the robot after training.
Figure 11: Trajectory of the real robot after training.
Figure 11 illustrates the satisfactory behaviour
of the real robot, which moves in an unknown
environment and manages to reach the fixed goal.
5 CONCLUSION
The FACL algorithm makes it possible to introduce
generalization into the state and action spaces and to
adapt the conclusions of a zero-order Takagi-Sugeno
FIS incrementally, solely by means of the
interactions between the apprentice and its
environment. The reinforcement function constitutes
the measure of performance of the required
behaviour. In addition, the fusion of the behaviours
«goal seeking» and «obstacle avoidance» is
presented using a combined reinforcement function.
The simulation and experimentation results in
various environments are satisfactory.
REFERENCES
Babvey, S., Momtahan, O., Meybodi, M. R., 2003. "Multi
Mobile Robot Using Distributed Value Function
Reinforcement Learning", IEEE International
Conference on Robotics and Automation, September
2003.
Baird, L. C., 1995. "Residual Algorithms: Reinforcement
Learning with Function Approximation", Proceedings
of the Twelfth International Conference on Machine
Learning, 1995.
Beom, H. R., Cho, H. S., 1995. "A Sensor-Based
Navigation for a Mobile Robot Using Fuzzy Logic and
Reinforcement Learning", IEEE Transactions on
Systems, Man and Cybernetics, vol. 25, no. 3, pp. 464-
477, March 1995.
Faria, G., Romero, R. A. F., 2000. "Incorporating Fuzzy
Logic to Reinforcement Learning", IEEE, pp. 847-852,
Brazil, 2000.
Fujii, T., Arai, Y., Asama, H., Endo, I., 1998.
"Multilayered Reinforcement Learning for
Complicated Collision Avoidance Problems",
Proceedings of the IEEE International Conference on
Robotics and Automation, pp. 2186-2191, May 1998.
Fukuda, T., Hasegawa, Y., Shimojima, K., Saito, F., 1995.
"Reinforcement Learning Method for Generating
Fuzzy Controller", Department of Micro System
Engineering, Nagoya University, Japan, IEEE,
pp. 273-278, 1995.
Glorennec, P. Y., 2000. "Reinforcement Learning: An
Overview", INSA de Rennes, ESIT'2000, Aachen,
Germany, pp. 17-35, September 2000.
Jouffe, L., 1997. "Apprentissage de Systèmes d'Inférence
Floue par des Méthodes de Renforcement", Thèse de
Doctorat, IRISA, Université de Rennes I, June 1997.
Konolige, K. G., 2002a. "Saphira Référence", Edition
Doxygen, November 2002.
Konolige, K. G., 2002b. "Saphira Robot Control
Architecture", SRI International, April 2002.
Smart, W. D., Kaelbling, L. P., 2002. "Effective
Reinforcement Learning for Mobile Robots",
Proceedings of the IEEE International Conference on
Robotics and Automation, pp. 3404-3410, May 2002.
Sutton, R. S., Barto, A., 1998. "Reinforcement Learning:
An Introduction", MIT Press, Bradford Book, 1998.