BEHAVIOUR NAVIGATION LEARNING USING FACL
ALGORITHM
Abdelkarim Souissi and Hacene Rezine
EMP, Bordj El Bahri, Algiers, Algeria
Keywords: Mobile Robot Navigation, Reactive Navigation, Fuzzy Control, Reinforcement learning, Fuzzy Actor Critic
Learning.
Abstract: In this article, we are interested in learning reactive navigation behaviours for a mobile robot in an
unknown environment. The method we suggest ensures navigation in unknown environments containing
obstacles of different shapes; it consists in bringing the robot to a goal position while avoiding obstacles and
releasing it from tight corners and deadlock-shaped obstacles. In this framework, we use the reinforcement
learning algorithm called Fuzzy Actor-Critic Learning, based on the temporal-difference prediction method.
The application was tested on our experimental Pioneer II platform.
1 INTRODUCTION
In this article, we propose a reinforcement learning
method in which the apprentice actively explores its
environment. It applies various actions in order to
discover the states that cause the emission of rewards
and punishments. The agent must find the action to
carry out when it is in a given situation; it must learn
how to choose the optimal actions to achieve the fixed
goal. The environment can punish or reward the
system according to the applied actions. Each time
the agent applies an action, a critic gives it a reward
or a penalty to indicate whether the resulting state is
desirable or not (Sutton, 1998), (Glorennec, 2000).
The task of the agent is to learn, using these rewards,
the sequence of actions that obtains the greatest
cumulative reward.
Mobile robotics is a privileged application field
of reinforcement learning (Fujii, 1998), (Smart, 2002),
(Babvey, 2003). This is related to the growing
importance, over the last few years, of autonomous
robotics without prior knowledge of the environment.
The goal is then to regard a behaviour as a mapping
from sensors to effectors. Learning in robotics
consists in automatically modifying the behaviour of
the robot to improve its performance in its
environment. The behaviours are synthesized starting
from a simple definition of the objectives through a
reinforcement function.
The considered approach to robot navigation,
which uses a fuzzy inference system as the
apprentice, is able to tolerate certain errors in the
information about the system. For example, with
fuzzy logic we can process vague data. The
perception of the environment by ultrasonic sensors
and reinforcement learning thus prove to be
particularly well adapted to each other (Beom, 1995),
(Fukuda, 1995), (Jouffe, 1997), (Faria, 2000).
It is very difficult to determine correct
conclusions manually in a FIS with a large rule base
so as to ensure the release from tight corners and
deadlock-shaped obstacles, even when a gradient
descent method or a potential-field technique is used,
because of the local-minimum problem. In such
situations the robot will be blocked.
Behaviours made up of a fusion of «goal
seeking» and «obstacle avoidance» are presented.
The method we suggest ensures navigation in
unknown environments containing obstacles of
different shapes; the behaviours are realised with a
FIS whose conclusions are determined by
reinforcement learning. The algorithms are written
using the Matlab software after having integrated, in
a Simulink block, the perception, localization and
motricity functions of the robot.
The application was tested on our experimental
Pioneer II platform.
2 FACL ALGORITHM
We have selected a zero-order Takagi-Sugeno FIS
as the apprentice because of its simplicity, its
universal approximator property, its generalization
capacity and its suitability for real-time applications.
The input variables have triangular and trapezoidal
membership functions.
The FIS thus consists of N rules of the following
form (Glorennec, 2000):

If situation then Y1 = v[i] and Y2 = u[i,1] with q[i,1],
                                Y2 = u[i,2] with q[i,2],
                                ...,
                                Y2 = u[i,J] with q[i,J]
In the FACL algorithm (Jouffe, 1997), each rule R_i
of the apprentice has:
- a conclusion v_i, used for the approximation of the
evaluation function V^π of the current policy,
initialized to zero (critic);
- a set U^i of discrete actions, identical for all rules;
- a vector of parameters q_i indicating the quality of
the various discrete actions available and intervening
in the definition of the current policy (actor).
The characteristics of the input-variable membership
functions (number, position) are fixed. The number
of rules is also fixed. Thus, the only modifiable
characteristics of the apprentice are the conclusions
v_i (critic) and the election of an action u[i,j] among
the J available actions (actor) in each rule R_i.
The FACL algorithm uses two types of learning:
temporal differences for training the critic's
predictions of reinforcement, and a competition
process between the available actions for the actor
(Jouffe, 1997).
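To make this structure concrete, a minimal Python sketch of such an apprentice is given below; the class and attribute names are ours, and the membership functions are left abstract since their exact parameters are problem-dependent.

import numpy as np

class FACLRuleBase:
    """Sketch of the apprentice: a zero-order Takagi-Sugeno FIS whose rules
    each hold a critic conclusion v[i] and action qualities q[i, j]."""

    def __init__(self, n_rules, actions):
        self.actions = np.asarray(actions, dtype=float)  # discrete action set U^i, shared by all rules
        self.v = np.zeros(n_rules)                       # critic conclusions v[i], initialized to zero
        self.q = np.zeros((n_rules, len(actions)))       # action qualities q[i, j] (actor)

    def truth_values(self, state):
        """Return the vector Phi_t of rule truth values for the current state.
        The membership functions are fixed and problem-dependent; only the
        interface is declared here."""
        raise NotImplementedError

    def critic_value(self, phi):
        # V(S_t) = Phi_t^T . v_t, see equation (1) below
        return float(phi @ self.v)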
2.1 Critic
The role of the critic is to approximate the
evaluation function, which constitutes a better
criterion for the choice of the actions than that
represented by the primary reinforcements. The
critic value in the state S_t is inferred from the
conclusions vector v_t:

V_t(S_t) = \sum_{R_i \in A_t} \alpha_{R_i}(S_t) \, v_t^i = \Phi_t^T \cdot v_t        (1)

where Φ_t represents the apprentice perception at
time step t (i.e. it contains the truth values of the
activated rules A_t for the state S_t).
The approximation error of the critic is given by
the temporal-difference (TD) error:

\tilde{\varepsilon}_{t+1} = r_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)        (2)
The critic uses this error to update the
conclusions vector v by a traditional stochastic
gradient descent:

v_{t+1} = v_t + \beta \, \tilde{\varepsilon}_{t+1} \, \nabla_v V_t(S_t) = v_t + \beta \, \tilde{\varepsilon}_{t+1} \, \Phi_t        (3)
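A compact sketch of one critic step, assuming the truth-value vectors are already computed; the parameter values shown are placeholders rather than the tuned values of the paper.

import numpy as np

def critic_step(v, phi_t, phi_t1, r_t1, gamma=0.9, beta=0.1):
    """One critic update following equations (1)-(3).
    v      : vector of rule conclusions v_t
    phi_t  : truth values Phi_t of the rules for state S_t
    phi_t1 : truth values Phi_{t+1} for the next state S_{t+1}
    r_t1   : primary reinforcement r_{t+1}"""
    V_t = phi_t @ v                          # V_t(S_t), eq. (1)
    V_t1 = phi_t1 @ v                        # V_t(S_{t+1})
    td_error = r_t1 + gamma * V_t1 - V_t     # TD error, eq. (2)
    v_new = v + beta * td_error * phi_t      # gradient step, eq. (3)
    return v_new, td_error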
2.2 Actor
Concerning the actor, local actions are elected
for each activated rule on the basis of the quality of
these actions, and also according to an exploration
policy which we will describe further on.
The global action for the state S_t is then
determined by the inference of these locally elected
actions:

U_t(S_t) = \sum_{R_i \in A_t} Election_{U^i}(q_t^i) \, \alpha_{R_i}(S_t) = Election(q_t) \cdot \Phi_t^T        (4)
where Election is a function returning the action
elected for each activated rule.
The training of the optimal policy thus consists
in adjusting the vector of parameters q so that the
induced policy is improved.
Again, the TD error provides a measurement of
this quality improvement. We then obtain the
following rule for training the actor:
q_{t+1}^i(U_t^i) = q_t^i(U_t^i) + \tilde{\varepsilon}_{t+1} \, \alpha_{R_i}(S_t), \quad \forall R_i \in A_t        (5)
The above expression shows that a positive TD
error implies that the action that has just been applied
is preferable to the t-optimal action; it is then
necessary to increase the quality of the applied action.
Reciprocally, if the TD error is negative, it is
necessary to decrease the quality of the applied
action, because it led the system to a state whose
evaluation is lower than expected.
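The corresponding actor update of equation (5) can be sketched as follows; representing the elected actions as one index per rule is our own convention.

import numpy as np

def actor_update(q, elected, phi_t, td_error):
    """Quality update of equation (5): the quality of the action elected in
    each activated rule is moved in the direction of the TD error, weighted by
    the rule truth value. 'elected[i]' is the index of the action chosen in
    rule i (only meaningful for activated rules)."""
    q = q.copy()
    for i, alpha in enumerate(phi_t):
        if alpha > 0.0:                            # rule R_i is activated (R_i in A_t)
            q[i, elected[i]] += td_error * alpha   # eq. (5)
    return q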
2.3 Eligibility Traces
The traces implementation for the critic and the actor
follows directly from the incremental version of TD
learning. Let Φ̄_t be the trace of the critic at step t;
Φ̄_t is a short-term memory of the states visited by
the apprentice. This memory is based on the
apprentice perception, i.e. the truth values of the
rules:

\bar{\Phi}_t = \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \, \Phi_k = \gamma\lambda \, \bar{\Phi}_{t-1} + \Phi_t        (6)

where λ is the proximity factor of the critic.
An equivalent trace for the actor consists in
memorizing the actions applied in the states. We use
a short-term memory of the truth values of each rule
associated with each action available in these rules.
Let e_t^i(U^i) be the trace value of the action U^i in the
rule R_i at step t (Jouffe, 1997):

e_t^i(U^i) = \gamma\lambda' \, e_{t-1}^i(U^i) + \phi_t^i    if U^i = U_t^i
e_t^i(U^i) = \gamma\lambda' \, e_{t-1}^i(U^i)               else        (7)

where λ' is the proximity factor of the actor and φ_t^i
the truth value of the rule R_i at step t.
The updates of the parameters given by (3) and
(5) for all the rules and actions then become:

v_{t+1} = v_t + \beta \, \tilde{\varepsilon}_{t+1} \, \bar{\Phi}_t        (8)

q_{t+1} = q_t + \tilde{\varepsilon}_{t+1} \, e_t        (9)
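A possible implementation of the trace recursions (6) and (7), with the decay factors passed as placeholder arguments:

import numpy as np

def update_traces(phi_bar, e, phi_t, elected, gamma=0.9, lam_critic=0.9, lam_actor=0.9):
    """Eligibility-trace recursions of equations (6) and (7).
    phi_bar : critic trace (one value per rule)
    e       : actor trace of shape (n_rules, n_actions)
    elected : index of the action elected in each rule at step t"""
    phi_bar = gamma * lam_critic * phi_bar + phi_t   # eq. (6)
    e = gamma * lam_actor * e                        # decay all action traces, eq. (7)
    for i, alpha in enumerate(phi_t):
        if alpha > 0.0:
            e[i, elected[i]] += alpha                # reinforce the trace of the elected action
    return phi_bar, e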
2.4 Description of the FACL Execution
Procedure
The execution of one time step can be divided into
six principal stages. Let t+1 be the current time step;
the apprentice has applied the action U_t elected at
the previous time step and has received the primary
reinforcement r_{t+1} for the transition from the state
S_t to S_{t+1}. After the calculation of the truth values
of the rules, the six stages are as follows (Jouffe,
1997):
1- Calculation of the t-optimal evaluation
function of the current state by the critic:

V_t(S_{t+1}) = \Phi_{t+1}^T \cdot v_t        (10)
As proposed by Baird (Baird, 1995), to eliminate
the source of instability due to the approximation by
the FIS, we do not perform a gradient descent on the
Bellman residual defined by the TD error but on the
mean quadratic Bellman residual. The modification
to be made to the eligibility trace is then given by the
following relation:

\bar{\Phi}_t \leftarrow \bar{\Phi}_t - \gamma\rho \, \Phi_{t+1}, \quad \rho \in [0,1]        (11)
2- Calculation of the TD error:

\tilde{\varepsilon}_{t+1} = r_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)        (12)
3- Update of the learning rates corresponding to
the critic parameters for all rules, in three stages. In
order to accelerate learning and to avoid instability,
we adopt a heuristic adaptive learning rate (Jouffe,
1997). The heuristic is implemented by means of the
Delta-Bar-Delta rule:

- \delta_t^i = \tilde{\varepsilon}_{t+1} \, \bar{\phi}_t^i

- \beta_{t+1}^i = \beta_t^i + k              if \bar{\delta}_{t-1}^i \, \delta_t^i > 0
  \beta_{t+1}^i = (1 - \psi) \, \beta_t^i    if \bar{\delta}_{t-1}^i \, \delta_t^i < 0
  \beta_{t+1}^i = \beta_t^i                  else        (13)

- \bar{\delta}_t^i = (1 - \psi) \, \delta_t^i + \psi \, \bar{\delta}_{t-1}^i

where, in our case:
- β_t^i is the learning rate of the critic for the rule R_i
at time step t;
- δ_t^i represents the variation brought to the critic
parameter, in which we propose to integrate the
eligibility trace directly rather than simply the partial
derivative;
- δ̄_t^i represents the exponential average of these
variations.
This rule increases the learning rates linearly to
prevent them from becoming too large, and decreases
them exponentially so that they diminish quickly.
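A minimal sketch of this Delta-Bar-Delta adaptation, with k and ψ as illustrative values (the paper does not report the ones it used):

import numpy as np

def delta_bar_delta(beta, delta_bar, td_error, phi_bar, k=0.01, psi=0.1):
    """Adaptation of the per-rule critic learning rates following eq. (13)."""
    delta = td_error * phi_bar                        # current variation delta_t^i
    same_sign = delta_bar * delta
    beta = np.where(same_sign > 0, beta + k,          # linear increase
           np.where(same_sign < 0, (1 - psi) * beta,  # exponential decrease
                    beta))
    delta_bar = (1 - psi) * delta + psi * delta_bar   # exponential average of the variations
    return beta, delta_bar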
4- Training of the critic and the actor by updating
the vector v and the matrix q:

v_{t+1} = v_t + \beta_{t+1} \, \tilde{\varepsilon}_{t+1} \, \bar{\Phi}_t        (14)

q_{t+1} = q_t + \tilde{\varepsilon}_{t+1} \, e_t        (15)
5- It is again necessary to calculate the t-optimal
evaluation function of the current state by the critic,
but this time with the newly updated parameters:

V_{t+1}(S_{t+1}) = \Phi_{t+1}^T \cdot v_{t+1}        (16)

This value will be used for the TD error
calculation at the next time step.
6- There remains the choice of the action to be
applied in the state S_{t+1}. We consider the case of
continuous actions, where the global action is
inferred from the various actions elected in each rule:

U_{t+1}(S_{t+1}) = \sum_{R_i \in A_{t+1}} Election_{U^i}(q_{t+1}^i) \, \alpha_{R_i}(S_{t+1}), \quad U_{t+1}^i \in U^i

where Election is defined by:

Election_U(q_{t+1}^i) = \mathrm{ArgMax}_{U \in U^i} \left( q_{t+1}^i(U) + \eta(U) + \rho(U) \right)        (17)
The eligibility trace updates for the critic and
the actor are given by the two following formulas:

\bar{\Phi}_{t+1} = \gamma\lambda \, \bar{\Phi}_t + \Phi_{t+1}        (18)

e_{t+1}^i(U^i) = \gamma\lambda' \, e_t^i(U^i) + \phi_{t+1}^i    if U^i = U_{t+1}^i
e_{t+1}^i(U^i) = \gamma\lambda' \, e_t^i(U^i)                   else        (19)
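Putting the six stages together, one time step of the procedure might be organised as in the following self-contained sketch; the 'model' object and its attribute names (v, q, beta, delta_bar, phi_bar, e, lam, lam_actor, elect) are our own conventions, and the default γ and ρ simply reuse the values reported in section 4.1.

import numpy as np

def facl_step(model, phi_t, phi_t1, r_t1, gamma=0.9, rho=0.5, k=0.01, psi=0.1):
    """One FACL time step covering stages 1-6; k and psi are placeholders."""
    # 1- t-optimal evaluation of the new state with the old parameters, eq. (10)
    V_t1 = phi_t1 @ model.v
    V_t = phi_t @ model.v
    model.phi_bar -= gamma * rho * phi_t1          # Baird-style trace correction, eq. (11)
    # 2- TD error, eq. (12)
    td = r_t1 + gamma * V_t1 - V_t
    # 3- Delta-Bar-Delta adaptation of the per-rule critic rates, eq. (13)
    delta = td * model.phi_bar
    sign = model.delta_bar * delta
    model.beta = np.where(sign > 0, model.beta + k,
                 np.where(sign < 0, (1 - psi) * model.beta, model.beta))
    model.delta_bar = (1 - psi) * delta + psi * model.delta_bar
    # 4- Critic and actor updates, eqs. (14)-(15)
    model.v = model.v + model.beta * td * model.phi_bar
    model.q = model.q + td * model.e
    # 5- Re-evaluation of the new state with the updated parameters, eq. (16)
    V_next = phi_t1 @ model.v
    # 6- Election of the next local actions (eq. (17)) and trace updates, eqs. (18)-(19)
    elected = model.elect(phi_t1, model.q)
    model.phi_bar = gamma * model.lam * model.phi_bar + phi_t1
    model.e = gamma * model.lam_actor * model.e
    for i, alpha in enumerate(phi_t1):
        model.e[i, elected[i]] += alpha
    return elected, V_next, td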
2.5 Proposed Exploration /
Exploitation
The action election strategy that we used is a
combination of directed and random exploration
(Jouffe, 1997). The global election function, applied
to each rule, is then defined by:

Election_U(q) = \mathrm{ArgMax}_U \left( q(U) + \eta(U) + \rho(U) \right)        (20)

where U represents the set of discrete actions
available in each rule, q(U) the associated quality
vector, η(U) the random exploration term, and ρ(U)
the directed exploration term.
The term η is in fact a vector of random values.
It corresponds to a vector ψ of values sampled
according to an exponential law, scaled in order to
take into account the size of the qualities q:

s_f(U) = 1                                                                     if \mathrm{Max}(q(U)) = \mathrm{Min}(q(U))
s_f(U) = s_p \left( \mathrm{Max}(q(U)) - \mathrm{Min}(q(U)) \right) / \mathrm{Max}(\psi)    else        (21)

\eta(U) = s_f(U) \cdot \psi        (22)

where s_p represents the maximum size of the noise
relative to the amplitude of the qualities and s_f is the
corresponding normalisation factor. Used alone, this
exploration term allows a random choice of actions
when all qualities are identical and, otherwise, tests
only the actions whose quality satisfies:

q(U) \geq \mathrm{Max}(q(U)) - s_p \left( \mathrm{Max}(q(U)) - \mathrm{Min}(q(U)) \right)        (23)

This term thus allows us to avoid the choice of
bad actions (Jouffe, 1997).
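A sketch of this random exploration term for a single rule, assuming the qualities are held in a NumPy vector; s_p = 0.1 is the goal-seeking value reported later in section 4.1.

import numpy as np

def random_exploration(q_row, s_p=0.1, rng=np.random.default_rng()):
    """Random exploration term eta of equations (21)-(22) for one rule:
    exponential noise scaled to the spread of the action qualities."""
    psi = rng.exponential(scale=1.0, size=q_row.shape)   # vector of exponential samples
    spread = q_row.max() - q_row.min()
    if spread == 0.0:
        s_f = 1.0                                        # identical qualities: purely random choice
    else:
        s_f = s_p * spread / psi.max()                   # eq. (21)
    return s_f * psi                                     # eta(U) = s_f(U) * psi, eq. (22)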
The directed exploration term ρ allows testing
actions that have not been applied often. It is thus
necessary to memorize the number of times each
action has been elected. This term is defined in the
following way:

\rho(U) = \theta / e^{n_t(U)}        (24)

where θ represents a positive factor which balances
the directed exploration and n_t(U) is the number of
applications of the action U up to time step t. In the
case of continuous actions, n_t(U) corresponds to the
number of applications of the discrete action U in the
considered rule. Let U_t^i be the discrete action elected
at time step t in the rule R_i; the update of this variable
is determined by the following equation:

n_{t+1}^i(U_t^i) = n_t^i(U_t^i) + 1, \quad \forall R_i \in A_t        (25)
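Combining the two terms, the local election of equation (20) for one rule could be sketched as follows; the function name and the in-place counter update are our own conventions, and θ = 50, s_p = 0.1 are the goal-seeking values of section 4.1.

import numpy as np

def elect_action(q_row, n_row, theta=50.0, s_p=0.1, rng=np.random.default_rng()):
    """Local election combining quality q, random term eta (eqs. (21)-(22))
    and directed term rho = theta / exp(n) (eq. (24))."""
    psi = rng.exponential(size=q_row.shape)
    spread = q_row.max() - q_row.min()
    s_f = 1.0 if spread == 0.0 else s_p * spread / psi.max()
    eta = s_f * psi                                  # random exploration
    rho = theta / np.exp(n_row)                      # directed exploration, eq. (24)
    j = int(np.argmax(q_row + eta + rho))            # eq. (20)
    n_row[j] += 1                                    # eq. (25): count the election
    return j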
3 EXPERIMENTAL PLATFORM
The Pioneer P2-DX, with its modest size, lends itself
well to navigation in tight corners and encumbered
spaces such as the laboratories and small offices. It
has two driving wheels and a castor wheel.
Figure 1: Pioneer II Robot Photo.
To detect obstacles, the robot is equipped with a set
of ultrasonic sensors: eight sensors placed at its front
(Figure 2). The measurement range of these sensors
lies between 10 cm as a minimum and 5 m as a
maximum.
Figure 2: Position of the ultrasonic sensors used in the
robot Pioneer II.
The Saphira/Aria software (Konolige, 2002a),
(Konolige, 2002b) allows the control of the robot
(C/C++ programming). We have integrated the
Saphira functions (perception, localization and
motricity) into Simulink using the API (S-Function).
Thus we can benefit from the computing power of
Matlab and the simplicity of Simulink to test our
algorithms and to control the Pioneer II robot.
4 EXPERIMENTATION
The task of the robot consists in starting from an
initial point and reaching a fixed goal while avoiding
static obstacles of convex or concave type. It is
realised by the fusion of two elementary behaviours,
«goal seeking» and «obstacle avoidance».
4.1 The «goal seeking» Behaviour
For the «goal seeking» behaviour, we consider three
membership functions for the input θ_Rb (robot-goal
angle) and two membership functions for the input
ρ_b (robot-goal distance) (Figure 3). The knowledge
base consists of six fuzzy rules. The FIS controller
has two outputs for the actor (the rotation speed Vrot
and the translation speed Vit) and one output for the
critic, which is related to the evaluation function of
the action pair (Vrot, Vit).
Figure 3: Membership functions for θ_Rb (a) and ρ_b (b).
Because of the heuristic nature of the FACL
algorithm, we carried out several tests to determine
the parameter values that accelerate learning and give
good performances for the apprentice. After a series
of experiments, we found the following values:
θ = 50, λ = λ' = 0.9, ρ = 0.5, s_p = 0.1 and γ = 0.9.
The actions available in all rules are {-20°/s,
-10°/s, -5°/s, 0°/s, +5°/s, +10°/s, +20°/s} for the
rotation speed and {0 mm/s, 150 mm/s, 350 mm/s}
for the translation speed, which gives 21 possible
action pairs in each fuzzy rule.
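For illustration, these 21 action pairs are simply the Cartesian product of the two discrete sets (the constant names below are ours):

from itertools import product

# Discrete action sets used for the goal-seeking behaviour (section 4.1):
# 7 rotation speeds x 3 translation speeds = 21 (Vrot, Vit) pairs per fuzzy rule.
ROTATION_DEG_S = [-20, -10, -5, 0, 5, 10, 20]
TRANSLATION_MM_S = [0, 150, 350]

ACTIONS = list(product(ROTATION_DEG_S, TRANSLATION_MM_S))
assert len(ACTIONS) == 21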
The reinforcement function is defined as
follows, where Δθ_Rb denotes the variation of the
robot-goal angle between two time steps:
- If the robot is far from the goal, the
reinforcement is equal to:
  +1 if (θ_Rb · Δθ_Rb < 0) and Vit = 0
  +1 if (-1° < θ_Rb < +1°) and Vit ≠ 0
  0 if (θ_Rb · Δθ_Rb = 0) and Vit = 0
  -1 otherwise
- If the robot is close to the goal, the
reinforcement is equal to:
  +1 if (θ_Rb · Δθ_Rb < 0) and Vit = 0
  -1 otherwise
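Under our reading of this definition, the reinforcement could be coded as follows; the function and argument names are hypothetical.

def goal_seeking_reward(theta, d_theta, vit, near_goal, angle_tol_deg=1.0):
    """Reinforcement for the goal-seeking behaviour as reconstructed from
    section 4.1: theta is the robot-goal angle, d_theta its variation and vit
    the translation speed."""
    if near_goal:
        return 1 if (theta * d_theta < 0 and vit == 0) else -1
    if theta * d_theta < 0 and vit == 0:
        return 1
    if -angle_tol_deg < theta < angle_tol_deg and vit != 0:
        return 1
    if theta * d_theta == 0 and vit == 0:
        return 0
    return -1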
Figure 4 shows the trajectory of the robot
during the training and validation phases.
Figure 4: Trajectory of the robot during and after training.
4.2 The «obstacle avoidance»
Behaviour
For this behaviour, we have made the
translation speed of the robot proportional to the
distance to the frontal obstacles, with a maximum
value of 350 mm/s (Figure 5).
Figure 5: Translation speed Vit.
The inputs of the fuzzy controller for this behaviour
are the minimal distances provided by the four sets
of sonars {min(d90, d50), min(d30, d10),
min(g10, g30), min(g90, g50)}, with three
membership functions for each one (Figure 6).
Figure 6: Input membership functions for the fuzzy
controller Vrot; (a): {min(d90, d50), min(g90, g50)},
(b): {min(d10, d30), min(g10, g30)}.
The set of actions U, common to all rules,
consists of five actions {-20°/s, -10°/s, 0°/s, +10°/s,
+20°/s}.
The reinforcement function is defined as follows:
  +1 if min{min(d90, d50), min(d30, d10)} < min{min(g10, g30), min(g90, g50)} and Vrot > 0
  +1 if min{min(d90, d50), min(d30, d10)} > min{min(g10, g30), min(g90, g50)} and Vrot < 0
  -1 otherwise
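A direct transcription of this reinforcement, assuming d and g denote the right-hand and left-hand sonar groups respectively; the function name is ours.

def obstacle_avoidance_reward(d90, d50, d30, d10, g10, g30, g90, g50, vrot):
    """Reinforcement for the obstacle-avoidance behaviour (section 4.2):
    reward turning away from the side with the closer obstacle."""
    right = min(min(d90, d50), min(d30, d10))   # closest obstacle on the d side
    left = min(min(g10, g30), min(g90, g50))    # closest obstacle on the g side
    if right < left and vrot > 0:
        return 1
    if right > left and vrot < 0:
        return 1
    return -1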
Figure 7 shows the evolution of the robot
during the training phase.
Figure 8 shows the time evolution of the
robot; it shows that the robot is able to release itself
from tight corners and deadlock-shaped obstacles.
Figure 7: Trajectories of the robot during and after
training.
Figure 8: Time evolution of the robot after training.
4.3 Fusion of the Two Behaviours
The goal of the fusion of the two elementary
behaviours is to allow the robot to navigate in
environments composed of fixed obstacles of convex
or concave type and to reach a fixed goal while
ensuring its safety, which is a fundamental point in
reactive navigation.
The suggested solution consists in considering
the whole set of input variables of the two
behaviours, «goal seeking» and «obstacle
avoidance», associated with a distributed
reinforcement function obtained by means of a
weighting coefficient between the two behaviours
(0.7 for obstacle avoidance and 0.3 for goal seeking).
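The paper does not spell out the exact combination rule; a weighted sum of the two elementary reinforcements is one plausible reading of these coefficients, sketched below.

def combined_reward(r_avoid, r_goal, w_avoid=0.7, w_goal=0.3):
    """Distributed reinforcement for the fused behaviour (section 4.3):
    a weighted combination of the obstacle-avoidance and goal-seeking rewards,
    using the 0.7 / 0.3 weighting coefficients given in the text."""
    return w_avoid * r_avoid + w_goal * r_goal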
The fuzzy controller has six inputs: the
minimal distances provided by the four sets of sonars
{min(d90, d50), min(d30, d10), min(g10, g30),
min(g90, g50)}, with three membership functions for
the lateral sets and two membership functions for the
frontal sets, to which we add the input θ_Rb with three
membership functions and the input ρ_b with two
membership functions. The rule base
thus consists of 216 fuzzy rules. The translation
speed of the robot is proportional to the distance to
the frontal obstacles. The FACL algorithm
parameters are slightly modified as follows: θ = 20
and s_p = 0.9.
Figure 9 represents the type of trajectory
obtained during the training phase. The robot
manages to avoid the obstacles and to reach the
assigned goal in cluttered environments while
maintaining a reasonable lateral distance from the
obstacles. Figure 10 illustrates the satisfactory
behaviour of the robot after training. The various
trajectories obtained for the same environment and
the same starting and arrival points are due primarily
to problems in the perception of the environment by
the robot, related to the sonar reading phenomena;
these differences also confirm the effectiveness of
the training algorithm.
Figure 9: Trajectories of the robot in training phase.
Figure 10: Various trajectories of the robot after training.
Figure 11: Trajectory of the real robot after training.
Figure 11 illustrates the satisfactory behaviour
of the real robot, which moves in an unknown
environment and manages to reach the fixed goal.
5 CONCLUSION
The FACL algorithm makes it possible to introduce
generalization into the state and action spaces and to
adapt the conclusions of a zero-order Takagi-Sugeno
FIS incrementally, solely by means of the
interactions between the apprentice and its
environment. The reinforcement function constitutes
the measure of performance of the required
behaviour. In addition, the fusion of the behaviours
«goal seeking» and «obstacle avoidance» is
presented using a combined reinforcement function.
The simulation and experimentation results in
various environments are satisfactory.
REFERENCES
Babvey, S., Momtahan, O., Meybodi, M. R., 2003. "Multi
Mobile Robot Using Distributed Value Function
Reinforcement Learning", IEEE International
Conference on Robotics and Automation, September
2003.
Baird, L. C., 1995. "Residual Algorithms: Reinforcement
Learning with Function Approximation", Proceedings
of the Twelfth International Conference on Machine
Learning, 1995.
Beom, H. R., Cho, H. S., 1995. "A Sensor-Based
Navigation for a Mobile Robot Using Fuzzy Logic and
Reinforcement Learning", IEEE Transactions on
Systems, Man and Cybernetics, vol. 25, no. 3, pp. 464-
477, March 1995.
Faria, G., Romero, R. A. F., 2000. "Incorporating Fuzzy
Logic to Reinforcement Learning", IEEE, pp. 847-852,
Brazil, 2000.
Fujii, T., Arai, Y., Asama, H., Endo, I., 1998.
"Multilayered Reinforcement Learning for
Complicated Collision Avoidance Problems",
Proceedings of the IEEE International Conference on
Robotics and Automation, pp. 2186-2191, May 1998.
Fukuda, T., Hasegawa, Y., Shimojima, K., Saito, F., 1995.
"Reinforcement Learning Method for Generating
Fuzzy Controller", Department of Micro System
Engineering, Nagoya University, Japan, IEEE,
pp. 273-278, 1995.
Glorennec, P. Y., 2000. "Reinforcement Learning: An
Overview", INSA de Rennes, ESIT'2000, Aachen,
Germany, pp. 17-35, September 2000.
Jouffe, L., 1997. "Apprentissage de Systèmes d'Inférence
Floue par des Méthodes de Renforcement", Thèse de
Doctorat, IRISA, Université de Rennes I, June 1997.
Konolige, K. G., 2002a. "Saphira Référence", Edition
Doxygen, November 2002.
Konolige, K. G., 2002b. "Saphira Robot Control
Architecture", SRI International, April 2002.
Smart, W. D., Kaelbling, L. P., 2002. "Effective
Reinforcement Learning for Mobile Robots",
Proceedings of the IEEE International Conference on
Robotics and Automation, pp. 3404-3410, May 2002.
Sutton, R. S., Barto, A., 1998. "Reinforcement Learning:
An Introduction", MIT Press, Bradford Book, 1998.