This paradigm corresponds to one of the fundamental objectives of mobile robotics, which constitutes a privileged field of application for reinforcement learning. The suggested paradigm regards behaviour as a sensor-effector correspondence function. The objective is to promote robot autonomy using learning algorithms.
In this article, we use the reinforcement learning algorithm Fuzzy Q-Learning (FQL) (Jouffe, 1996), (Souici, 2005), which allows the adaptation of learners of the SIF type (continuous states and actions); fuzzy Q-learning is applied to select the consequent action values of a fuzzy inference system. In these methods, the consequent value is selected from a predefined value set which is kept unchanged during learning, and if an improper value set is assigned, the algorithm may fail. The suggested approach, called Fuzzy-Q-Learning Genetic Algorithm (FQLGA), is therefore a hybrid reinforcement-genetic method combining FQL and genetic algorithms for on-line optimization of the parametric characteristics of a SIF. In FQLGA the free parameters (precondition and consequent parts) are tuned by genetic algorithms (GAs), which are able to explore the space of solutions effectively.
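As an illustration only (a minimal sketch under our own assumptions, not the implementation described in this paper), a real-coded GA individual for tuning a SIF could encode the precondition part as membership-function centres and widths and the consequent part as one action value per rule, with the cumulated reinforcement serving as fitness; all names below are hypothetical.

import random

# Hypothetical real-coded chromosome for GA tuning of a SIF:
# precondition part = membership-function centres and widths,
# consequent part  = one action value per fuzzy rule.
def random_individual(n_sets, n_rules, action_range=(-1.0, 1.0)):
    return {
        "centres": [random.uniform(0.0, 1.0) for _ in range(n_sets)],
        "widths": [random.uniform(0.1, 0.5) for _ in range(n_sets)],
        "consequents": [random.uniform(*action_range) for _ in range(n_rules)],
    }

def fitness(individual, run_episode):
    # Each candidate SIF is evaluated by the cumulated reinforcement
    # obtained while controlling the robot (run_episode is assumed to
    # return that sum of rewards).
    return run_episode(individual)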
This paper is organized as follows. In Section 2, an overview of reinforcement learning is given. The implementation and the limits of the Fuzzy-Q-Learning algorithm are introduced in Section 3. Section 4 describes the combination of Reinforcement Learning (RL) and genetic algorithms (GA) and the architecture of the proposed algorithm, called Fuzzy-Q-Learning Genetic Algorithm (FQLGA). This new algorithm is applied in Section 5 to the on-line learning of two elementary behaviors of mobile robot reactive navigation, “Go to Goal” and “Obstacle Avoidance”. Finally, conclusions and prospects are drawn in Section 6.
2 REINFORCEMENT LEARNING
As previously mentioned, there are two ways to learn: either you are told what to do in different situations, or you get credit or blame for doing good or bad things, respectively. The former is called supervised learning and the latter is called learning with a critic, of which reinforcement learning (RL) is the most prominent representative. The basic idea of RL is that agents learn behaviour through trial and error, and receive rewards for behaving in such a way that a goal is fulfilled.
The reinforcement signal measures the utility of the suggested outputs with respect to the task to be achieved; the received reinforcement is the sanction (positive, negative or neutral) of the behaviour: this signal states what should be done without saying how to do it. The goal of reinforcement learning is to find the most effective behaviour, i.e. to know, in each possible situation, which action to take in order to maximize the cumulated future rewards. Unfortunately the sum of rewards could be infinite for any policy. To solve this problem a discount factor is introduced:
$R = \sum_{k=0}^{\infty} \gamma^{k} r_{k}$    (1)

where $0 \le \gamma \le 1$ is the discount factor.
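As a brief numerical illustration (a minimal sketch, not part of the original text), the discounted return of Eq. (1) can be computed for a finite reward sequence as follows:

def discounted_return(rewards, gamma=0.95):
    # R = sum_k gamma^k * r_k, truncated to a finite episode (Eq. 1).
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# The same reward stream is worth less as the discount factor decreases:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75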
The idea of RL can be generalized into a model,
in which there are two components: an agent that
makes decisions and an environment in which the
agent acts. For every time step, the agent perceives
information from the environment about the current
state, s. The information perceived could be, for example, the position of a physical agent; to simplify, say its x and y coordinates. In every state, the agent takes an action $u_t$, which transits the agent to a new state. As mentioned before, when taking that action the agent receives a reward.
Formally, the model can be written as follows: at every time step t the agent is in a state $s_t \in S$, where $S$ is the set of all possible states, and in that state the agent can take an action $a_t \in A(s_t)$, where $A(s_t)$ is the set of all possible actions in the state $s_t$. As the agent transits to a new state $s_{t+1}$ at time $t+1$, it receives a numerical reward $r_{t+1}$. It then updates its estimate of the action evaluation function using the immediate reinforcement, $r_{t+1}$, and the estimated value of the following state, $V_t(s_{t+1})$, which is defined by:
$V_t(s_{t+1}) = \max_{u \in U} Q_t(s_{t+1}, u)$    (2)
The Q-value of each state/action pair is updated by:
$Q_{t+1}(s_t, u_t) = Q_t(s_t, u_t) + \beta\left( r_{t+1} + \gamma V_t(s_{t+1}) - Q_t(s_t, u_t) \right)$    (3)
where $r_{t+1} + \gamma V_t(s_{t+1}) - Q_t(s_t, u_t)$ is the TD error and $\beta$ is the learning rate.
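For clarity, Eqs. (2) and (3) correspond to the classical tabular Q-learning update sketched below (an illustration assuming discrete states and actions; the FQL algorithm used in this paper extends this update to a fuzzy inference system with continuous states and actions):

from collections import defaultdict

Q = defaultdict(float)       # Q[(state, action)] -> estimated value
beta, gamma = 0.1, 0.95      # learning rate and discount factor

def q_update(s_t, u_t, r_next, s_next, actions):
    # Eq. (2): greedy value of the next state.
    v_next = max(Q[(s_next, u)] for u in actions)
    # Eq. (3): temporal-difference error and Q-value update.
    td_error = r_next + gamma * v_next - Q[(s_t, u_t)]
    Q[(s_t, u_t)] += beta * td_error
    return td_error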