6.2 Consequences of a Reinforcement Signal
With the reinforcement signal, the robot evaluates its previous actions and state transitions and thereby creates a new RLA network. This network has the same structure as the one described in section 2.2: nodes represent RLAs (not Reinforcement-RLAs) and edges represent transitions between RLAs, which can be interpreted as transitions between states. However, instead of attaching probabilities to these transitions, we store a weight w that records their value.
Every time the robot extracts a reinforcement signal, it evaluates the most recently used transitions. Therefore an RLA network is initialised in which all edge weights $w_{(x,y)} = 100$. To reinforce a state transition, we define reinforcement factors $rf_j$ for "Good" and "Bad" reinforcements, where $j$ is a natural number that represents the distance to the reinforced state. After a reinforcement signal is given, we update the weights $w_{(x,y)}$ of the previous transitions as follows:

$w_{(x,y),i+1} = w_{(x,y),i} \cdot rf_{j}$
For example, consider an episode of RLA W, RLA X, RLA Y and RLA Z in which a "Good" reinforcement signal is given while RLA Z is performed. Let the reinforcement factors $rf_j$ for a "Good" reinforcement be $rf_1 = 0.9$, $rf_2 = 0.95$ and $rf_3 = 0.97$. The values in the RLA network are then updated as follows:

$w_{(y,z),i+1} = w_{(y,z),i} \cdot 0.9$
$w_{(x,y),i+1} = w_{(x,y),i} \cdot 0.95$
$w_{(w,x),i+1} = w_{(w,x),i} \cdot 0.97$
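This update can be summarised in a short Python sketch. The class and function names are ours, not the paper's, and the "Bad" factors are assumed for illustration (only the "Good" factors are given above); the example call reproduces the episode just described.

REINFORCEMENT_FACTORS = {
    "Good": [0.9, 0.95, 0.97],   # rf_1, rf_2, rf_3 as in the example above
    "Bad":  [1.1, 1.05, 1.03],   # assumed values for a "Bad" reinforcement
}

class EffectivenessNetwork:
    def __init__(self):
        # weights w_(x,y), keyed by (previous RLA, next RLA)
        self.weights = {}

    def weight(self, x, y):
        # every transition starts at the neutral value 100
        return self.weights.setdefault((x, y), 100.0)

    def reinforce(self, episode, signal):
        """episode: sequence of RLAs, e.g. ["W", "X", "Y", "Z"];
        the reinforcement signal was given during the last RLA."""
        factors = REINFORCEMENT_FACTORS[signal]
        # walk backwards over the most recently used transitions
        transitions = list(zip(episode, episode[1:]))[::-1]
        for j, (x, y) in enumerate(transitions[:len(factors)]):
            self.weights[(x, y)] = self.weight(x, y) * factors[j]

net = EffectivenessNetwork()
net.reinforce(["W", "X", "Y", "Z"], "Good")
# -> w_(Y,Z) = 90.0, w_(X,Y) = 95.0, w_(W,X) = 97.0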
We thus obtain an RLA network whose weights represent the effectiveness of a transition. If $w_{(x,y),i} < 100$, the transition between RLA X and RLA Y is preferred; the lower the value of $w_{(x,y),i}$, the better the transition. If $w_{(x,y),i} = 100$, the transition between RLA X and RLA Y is neutral. If $w_{(x,y),i} > 100$, the transition between RLA X and RLA Y should be avoided; the higher the value of $w_{(x,y),i}$, the worse the transition.
We call this RLA network the "effectiveness network"; the RLA network described in section 3, which contains the probabilities, is called the "probability network".
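How the effectiveness weights enter the robot's reactive action planning is only summarised here; the following minimal sketch assumes that, among the candidate successor RLAs, the robot simply prefers the transition with the lowest weight, falling back to the neutral value 100 for transitions that have never been reinforced. Function and variable names are ours.

def choose_next_rla(weights, current_rla, candidates):
    # weights: dict mapping (RLA, RLA) pairs to effectiveness values
    return min(candidates,
               key=lambda rla: weights.get((current_rla, rla), 100.0))

# With the weights from the episode above:
weights = {("W", "X"): 97.0, ("X", "Y"): 95.0, ("Y", "Z"): 90.0}
choose_next_rla(weights, "Y", ["Z", "X"])   # -> "Z" (weight 90 < 100)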
The probability network is built during all four phases of learning, whereas the effectiveness network is only built in the fourth phase. In this phase the robot performs reactive action planning and evaluates the situations it reaches. This fourth phase can last for the robot's entire lifetime, because at this point the robot is autonomous and no longer requires a teacher.
7 RESULTS OF EXPERIMENTS
We trained three behaviours to show the practicability
and the effectiveness of our learning process. We used
three different types of attention focus:
1. (3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 50)
2. (2, 2, 2, 2, 2, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 50)
3. (1, 1, 1, 1, 1, 7, 7, 5, 5, 5, 1, 1, 1, 1, 1, 7, 7, 5, 5, 5, 50)
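For illustration, the three attention foci can be written down as 21-component weight vectors. How the components map to the robot's individual sensors, and that they act as element-wise weights on the sensor readings, are assumptions made for this sketch rather than details taken from the implementation.

ATTENTION_FOCI = {
    1: (3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 50),
    2: (2, 2, 2, 2, 2, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 50),
    3: (1, 1, 1, 1, 1, 7, 7, 5, 5, 5, 1, 1, 1, 1, 1, 7, 7, 5, 5, 5, 50),
}

def focus_sensors(readings, focus_id):
    # element-wise weighting of the 21 sensor values (assumed usage);
    # a weight of 0 suppresses the corresponding sensor entirely
    return [w * r for w, r in zip(ATTENTION_FOCI[focus_id], readings)]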
7.1 Collision Avoidance
In this behaviour we used attention focus number 1
which considers only front sensors. With this we can
simulate a braitenberg vehicle (Braitenberg, 1986) to
achieve a collision avoidance behaviour. All other
sensors are suppressed.
The robot achieved a satisfying collision avoidance behaviour with 21 RLAs, 13 of which were created during the first phase of learning. The robot took 8:20 minutes to learn this behaviour (see table 7.4).
The teacher created 11 Reinforcement-RLAs in the third phase, which lasted five minutes. These Reinforcement-RLAs were used by the robot to evaluate situations by itself for another 15 minutes. After this fourth phase of learning, the best transition in the robot's effectiveness network had a value of 0.01 and the worst a value of 159.
7.2 Wall-Following
This behaviour is a left-handed Wall-Follower and the
robot learned it with attention focus number 2. This
behavior considers the position sensors the most, and
the front sensors were necessary to perform in con-
cave edges.
The robot achieved a satisfying wall-following behaviour with 92 RLAs, 81 of which were created during the first phase of learning, which lasted 11:07 minutes. Only a few new RLAs were created in extraordinary situations during the second phase of learning (see table 7.4).
The teacher created 12 Reinforcement-RLAs in the third phase, which lasted five minutes. These Reinforcement-RLAs were used by the robot to evaluate the situations it reached by itself for another 10 minutes (see table 7.4). After this fourth phase of learning, the best transition in the robot's effectiveness network had the value 80.26 and the worst 826. This high value, which marks a transition to be avoided, is caused by a situation in which the robot gets too close to a wall and corrects this in the next step; this situation occurred very often.