is a reinforcement function which tells the robot how
good or bad it has performed, but nothing about the
set of actions it should have carried out. Through
a stochastic exploration of the environment, the robot must find a control policy – the action to be executed in each state – which maximises the expected total reinforcement it will receive:
E\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right] \qquad (1)

where r_t is the reinforcement received at time t, and γ ∈ [0,1] is a discount factor which adjusts the relative significance of long-term versus short-term rewards.
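As a simple numerical illustration of Eq. 1 (a hypothetical Python sketch with made-up rewards, not part of the original system), the discounted return of a finite reward sequence can be computed as follows:

# Minimal sketch: discounted return of Eq. 1 for a finite reward
# sequence (hypothetical values, for illustration only).
def discounted_return(rewards, gamma=0.95):
    """Sum of gamma**t * r_t over the recorded rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps without failure, then a negative reinforcement.
print(discounted_return([0.0, 0.0, 0.0, -1.0], gamma=0.95))  # ~ -0.857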
Q-learning (Watkins, 1989) is one of the most popular reinforcement learning algorithms, although it might be slow when rewards occur infrequently. Eligibility traces (Watkins, 1989) expedite learning by adding more memory to the system. One problem with these algorithms is their dependence on the parameters used, which usually have to be set through a trial-and-error process.
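For reference, the standard tabular Q-learning update mentioned above (a generic textbook sketch, not the algorithm proposed in this work) can be written as follows, with the usual learning rate alpha and discount factor gamma:

# Standard tabular Q-learning update (Watkins, 1989), shown only for
# comparison with the learning rule proposed below. Q is a dict mapping
# (state, action) -> value; alpha and gamma are the usual parameters.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])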
In this work we present a new learning algorithm based on reinforcement. Our algorithm provides a prediction of how long the robot will be able to move before it makes a mistake. This results in clear and readable systems where it is easy to detect, for example, when the learning is not evolving properly: basically, a high discrepancy between the predicted time before failure and the one actually observed on the real robot. Another advantage of our learning proposal is that it is almost parameterless, so it minimises the adjustments needed when the robot operates in a different environment or performs a different task. The only parameter needed is a learning rate, which is not only easy to set, but often takes the same value regardless of the task to be learnt.
Since we wish to use the experience of each state
transition to improve the robot control policy in real
time, we shall apply Q-learning, but redefining the
utility function of states and actions. Q(s,a) will be
the expected time interval before a robot failure when
the robot starts moving in s, performs action a, and
follows the best possible control policy thereafter:
Q(s,a) = E\left[ -e^{-Tbf(s_0 = s,\; a_0 = a)/50T} \right], \qquad (2)
where Tbf(s_0, a_0) represents the expected time interval (in seconds) before the robot does something wrong, when it executes a in s and then follows the best possible control policy. T is the control period of the robot (expressed in seconds). The term -e^{-Tbf/50T} in Eq. 2 is a continuous function that takes values in the interval [-1, 0] and varies smoothly as the expected time before failure increases.
Since Q(s,a) and Tbf(s,a) are not known, we can only refer to their current estimations Q_t(s,a) and Tbf_t(s,a):

Tbf_t(s,a) = -50 \, T \, \ln\left( -Q_t(s,a) \right). \qquad (3)
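To make the mapping of Eqs. 2 and 3 concrete, the following Python sketch converts between Q-values and estimated times before failure (the function names and the value of the control period T are assumptions chosen only for illustration):

import math

# Mapping between Q-values and estimated time before failure (Tbf),
# following Eqs. 2 and 3. T is the robot control period in seconds
# (the value below is an assumption for illustration only).
T = 0.1

def q_from_tbf(tbf):
    """Q-value associated with an expected time before failure (s)."""
    return -math.exp(-tbf / (50.0 * T))

def tbf_from_q(q):
    """Inverse mapping of Eq. 3: Tbf in seconds from a Q-value."""
    return -50.0 * T * math.log(-q)

# A Q-value of -1 corresponds to an immediate failure (Tbf = 0 s),
# while Q-values close to 0 correspond to very long times before failure.
print(tbf_from_q(q_from_tbf(10.0)))  # ~ 10.0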
The definition of Q(s,a), Tbf, and the best possible control policy determine the relationship between the Q-values corresponding to consecutive states:
Tbf_t(s_t, a_t) =
\begin{cases}
T & \text{if } r_t < 0 \\
T + \max_{a} \left\{ Tbf_t(s_{t+1}, a) \right\} & \text{otherwise}
\end{cases}
\qquad (4)
r_t is the reinforcement the robot receives when it executes action a_t in state s_t. Combining Eq. 3 and Eq. 4, it follows that:
Q_{t+1}(s,a) =
\begin{cases}
-e^{-1/50} & \text{if } r_t < 0 \\
Q_t(s_t, a_t) + \delta & \text{otherwise}
\end{cases}
\qquad (5)
where

\delta = \beta \left( e^{-1/50} \, \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right). \qquad (6)
β ∈ [0,1] is a learning rate, and it is the only pa-
rameter whose value has to be set by the user.
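A minimal sketch of the resulting update step (Eqs. 5 and 6), assuming a tabular Q-value store and a discrete action set, could look as follows; the function name and the default learning rate are illustrative assumptions, not taken from the original work:

import math

# One update of the proposed Q function (Eqs. 5 and 6).
# Q is a dict mapping (state, action) -> value in [-1, 0];
# beta is the only user-set parameter (learning rate).
def update_q(Q, s, a, r, s_next, actions, beta=0.3):
    if r < 0:
        # A failure occurred: the expected time before failure is one
        # control period, i.e. Q = -e^(-T/50T) = -e^(-1/50).
        Q[(s, a)] = -math.exp(-1.0 / 50.0)
    else:
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        delta = beta * (math.exp(-1.0 / 50.0) * best_next - Q[(s, a)])
        Q[(s, a)] += delta

Provided the Q-values are initialised within [-1, 0], this update keeps them in the interval defined by Eq. 2, so Eq. 3 can be applied at any time to read out the predicted time before failure.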
3 PERCEPTION LEARNING
In reinforcement learning the state space definition is a key factor in achieving good learning times. The state space must be fine enough to distinguish the different situations the robot might encounter, but at the same time it must remain small enough to avoid the curse of dimensionality.
The design of the state space is a delicate task,
and it is dependent on the problem the robot has to
solve. We propose a dynamic creation of the state
space as the robot explores the environment (Fig. 1).
For this task we have chosen to use a Fuzzy ART artificial neural network (Carpenter et al., 1991). This kind of network is able to perform an unsupervised online classification of the input patterns without any previous knowledge.
Of the three parameters involved in the Fuzzy ART algorithm – α, β and ρ, the latter usually called the vigilance parameter – the most important is ρ. α and β are almost independent of the task to be solved, but the value of ρ will influence the number of states created. If ρ is too high, the Fuzzy ART will create too many classes; if it is too low, the state representation will be too coarse and the system will suffer from perceptual aliasing, resulting in longer learning times or even the impossibility of achieving convergence.
Due to space restrictions we cannot provide more details of the Fuzzy ART algorithm here; further information can be found in (Carpenter et al., 1991).
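As a rough illustration of how states can be created online, the following is a generic sketch of the standard Fuzzy ART operations with complement coding (based on Carpenter et al., 1991), not the exact implementation used in this work; the function name and default parameter values are assumptions:

import numpy as np

# One Fuzzy ART presentation: classify input x (features scaled to [0,1])
# and update or create a category. alpha, beta and rho are the choice,
# learning-rate and vigilance parameters mentioned in the text.
def fuzzy_art_step(x, weights, alpha=0.001, beta=1.0, rho=0.9):
    """Return the index of the selected category (a new one if needed)."""
    I = np.concatenate([x, 1.0 - x])           # complement coding
    candidates = list(range(len(weights)))
    while candidates:
        # Choice function T_j = |I ^ w_j| / (alpha + |w_j|)
        scores = [np.minimum(I, weights[j]).sum() / (alpha + weights[j].sum())
                  for j in candidates]
        j = candidates[int(np.argmax(scores))]
        # Vigilance test: |I ^ w_j| / |I| >= rho
        if np.minimum(I, weights[j]).sum() / I.sum() >= rho:
            weights[j] = beta * np.minimum(I, weights[j]) + (1 - beta) * weights[j]
            return j
        candidates.remove(j)                   # reset this category, try the next
    weights.append(I.copy())                   # no category matches: create one
    return len(weights) - 1

# Example: start with no categories; each normalised sensor reading
# either refines an existing state or creates a new one.
weights = []
state = fuzzy_art_step(np.array([0.2, 0.8]), weights, rho=0.9)

With β = 1 (fast learning), a category immediately commits to the intersection of the input and its weights, while ρ controls how many states are finally created, as discussed above.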