LEARNING BY OBSERVATION IN SOFTWARE AGENTS
Paulo Costa and Luis Botelho
Instituto de Telecomunicações/ISCTE-Instituto Universitário de Lisboa, Lisbon, Portugal
Keywords:
Machine learning, Learning algorithms, Learning by observation, Software image, Software agents.
Abstract:
In a society of similar agents, all of them using the same kind of knowledge representation, learning with
others could be achieved through direct transfer of knowledge from experts to apprentices. However, not all
agents use the same kind of representation methods, hence learning by direct communication of knowledge is
not always possible. In such cases, learning by observation might be of key importance. This paper presents
an agent architecture that provides software agents with learning by observation capabilities similar to those
observed in superior mammals. The main contribution of our proposal is to let software agents learn by direct
observation of the actions being performed by expert agents. This is possible because, using the proposed
architecture, agents may see one another.
1 INTRODUCTION
Learning by observation is one of the most powerful
socialisation mechanisms for superior mammals and
also one of the most important means of knowledge
acquisition (Ramachandran, 2003; Bandura, 1977;
Meunier et al., 2007). The capacity to observe and
imitate the movements of others is among the least
common and most complex forms of learning (Moore,
1992). Research in neurology and psychology shows
that learning by observation may well be one of the
causes of the exponential growth of human capabili-
ties in the last centuries (Ramachandran, 2006).
Learning by observation can be classified un-
der the human and superior animals social inter-
action mechanisms. Bandura’s social learning the-
ory emphasizes the advantages of learning by obser-
vation (Bandura, 1977). Dautenhahn (Dautenhahn,
1994) claims that some intellectual capacities evolve
out of the social domain. Animals and humans benefit
from the experiences of others by learning from
what they observe.
This provides motivation for further research on
adapting a similar learning mechanism for software
agent societies. Learning by direct observation of
the expert agent performing its actions (as opposed to
merely relying on observing their effects) can be ad-
vantageous in situations where the effects of those ac-
tions are not directly visible in the environment (e.g.
communication between agents), when the represen-
tation of world states requires too much memory, mak-
ing it impossible to know the effects of all actions
(e.g. social simulations with multiple interactions), in
situations in which the same effects could be achieved
by different alternative actions but using one of them
is clearly better than using others (e.g. using a set of
sums instead of a simpler multiplication), and espe-
cially when the agent does not know the effects of its
actions (e.g. virtual character manipulation).
Learning by observation is also advantageous for
knowledge transmission between agents developed
with different programming languages and also be-
tween software and robot agents. Using learning by
observation it is possible to make robot agents learn
by observing a software-simulated version of themselves.
We propose a software agent architecture pro-
viding learning by observation capabilities similar to
those of superior mammals. This architecture allows
apprentice agents to acquire control rules through the
observation of an expert agent in action. In this paper,
we focus on the learning by observation algorithm
that is part of the proposed architecture. The algo-
rithm depends on the agent software image (Costa and
Botelho, 2011) as a means to identify similar agents
and to observe those agents as they act. The observed
agent software image provides apprentices with the
necessary training sequences for their learning algo-
rithms. It also supplies the necessary means to evalu-
ate the acquired knowledge.
We have tried two learning by observation ap-
proaches. The memory based approach uses the ob-
served data to create sequences of actions that are
applied to specific circumstances. The mirror based
approach uses machine learning algorithms, trained
with data from expert observations, to determine the
correct behaviour for the perceived environment state.
Apprentices take the observed expert behaviours as if
they were their own, enabling an easier identification
of the observed actions.
Since our memory based approach has not been
completed, the experimental results focus on the mir-
ror based approach. These results show that the best
suited algorithms for the mirror based approach are
KStar (classification algorithm) and NNGE (rule as-
sociation algorithm). Results also show that after a
few observations, apprentices are able to learn the cor-
rect behaviours for a large set of environment states.
The results also seem to show that, for certain kinds
of problems, learning by observation is advantageous
over other learning methods.
The next section presents a survey of research on
learning by observation. Section 3 presents a descrip-
tion of the learning by observation algorithm. The
results of initial tests made on the learning algorithm
are presented in section 4. Section 5 presents conclu-
sions and future work.
2 RELATED WORK
Learning is an essential characteristic of intelligent
beings. A computer program is able to learn if its
performance on a set of tasks improves with experi-
ence (Mitchell, 1997). Learning algorithms can be
organized in two types: supervised learning and un-
supervised learning.
Supervised learning creates mappings between in-
puts and outputs, whose correct values are provided
by a supervisor, which is usually a person. Supervi-
sors can interact directly with the learner agent pro-
viding the necessary training sequences for the learn-
ing algorithm or building a reward system that com-
pensates the agent each time it chooses the correct ac-
tion. Unsupervised learning determines how a dataset
is organized. These algorithms try to find regularities
in the input data to extract knowledge from them (Al-
paydin, 2004).
Several authors (Argall et al., 2009; Billard and
Dautenhahn, 1999) define learning by observation as
a subset of supervised learning, where policies are
generated by observing, retaining and replicating the
behaviour executed by an expert. Meunier and his
colleagues (Meunier et al., 2007) proved that the use
of learning by observation in rhesus monkeys im-
proved the learning speed when compared with other
types of learning such as trial and error.
The application of learning by observation tech-
niques in robot and software agents opens the learning
process to experts other than robot or software agents.
It enables the creation of an intuitive communica-
tion medium between humans and computers (Argall
et al., 2009).
Research in learning by observation is usually as-
sociated with robotics. Since software agents are usu-
ally not visible to themselves or to others, they
are mainly restricted to observing the effects of ac-
tions, instead of the actions themselves (Kerstin et al.,
1999). Software agent learning algorithms are often
directed to the use of reinforcement learning techniques,
which in turn are learning methods that rely exclu-
sively on environment observations.
One of the main causes of the lack of learning by
observation of actions in software agents is the ques-
tion of how to represent those actions when agents have
no physical body to make them visible (Kerstin et al.,
1999). To overcome this problem, we have developed
and integrated the software image into our agent ar-
chitecture (Costa and Botelho, 2011). The software
image is a mechanism through which software agents
can observe their bodies and actions.
According to Argall and her colleagues (Argall
et al., 2009) two aspects must be taken into account
when building a learning by observation solution:
gathering and interpreting examples from the expert
agent and deriving a policy from those examples. The
software image takes care of the first aspect; the pro-
posed learning algorithm addresses the second aspect.
Policy derivation can be achieved through several ap-
proaches. Our approach uses classification algorithms
and policy representation as a sequence of behaviours.
Classification algorithms categorize and group
similar perceptions and behaviours. In these cases
the policy is derived by mapping perceptions to be-
haviours. This is best suited when environment states
change at random and behaviours do not depend on
each other (e.g. learning a package routing algorithm).
Representing the policy as a sequence of be-
haviours provides apprentices with the necessary
steps to go from one state to the goal state. It is best
suited for situations where behaviours are correlated
and a chain of events can be determined (e.g. learning
a sort algorithm).
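As a toy illustration of the two policy representations discussed above, the following Python fragment contrasts them; the state and behaviour names are invented for the example and do not come from the paper's test scenarios.

```python
# Illustrative only: two ways of representing a learnt policy.

# 1) Classification-style policy: each perceived state maps directly to a
#    behaviour, and states are handled independently of one another.
routing_policy = {
    ("packet_arrived", "port_busy"): "queue_packet",
    ("packet_arrived", "port_free"): "forward_packet",
}

# 2) Sequence-style policy: an ordered chain of behaviours leading from the
#    current state towards the goal state.
sorting_policy = ["compare_adjacent", "swap_if_out_of_order", "advance_cursor"]
```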
Other approaches for learning by observation take
inspiration from neurological research of brain struc-
tures such as mirror neurons, which allow the ob-
server of an action to feel almost the same as it
would if it had performed the action itself, providing
the means for easy action identification (Ramachan-
dran, 2006; di Pellegrino et al., 1992; Rizzolatti et al.,
1996).
According to Demiris and Hayes (Demiris and
Hayes, 2002), research in human brain and mirror
neuron activation suggests that learning by observa-
tion allows the apprentice to put itself in the expert's
place. To understand the observed behaviour, appren-
tices generate several alternatives before the observa-
tion process is finished, allowing them to improve the
quality of response.
Based on these facts, Demiris and Hayes proposed
a biologically inspired computational model for learn-
ing by observation. They used forward models to
create estimates for specific behaviours regardless of
their source (observed or generated by the appren-
tice). Forward models allow the apprentice to pre-
dict the next state without actually performing the be-
haviour. In a sense, they allow the apprentice to gen-
erate and test a set of possible behaviours without af-
fecting the environment's state (Demiris and Hayes,
2002).
A similar approach is used in our learning algo-
rithm. A mirror mechanism provides agents with the
ability to generate a set of possible behaviours for a
given state, retrieved from observation or from the ap-
prentice's perception. Instead of using estimations and
forward models, the possible behaviours are evalu-
ated, providing the apprentice with quality measures.
Evaluation is an important feature in learning by
observation since it allows the apprentice to know its
ability to accomplish the goal. Evaluation can be as-
sisted or unassisted. Milstein and her colleagues real-
ized that, unlike humans, chimps and gorillas do not
get assisted by their parents when learning new tasks.
They have to resort to trial and error to perfect the
learnt skills (Milstein et al., 2008).
The use of guidance from an expert is a singular
feature in learning by observation amongst humans.
This interaction allows the apprentice to ask for guid-
ance whenever it has doubts on the actions to take.
This provides great advantages over trial and error.
Our learning algorithm allows both assisted and
unassisted evaluation, because assisted evaluation
requires specialized experts, the teachers, which
might not always be available.
3 LEARNING BY OBSERVATION
ALGORITHM
The learning by observation algorithm follows a mod-
ular design, operating in two stages: the learning
stage and the execution stage. The learning stage in-
cludes observation, storage, reflection and evaluation
of behaviours. The execution stage includes the ap-
plication and evaluation of the learnt skills.
Each of the learning algorithm’s modules ad-
dresses a specific task in the learning process. Fig-
ure 1 shows how these modules are arranged and how
information flows at each stage.
Figure 1: The learning modules.
As figure 1 shows, the memory, mirror and evalu-
ation modules participate in both learning and execu-
tion stages. The learning stage begins in the software
vision module with the location and observation of a
similar expert. The execution stage begins in the ap-
prentice’s sensors with the acquisition of the appren-
tice’s perception.
Evaluation represents the last step for both learn-
ing and execution stages. This is an important fea-
ture since it ensures the apprentice’s learning quality.
Evaluation is a continuous process, allowing the ap-
prentice to go back to the learning stage whenever it
is necessary.
The evaluation process influences the apprentice’s
confidence level. Confidence is an internal quality
measure that allows switching between the learning
and execution stages. When confidence is under a cer-
tain threshold the algorithm is on the learning stage,
above that threshold the algorithm is on the execution
stage.
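A minimal sketch of this confidence-driven switch is given below; the threshold and step values are assumptions made for illustration, not parameters reported here.

```python
# Minimal sketch of the confidence-driven stage switch described above.
# The threshold value and update step are illustrative assumptions.

class StageController:
    def __init__(self, threshold=0.7, step=0.05):
        self.confidence = 0.0       # internal quality measure
        self.threshold = threshold  # switch point between the two stages
        self.step = step            # how much one evaluation moves confidence

    def stage(self):
        # Below the threshold the agent only observes (learning stage);
        # above it, the agent is allowed to act (execution stage).
        return "execution" if self.confidence > self.threshold else "learning"

    def reward(self):
        self.confidence = min(1.0, self.confidence + self.step)

    def penalise(self):
        self.confidence = max(0.0, self.confidence - self.step)
```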
3.1 The Learning Stage
The software vision module provides apprentices with
the necessary tools to observe experts. It makes use
of a special property that allows experts and appren-
tices to describe their capabilities and perceptions, the
software image (Costa and Botelho, 2011). The vi-
sion module focuses the apprentices attention on ex-
perts that can be imitated (whose static image is con-
tained in the apprentice’s static image). The expert’s
dynamic image provides apprentices with the expert’s
behaviour and perception as snapshots.
Software vision converts this information into ap-
prentice experiences by replacing all references to ex-
pert structures in the snapshot with references to the
corresponding apprentice's structures. This conver-
sion allows apprentices to use the expert's state de-
scription and behaviour as if they were their own.
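The following sketch illustrates, under assumed snapshot and naming conventions (a dictionary with 'state' and 'behaviour' keys and a simple name map), how such a reference replacement could be carried out.

```python
# Hedged sketch of the reference-replacement step performed by software vision.
# The snapshot layout and the name mapping are illustrative assumptions.

def to_apprentice_experience(snapshot, name_map):
    """Replace expert structure names with the apprentice's equivalents."""
    def rename(value):
        return name_map.get(value, value) if isinstance(value, str) else value

    return {
        "state": {rename(k): rename(v) for k, v in snapshot["state"].items()},
        "behaviour": rename(snapshot["behaviour"]),
    }

# Example: the expert's 'left_gripper' maps onto the apprentice's 'gripper_a'.
expert_snapshot = {"state": {"holding": "left_gripper"}, "behaviour": "release"}
experience = to_apprentice_experience(expert_snapshot, {"left_gripper": "gripper_a"})
```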
Software vision is able to read both current and
historical information on what the agent is doing and
what conditions held just before it performed each
action. Apprentices are able to load a lim-
ited amount of snapshots from earlier times. Each
time an apprentice starts observing a new expert it
reads its historical data and, only after that, it starts
collecting snapshots of on-going behaviour.
Two approaches are used to handle data obtained
from observation of an expert’s software image. The
memory approach focuses on storing and recalling the
observed snapshots according to their place in time.
The mirror approach focuses on making the act of
observing an expert produce the same effects as
preparing to execute those actions.
The memory approach uses the memory mod-
ule to store observed snapshots and provide solutions
for perceived environment states. As is the case
in superior mammals' procedural memory, the mem-
ory module stores each snapshot as a chain of steps.
These are later recalled, by the recall mechanism, and
used as reference paths for moving from one envi-
ronment state to another. Thus the memory module
is able to provide a set of possible behaviours for
the currently perceived state through association with
previously executed behaviours.
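A minimal, hypothetical sketch of such a memory module is shown below; the data layout and the equality-based recall are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch of the memory module: snapshots are kept in observation
# order and recalled by matching the perceived state against stored states.

from collections import deque

class MemoryModule:
    def __init__(self, max_snapshots=1000):
        self.chain = deque(maxlen=max_snapshots)  # ordered chain of steps

    def store(self, state, behaviour):
        self.chain.append((state, behaviour))

    def recall(self, perceived_state):
        # Return the behaviours that were observed for states equal to the
        # perceived one, preserving their order so they can serve as a
        # reference path from one environment state to another.
        return [b for s, b in self.chain if s == perceived_state]
```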
The mirror approach uses the memory module to
store observed snapshots and the mirror module to
provide solutions for perceived environment states.
For the mirror module, like in mirror neurons, the
act of observing an expert produces the same effects
as preparing to execute those actions. The mirror
module uses a collection of machine learning algo-
rithms, such as KStar (classification), ID3 (decision
trees), Naive Bayes (Bayesian networks) and NNGE
(rule association), that are trained with the data stored
in the memory module to build a list of possible be-
haviours from an environment state. Environment
states can come either from agent perception or ex-
pert observation, since both are treated in the same way.
Developers are free to use any of the algorithms but,
as section 4 shows, KStar and NNGE are the best
choices.
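KStar and NNGE come from the Weka toolkit; the sketch below uses a scikit-learn nearest-neighbour classifier purely as a stand-in, to illustrate how a mirror module can rank candidate behaviours for a numerically encoded environment state.

```python
# Stand-in sketch for the mirror module (not the authors' implementation).
# Assumes environment states have been encoded as numeric feature vectors.

from sklearn.neighbors import KNeighborsClassifier

class MirrorModule:
    def __init__(self):
        self.model = KNeighborsClassifier(n_neighbors=3)

    def train(self, snapshots):
        # snapshots: list of (state_vector, behaviour_label) taken from memory.
        states = [s for s, _ in snapshots]
        behaviours = [b for _, b in snapshots]
        self.model.fit(states, behaviours)

    def possible_behaviours(self, state_vector):
        # Rank candidate behaviours by the classifier's probability estimates,
        # highest first; low probabilities signal low decision confidence.
        probs = self.model.predict_proba([state_vector])[0]
        return sorted(zip(self.model.classes_, probs), key=lambda p: -p[1])
```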
The apprentice’s choice between these two ap-
proaches involves the use of a weight mechanism. The
memory and mirror modules are fitted with weight
factors that enhance the apprentice's adaptability to
different learning circumstances. Each time a module
produces a solution that, through evaluation, is proven
to be the appropriate one, the module’s weight is in-
creased.
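A minimal sketch of this weight mechanism, with an assumed increment size, is given below.

```python
# Sketch of the weight mechanism: a module's weight grows each time
# evaluation confirms the solution it produced (increment size is assumed).

module_weights = {"memory": 1.0, "mirror": 1.0}

def reinforce(module_name, increment=0.1):
    module_weights[module_name] += increment

# Example: evaluation has confirmed a mirror-module solution.
reinforce("mirror")
best = max(module_weights, key=module_weights.get)  # module preferred at execution time
```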
In the learning stage, evaluation happens each
time the apprentice makes an observation. For each
observed snapshot, the apprentice produces a solution
for the environment state described in the snapshot.
This solution is compared, in the evaluation module,
with the behaviour provided by the snapshot to deter-
mine if the apprentice is making the correct choices.
Apprentice confidence increases whenever the ap-
prentice’s solution matches the behaviour provided by
the snapshot. When that is not the case, the appren-
tice’s confidence decreases.
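In code, this evaluation step can be sketched as follows; the confidence step size is an assumption.

```python
# Learning-stage evaluation: the apprentice's own solution for the observed
# state is checked against the behaviour the expert actually performed.

def evaluate_observation(solution, observed_behaviour, confidence, step=0.05):
    # Confidence moves up on a match and down on a mismatch.
    if solution == observed_behaviour:
        return min(1.0, confidence + step)
    return max(0.0, confidence - step)
```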
3.2 The Execution Stage
In the execution stage the apprentice’s perception is
used as input for the mirror module and the recall
mechanism. Each of these modules produces a solu-
tion according to its underlying approach (mirror or
memory). The execution module picks the
most fitted solution from the module with the highest
weight value. If the solution has a positive evaluation
the execution module proceeds with the execution.
The execution module is only active when the con-
fidence level is above a certain threshold. Below that
threshold the agent is unable to perform any action; it
can only observe. Execution makes the
necessary arrangements to call the actions required
for the solution. Throughout execution, a special
mechanism collects information on the problems and
achievements that might be encountered. This mech-
anism is responsible for providing the feedback to the
evaluation module.
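The following sketch combines these steps under assumed interfaces; the perform helper and the per-module evaluation flags are hypothetical placeholders, not part of the described architecture.

```python
# Illustrative execution step: pick the solution from the module with the
# highest weight, require a positive evaluation, then act and gather feedback.

def perform(solution):
    # Placeholder for calling the actions required by the solution and
    # recording the problems or achievements encountered along the way.
    print(f"executing: {solution}")
    return {"problems": [], "achievements": []}

def execute_step(solutions, weights, evaluations, confidence, threshold=0.7):
    # solutions / weights / evaluations are keyed by module name ("memory", "mirror").
    if confidence <= threshold:
        return None                          # below the threshold the agent only observes
    best_module = max(weights, key=weights.get)
    solution = solutions[best_module]
    if evaluations.get(best_module, 0) > 0:  # positive evaluation required to act
        feedback = perform(solution)
        return best_module, solution, feedback
    return None
```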
Evaluation in the execution stage follows two di-
rections depending on the existence of a special kind
of expert, the teacher. If teachers are available, the ap-
prentice is able to evaluate directly the provided so-
lution through teacher appraisal. The apprentice asks
the teacher if its solution is correct and if the teacher
answers positively the apprentice’s confidence in-
creases. If the answer is negative, the apprentice's con-
fidence decreases and the solution is not executed.
When no teachers are available, evaluation only
produces an outcome after execution. Whenever a
problem is found when executing the solution or the
apprentice realizes that it has made a step back (e.g.
needs to re-achieve a sub-goal), evaluation decreases
the apprentice’s confidence level. If, by any chance,
the execution provides some type of reward, for ex-
ample the achievement of a sub-goal, evaluation in-
creases the apprentice's confidence. In all other cases,
confidence is left unchanged, since it is not possible
to determine whether the solution was appropriate.
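Both evaluation regimes can be sketched as follows; ask_teacher and the reward/problem flags are assumed interfaces used only for illustration.

```python
# Execution-stage evaluation under the two regimes described above.

def assisted_evaluation(solution, ask_teacher, confidence, step=0.05):
    # Assisted: the teacher is queried before the solution is executed.
    if ask_teacher(solution):
        return min(1.0, confidence + step), True    # execute the solution
    return max(0.0, confidence - step), False       # do not execute it

def unassisted_evaluation(problem_found, reward_obtained, confidence, step=0.05):
    # Unassisted: the outcome is only known after execution.
    if problem_found:
        return max(0.0, confidence - step)
    if reward_obtained:
        return min(1.0, confidence + step)
    return confidence   # otherwise confidence is left unchanged
```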
4 EXPERIMENTAL RESULTS
In this section, we describe an experiment on the mir-
ror approach that tests the capabilities of several ma-
chine learning algorithms in learning by observation
environments. These algorithms are used by the ap-
prentice’s mirror module to provide solutions for the
perceived environment states. With this experiment
we intend to see what types of algorithms are best
suited for the mirror approach and how agent com-
plexity affects the algorithm’s performance.
A test framework was developed to allow the cre-
ation of small test scenarios involving a single expert
and a single apprentice agent. In this framework, the
environment state is described by four variables with
a limited set of possible values: two numerical, one
string and one enumerated. Expert agents react to con-
trolled changes in the environment state, executing
specific behaviours when specific combinations of the
four environment variables are presented.
Table 1 presents the agent features in two of the
tested scenarios. They allow us to see how the num-
ber of rules influences the learning efficiency. The
complexity of agent behaviours rises with the num-
ber of rules used to describe them. When the number
of rules increases, an increase in the time required to
observe an expert is expected, since learning by ob-
servation works efficiently when apprentices observe
a wide range of possibilities.
Table 1: Agent features for the test scenarios.
Scenario   Parts   Sensors   Actions   Rules
1          1       2         4         4
2          1       3         5         10
Both scenarios present a total of 100 apprentice
observations (meaning the expert is presented with
100 different environment states). For each observa-
tion the apprentice stores the observed snapshot and
calculates the proper behaviours for a set of ten dif-
ferent environment states. Apprentice behaviours are
calculated by a specific algorithm in the mirror mod-
ule. The algorithm makes use of the previously ob-
served snapshots as training sequences.
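A hedged reconstruction of this test loop is sketched below; the expert and apprentice objects and their methods are assumptions used only to make the procedure concrete.

```python
# Reconstruction of the test loop: 100 expert observations, and after each one
# the apprentice is asked to solve ten probe states drawn by the framework.

def run_experiment(expert, apprentice, observations=100, probes_per_step=10):
    error_rates = []
    for _ in range(observations):
        state, behaviour = expert.next_snapshot()     # one observed snapshot
        apprentice.store(state, behaviour)            # kept as training data
        apprentice.retrain()                          # mirror module re-trained

        errors = 0
        for probe_state, correct_behaviour in expert.probe_states(probes_per_step):
            if apprentice.solve(probe_state) != correct_behaviour:
                errors += 1
        error_rates.append(errors / probes_per_step)  # accuracy = 1 - error rate
    return error_rates
```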
Apprentice behaviours are compared with the cor-
rect responses to determine how the number of ob-
servations influences the apprentice’s accuracy rate.
The accuracy rate determines the number of correct
behaviours of the ten performed. Figure 2 presents
the apprentice’s error rate for the best two algo-
rithms, KStar and NNGE, in both scenarios. Al-
though Bayesian networks are widely used for this
kind of learning solution, the tested algorithm (Naive
Bayes) presented a lower performance on the tested
scenarios.
Figure 2: Apprentice error rate for the tested scenarios.
Figure 2 shows that using fewer rules allows ap-
prentices to achieve high accuracy rates with a smaller
number of observations. As another important find-
ing, on both scenarios, the accuracy rate stabilizes af-
ter achieving a certain value, which is lower in the
second scenario. For 100 observations, it is impossi-
ble to achieve full accuracy. Even though the appren-
tice is able to perform correctly in most of the cases,
under certain circumstances incorrect behaviours are
still being performed. Further research on this mat-
ter revealed that under these circumstances, the algo-
rithm’s confidence (the probability values of possible
behaviours) is quite low, meaning the algorithm’s de-
cision capacity is prone to errors.
One possible reason for the lack of complete ac-
curacy after 100 observations may be the small range
of observed behaviours. Some of the environment
states presented to the apprentice may require the ex-
ecution of yet unobserved behaviours. In such cases
the correct behaviour is not included in the algo-
rithm's list of possibilities.
Figure 2 also shows that KStar performs a little
better than NNGE but, as we increase the number
of observations, the accuracy rate of KStar de-
creases slightly. This decline may be due to the fact
that the KStar algorithm uses the similarities among
the train sequences to calculate the list of possible be-
haviours. When faced with sets of similar environ-
ment states that give rise to different behaviours, the
initial observations may lead to mistakes that are fixed
with subsequent observations.
Although these tests exercised only a small part
of the learning algorithm, we can say that the
apprentice is able to learn almost all the expert’s be-
haviours in a small amount of time. When compared
with other learning techniques that need long training
periods such as reinforcement learning, results show
that learning by direct observation of agent actions al-
lows software agents to achieve results faster.
Further tests proved that, in particular cases, ob-
serving agent actions is the only way to understand
expert behaviour. If behaviours produce no effects in
the environment (affecting only the agent and its inter-
nal state), relying on the changes in the environment
makes learning ineffective. Testing our learning ap-
proach in such a scenario produced similar results to
those presented in figure 2, where all behaviours pro-
duce visible effects in the environment. This proves
the apprentice was able to learn the expert’s behaviour
even though there were no visible effects. These re-
sults show us that learning by observation can be of
use to software agents.
5 CONCLUSIONS AND FUTURE
WORK
As shown by the experimental results in section 4,
the proposed learning algorithm allows agents to learn
new skills within a small number of observations. As
opposed to reinforcement learning, apprentice agents
do not need to test all possibilities to determine the
correct behaviour for a given situation. Apprentices
are able to see the expert's behaviours, reducing the num-
ber of iterations necessary for the initial learning
stage.
After completing the memory approach, further
tests are needed to determine how it interacts with
learning and how it influences the algorithm’s per-
formance. The learning algorithm also needs to be
compared with other learning approaches in scenar-
ios where observation of expert actions is of key im-
portance. This will enable us to see how favourable
learning by observation is when compared with other
learning approaches.
ACKNOWLEDGEMENTS
This paper reports PhD research work, for the Doc-
toral Program on Information Science and Tech-
nology of ISCTE-Instituto Universitário de Lisboa.
It is partially supported by Fundação para a Ciên-
cia e a Tecnologia through the PhD Grant number
SFRH/BD/44779/2008 and the Associated Labora-
tory number 12 - Instituto de Telecomunicações.
REFERENCES
Alpaydin, E. (2004). Introduction to machine learning.
MIT Press, Cambridge Mass.
Argall, B. D., Chernova, S., Veloso, M., and Browning, B.
(2009). A survey of robot learning from demonstra-
tion. Robotics and Autonomous Systems, 57(5):469–483.
Bandura, A. (1977). Social learning theory. Prentice Hall.
Billard, A. and Dautenhahn, K. (1999). Experiments in
learning by imitation - grounding and use of commu-
nication in robotic agents.
Costa, P. and Botelho, L. (2011). Software image for
learning by observation. Proceedings of 15th Por-
tuguese Conference on Artificial Intelligence (EPIA
2011), pages 872–884.
Dautenhahn, K. (1994). Trying to imitate - a step towards
releasing robots from social isolation. Proceedings of
From perception to action conference, pages 290–301.
Demiris, J. and Hayes, G. (2002). Imitation as a dual-route
process featuring predictive and learning components:
a biologically plausible computational model. In
Dautenhahn, K. and Nehaniv, C., editors, Imitation
in Animals and Artifacts, pages 327–361. MIT Press,
Cambridge.
di Pellegrino, G., Fadiga, L., Fogassi, L., Gallese, V.,
and Rizzolatti, G. (1992). Understanding motor
events: a neurophysiological study. Experimen-
tal brain research. Experimentelle Hirnforschung.
Expérimentation cérébrale, 91(1):176–180.
Kerstin, T. Q., Dautenhahn, K., and Nehaniv, C. L. (1999).
The essence of embodiment: A framework for under-
standing and exploiting structural coupling between
system and environment.
Meunier, M., Monfardini, E., and Boussaoud, D. (2007).
Learning by observation in rhesus monkeys. Neuro-
biol Learn Mem, 88(2):243–8.
Milstein, M., Linick, S., Lonsdorf, E., and Ross, S. (2008).
A comparison between chimpanzee (Pan troglodytes)
and gorilla (Gorilla gorilla gorilla) social tolerance at
an artificial termite mound. American Journal of
Primatology.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill,
New York.
Moore, B. (1992). Avian movement imitation and a new
form of mimicry: tracing the evolution of a complex
form of learning. Behaviour, 122:231–263.
Ramachandran, V. S. (2003). The emerging mind: the Reith
Lectures 2003. Profile Books.
Ramachandran, V. S. (2006). Mirror neurons and imitation
learning as the driving force behind ”the great leap
forward” in human evolution. Edge Foundation.
Rizzolatti, G., Fadiga, L., Gallese, V., and Fogassi, L.
(1996). Premotor cortex and the recognition of mo-
tor actions. Brain research. Cognitive brain research,
3(2):131–141.