REACTIVE LAYER IN AGI AGENT
Implementation of Adaptive Reactive Behavior and Beyond
Vilem Benes
Computer Science Department, University of West Bohemia, Univerzitní 8, Pilsen, Czech Republic
Keywords: AI, AGI, Agent architecture, Reactive agent, Situation discrimination.
Abstract: Basic mechanisms of cognition at work in an AGI agent are presented. I argue that reactive behavior is the baseline of intelligence: it is the base working component, and it can be further extended to produce more intelligent agents. The mechanisms employed at the reactive level enable the agent to develop behavior that both explores and exploits the environment with the purpose of receiving the highest possible reward. Three fundamental mechanisms are intertwined: action selection, action value estimation and situation discrimination. The whole process of adaptation is completely unsupervised and depends only on the reward received from the environment. Some technical details of the implementation of these mechanisms (the BAGIB agent) are described, together with implications for other planned parts of an "AGI-compliant" architecture. Several challenges encountered in AGI that are not present in the usual narrow, domain-limited approach to AI are also discussed.
1 INTRODUCTION
In this section I briefly introduce the notions of intelligence, artificial intelligence and reinforcement learning used throughout the paper, and outline the design of my agent called BAGIB. The following sections give a more detailed description of the BAGIB implementation. The final part of the article discusses BAGIB performance and reactive behavior in the AGI realm.
1.1 AI
Intelligence is naturally (and also usefully for us) defined as the ability to reach goals (receive high reward (Benes, 2004)) in a wide range of environments (Legg and Hutter, 2007).
Artificial general intelligence (AGI) is a relatively new term for the endeavor of creating intelligent artefacts. It emphasizes the fact that the phenomenon called intelligence is general. Until now, most AI solutions have been "narrow", limited to some given domain. Designing AI without the aim of generality often encourages the use of domain-dependent tricks and knowledge (see (Goertzel, 2008)) and leads to solutions that are not robust, not adaptive, and ultimately dumb.
The reactive layer we are dealing with here is characterized by not having any inner state: all actions of the agent depend directly on the observation. An adaptive reactive component is fundamental for an intelligent agent, because it is tightly bound to the agent's body (its actuators and sensors) and is sufficient for a considerable degree of intelligence.
The underlying framework I use is that of Reinforcement Learning (RL) (Sutton and Barto, 1998). It is compatible with the notion of situatedness (embodiment; the structural coupling of agent and environment).
1.2 BAGIB Agent
In this article, we are going to explore the inner workings of an AGI agent called BAGIB (Brain of Artificial General Intelligence agent, by Benes). As an AGI agent, BAGIB is meant to be able to cope with any environment. The BAGIB agent adapts: it changes its responses and inner structure according to the reward received. The agent derives and maintains reward estimates (Q values) for primitive actions. Moreover, these values are conditioned on perceived situations, which are themselves adaptively defined.
This allows more specific and, hopefully, better reactions. Discrimination between situations (see also associative search in (Sutton and Barto, 1998)) is used and refined at the same time. The agent adapts and increases the complexity of its inner structures "on the fly" in order to exploit the environment more.
If the situation does not clearly solicit a single response, or if the response does not produce a satisfactory result, the agent is led to further refine its discriminations, which, in turn, solicit more refined responses (the interpretation of Merleau-Ponty's work in (Dreyfus, 2007) fits exactly as a description here).
To be more specific, BAGIB features the adaptation and creation of situation discrimination by a variant of an iteratively built regression tree that takes the observation either directly or preprocessed by the detection of clusters of frequent data points. Creating these structures (clusters, regression tree) can be viewed as creating a kind of pre-processing mechanism that transforms sensory data and then outputs reward estimates for the different primitive actions. The current version of the BAGIB agent maintains no inner state for use in succeeding steps (not counting stored data that are used for adaptation of the reactive layer); therefore this agent is said to be reactive.
The current version of BAGIB is limited to the use of primitive actions. The agent tries to find the best perception of circumstances in the world (i.e., the best situation discrimination), so as to reach the highest possible reward by performing only the (best) primitive action in each perceived situation.
Inner representations (symbols) are grounded in experience, in the structures derived from data acquired through interaction with the environment. This should be the correct way to deal with the symbol grounding problem. Furthermore, the effects of actions on reward are continually stored. Thanks to observing many different combinations, the agent is to some extent able to infer which actions led to which effects.
The BAGIB agent gives us insight into more complicated things. First, we can see the origins of symbols in situation discrimination. BAGIB also presents a simple mechanism for approaching the credit assignment problem.
Second, BAGIB is an example of the general principle of reusing the same mechanisms at different levels of the inner hierarchy: the selection mechanism is the same for primitive actions, for regression tree condition candidates (features) and for whole behaviors. The whole described reactive behavior could be taken as a structural component and reused inside the brain of the agent, after the definition of inner actions and sensors, to achieve metacognition. This may enable self-monitoring, self-reflection, control of inner mechanisms and also the gathering of needed environment-independent knowledge.
Third, we can think more about AGI theory using the BAGIB example: we see what a reactive architecture like this is capable of and we can guess which extensions to this model may result in higher intelligence. These extensions may include the specialization of behaviors; keeping and using inner states (manipulating inner structures is quite similar in principle to interacting with the outer environment; see the A-Brain B-Brain C-Brain idea in (Minsky, 1981)); using a model of causal relations for expectation and planning; experimenting; preparing and conducting more complicated actions; and others.
In the following sections, all fundamental parts of the reactive component of the BAGIB AGI agent are described: action selection, value estimation (reward assignment) and situation discrimination. Additionally, more specific features and implementation details of this agent are presented.
2 MECHANISMS
2.1 Selection
An AGI agent deals with the exploitation vs. exploration problem. Primitive actions that were identified as good should be repeated (exploitation) so that the agent can collect reward, but actions that seemed bad at first sight should also be tried from time to time (exploration), to get the chance to reveal better actions, behaviors or situations and to avoid being stuck in local optima. ("Behavior" is for us an action-selection policy, possibly depending on perceived situations, or its realization: a sequence of actions that were already performed.) The agent can never be certain that the whole environment was properly explored and that the reward estimates are correct. This is because the environment can be non-stationary (i.e., changing with time), or there can always be regions in the state space of the environment that were never reached. Usually, intricate structures that imply complex behavior need to be developed before the agent can reach (and "maintain") highly rewarded state-space regions of the environment.
The BAGIB agent uses an ε-greedy strategy: the best action is selected with probability 0.99, and with probability 0.01 one of the remaining actions is selected at random. This strategy aims at revealing the one action that is best for a given situation. Unfortunately, such ideal conditions are often not reached. Usually, several actions seem equally likely to be the best, or several actions alternate as best in one situation, or their combination into a sequence is needed. This leads to non-trivial dynamics in selection.
The BAGIB agent uses the same ε-greedy selection mechanism not only for primitive action selection, but also for choosing sub-tree candidates during their evaluation in situation discrimination, and the same mechanism is used for whole behaviors as well.
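As an illustration, a minimal sketch of such a generic ε-greedy selector might look as follows in Python (the function and its signature are illustrative assumptions, not BAGIB source code; only the 0.99/0.01 split is taken from the text):

import random

def select_epsilon_greedy(q_values, epsilon=0.01):
    # Return the index of the highest-valued item; with probability
    # epsilon (0.01 in BAGIB) return one of the remaining items instead.
    # The list may hold Q values of primitive actions, of sub-tree
    # candidates or of whole behaviors: the selector does not care.
    best = max(range(len(q_values)), key=lambda i: q_values[i])
    if len(q_values) > 1 and random.random() < epsilon:
        others = [i for i in range(len(q_values)) if i != best]
        return random.choice(others)
    return best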
2.2 Value Estimation
The BAGIB agent uses one general principle for evaluating all primitive actions. It is a variation of the method known in RL as Q(λ) learning, with an incremental update rule, a step-size parameter and eligibility traces. By this method, the agent assigns values to actions: the reward taken as input is distributed to the actions that are supposed to have influenced that particular reward. A value is also assigned to adaptively created states (see situation discrimination in the next subsection).
The basic form of the Q-learning update rule is

Q_{k+1} = Q_k + α [r_{k+1} - Q_k]     (1)

where r_{k+1} is the reward received in the current step, Q_k is the Q value from the previous step and α is the step-size parameter.
If the step-size parameter is set small enough, convergence to the correct values is assured for stationary environments. Often, a stationary environment seems non-stationary to the agent because the agent has not yet reached the correct situation discrimination, and some environments really are non-stationary. We trade the ability to drift with the Q values (while we are still learning, or in the case of a non-stationary environment) against the ability to converge to the correct values (for a stationary environment). The former is preferred in BAGIB, through the use of a fixed (i.e., non-decreasing) step-size parameter. However, adaptive control of the step-size parameters may prove to be useful.
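For concreteness, a minimal sketch of update rule (1) with a fixed step-size parameter could look like this (the function name, the α value and the example rewards are only illustrative assumptions):

def update_q(q, reward, alpha=0.001):
    # Incremental update rule (1): move the estimate a fraction alpha
    # toward the newly observed reward. A fixed (non-decreasing) alpha,
    # as preferred in BAGIB, lets the estimate keep drifting under
    # non-stationary conditions instead of converging once and for all.
    return q + alpha * (reward - q)

# Hypothetical usage:
q = 0.0
for reward in [1.0, 0.0, 1.0, 1.0]:
    q = update_q(q, reward)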
Besides estimating the values of primitive actions in given states, BAGIB uses Q-learning also for candidates for state-space bisections (i.e., sub-tree candidates, see the next section) and for whole behaviors. This means that the same observed reward is used to evaluate several things at once, things standing at different places in the inner structures. Selection based on value estimation is at the heart of intelligence: it ensures adaptation and robust behavior.
At the present time, BAGIB uses an ad hoc solution for assigning reward: a circular buffer is maintained that stores the last 100 observations (reward + sensors) together with the actions taken. Pointers to all other kinds of evaluated elements (sub-trees, behaviors) are stored as well. In correspondence with the eligibility traces of the Q(λ) method, a factor of influence is derived (using a discount factor) for every element stored in the circular buffer, and a share of the newly received reward (together with the difference of expected value between the initial and the end situation) is assigned to the influencing elements.
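The following sketch illustrates this kind of buffer-based credit assignment under stated assumptions: the class, its attribute names, the discount of 0.9 and the step size are mine, elements are assumed to expose a mutable Q value, and the expected-value difference between initial and end situation is omitted for brevity.

from collections import deque

class CreditBuffer:
    # Circular buffer of recent steps; a newly received reward is shared
    # among the elements recorded in those steps with exponentially
    # decaying influence, in the spirit of eligibility traces in Q(lambda).

    def __init__(self, size=100, discount=0.9, alpha=0.001):
        self.steps = deque(maxlen=size)   # last 'size' steps only
        self.discount = discount          # decay of influence per step back
        self.alpha = alpha                # step-size parameter of rule (1)

    def record(self, elements):
        # 'elements' are the action, sub-tree candidate and behavior used
        # in this step; each is assumed to have a mutable 'q' attribute.
        self.steps.append(elements)

    def assign(self, reward):
        influence = 1.0
        for elements in reversed(self.steps):    # most recent step first
            for element in elements:
                element.q += self.alpha * (influence * reward - element.q)
            influence *= self.discount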
2.2.1 Double Q Values
In BAGIB I investigate the use of two Q values for one primitive action. The purpose is to react better to changing conditions: the agent maintains a long-term policy, but also reacts quickly to sudden large changes. In detail, each Q value is calculated as the sum of two components, Q_long and Q_short. Both are calculated in the same way, but using different step-size parameters in formula (1), e.g. α_long = 0.00001 and α_short = 0.001.
The agent is able to use long-term averaged Q values, but it also gets the ability to avoid an action that is generally good but has recently been bad. The quick reaction to recent changes provided by the Q_short fraction of Q allows larger complexity to be built. Besides the random selections of the ε-greedy strategy, this gives another way to stop being stuck while maintaining long-term structure in the Q values. This is especially useful when the agent has only limited options for recognizing situations in the environment (e.g. before a detailed situation discrimination has been developed).
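A minimal sketch of this double estimate follows (the class and its interface are illustrative assumptions; the step sizes are the example values given above):

class DoubleQ:
    # Two estimates of the same action value, updated from the same
    # reward with different fixed step sizes; their sum is what the
    # epsilon-greedy selector uses.

    def __init__(self, alpha_long=0.00001, alpha_short=0.001):
        self.q_long = 0.0
        self.q_short = 0.0
        self.alpha_long = alpha_long
        self.alpha_short = alpha_short

    def update(self, reward):
        # Both components use update rule (1); only the step size differs.
        self.q_long += self.alpha_long * (reward - self.q_long)
        self.q_short += self.alpha_short * (reward - self.q_short)

    @property
    def q(self):
        return self.q_long + self.q_short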
2.2.2 Advanced Value Estimation
Exact estimation of the value of actions (and also of components of the inner structure of the agent) is limited by many factors. The effect of using each of them can also be only temporary, or can reach far into the future. Accurate assessment of the value of many different elements by the described naïve method may require more steps than are available (we possibly have many combinations of evaluated elements). I feel that improving value estimation will require using more advanced (higher-level) methods in combination with the present naïve approach. For example, rules that tie the effects of actions to their value may
improve reward assignment. These rules (possibly conditioned by situation and derived from simpler observed cases) may be used in distributing reward. For example: we learned before that pushing into a wall has no positive influence, so if we detect a positive change in reward while doing so, it probably has to be caused by something else. The generalization provided by rules may significantly decrease the complexity of the reward assignment.
2.3 Situation Discrimination
Because the state space (observation space) of many environments is either huge or infinite, we need some sort of abstraction. That means some kind of observation-space partitioning mechanism that gives us only a limited number of useful states, which are then used in learning the corresponding behaviors (action policies). Too few states means that we are far from the Markov property (if the state, i.e. the perceived situation, has the Markov property, the influence of the next action depends only on this state and not on the history of previous states; being close to the Markov property also implies a better chance of convergence of the state value, the expected reward in this state); too many states means that learning is too slow.
A strategy of continual incremental observation-space partitioning was adopted (the agent needs to be able to learn rapidly, to exploit sufficiently as soon as possible, but it also needs to keep improving on larger time scales); the partitioning is done by growing a regression tree. Each leaf node of the incrementally induced tree defines one situation. In each of them, policy learning takes place and the best further branching is searched for. Situation is used here as a general term; it corresponds to the "state" term prevalently used in reinforcement learning theory. The regression tree holds the Q values of all primitive actions for each of the situations defined by the tree branching. The mechanism used generally allows learning good overall behaviors and then evolving them into better, more specialized solutions "on the fly".
At the beginning, the regression tree consists of only one leaf node. The single state represented by this node encompasses the whole observation space; using Q-learning in this state, the agent learns the best overall behavior. Then the regression tree is extended to enable better overall behavior by splitting the observation space into two states (situations).
In more detail, this regression tree induction process (see Table 1) runs as follows. The agent performs Q-learning for the first leaf node (the root node). During this, observations are stored. When the number of stored cases reaches a threshold, candidate sub-trees are created. These "leaf trees" are held in a candidate list and consist of a decision node with a condition and two leaf nodes with primitive action lists. The decision node of the candidate sub-tree defines the (next) observation-space bisection.
Table 1: Situation discrimination done by the iterative regression tree induction algorithm. 'Store observation' saves all sensory data in the given step, possibly conditioned (by step count, reward, etc.). 'Select' stands for applying ε-greedy selection in the given list. 'Tree extension criterion' is based on the difference of reward obtained by using the leaf node and by using the candidate sub-tree, and also on the candidate use count. 'Alternate' switches between the leaf node and the sub-tree candidate list with a period of 1000 steps.

wait for observation
reach leaf node (descend tree)
if candidate sub-tree list empty then
    store observation
    select and output action
else
    alternate leaf node and candidate list
    if using reached leaf node then
        select and output action
    else if using candidate list then
        select sub-tree from candidate list
        evaluate sub-tree condition
        descend to sub-tree leaf
        select and output action
        check if chosen sub-tree candidate meets tree extension criterion
        if candidate good enough then
            replace reached leaf node with candidate sub-tree
            copy rest of candidate list to both sub-tree leaf nodes
repeat forever
In the following steps, evaluation takes place. The agent alternates between using the original leaf node and sub-trees from the candidate list. When using the candidate list, the ε-greedy strategy is used to select a candidate from the list, its condition (decision node) is evaluated according to the current observation, and the respective leaf node's primitive action list is used to select (ε-greedy) the output action. If a candidate sub-tree meets the tree-extension criterion (it was used enough times and brings higher reward than the parent leaf node), it replaces the original parent leaf node in the regression tree. After this replacement, the regression tree consists of one decision node and two leaf nodes. The agent recognizes two states (situations) and develops two action selection policies, one for each of them. Distinct situations, given by
their position in the decision tree, form the basis of symbol grounding (assigning symbols to perceived phenomena): a symbol represents an important situation.
The process described above is repeated and the situation discrimination becomes finer. The agent searches for the best way to discriminate situations and, at the same time, for the best action policy for each of them.
The creation and usage of the regression tree described here in fact represents the development of a behavior. The BAGIB agent does not employ a single behavior; there is a list with many of them. As stated before, the selection and value estimation mechanisms are used for the list of whole behaviors in the same way as for the lists of primitive actions. This is a simple attempt to address problems with changing conditions and environments.
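To make these structures concrete, here is a minimal Python sketch of how the regression tree, its leaves and the descent to the current situation might be represented (all class and field names are my assumptions, not the BAGIB implementation):

class Leaf:
    # One situation: Q values for the primitive actions, plus the
    # candidate sub-trees currently under evaluation (growth logic and
    # stored observations are omitted).
    def __init__(self, n_actions):
        self.q = [0.0] * n_actions
        self.candidates = []

class DecisionNode:
    # Bisects the observation space with a condition; 'condition' is any
    # callable observation -> bool, e.g. a cluster feature (Section 2.4).
    def __init__(self, condition, if_true, if_false):
        self.condition = condition
        self.if_true = if_true
        self.if_false = if_false

def reach_leaf(node, observation):
    # Descend the tree to the leaf matching the current observation,
    # as in the first step of Table 1.
    while isinstance(node, DecisionNode):
        node = node.if_true if node.condition(observation) else node.if_false
    return node

A behavior then corresponds to one such tree together with its Q values, and the list of behaviors is handled by the same ε-greedy selection and value estimation as the primitive actions.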
2.4 Feature Definition
The mechanism used to create the observation-space bisections, feature definition, was presented in (Benes, 2005). A situation is generally defined as a combination of defined features. A feature is taken here as an abstraction created from data cases: a circumstance that can either be present or not.
Features used as conditions are currently defined as high-density clusters in the observations (of various dimensionality). Currently, all observations are sent to the feature definition procedure.
A more guided approach may prove useful: sending only interesting observations to the feature definition procedure (e.g. observations that were followed by high or low reward or by another interesting effect).
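A minimal sketch of how such a cluster feature could serve as a decision-node condition is given below; representing a cluster as a centre and a radius over a subset of sensor dimensions is my assumption and only one of many possibilities.

import math

class ClusterFeature:
    # A high-density cluster in a subset of the observation dimensions.
    # The feature is 'present' when the current observation falls inside
    # the cluster, so the test can be used as a regression tree condition.

    def __init__(self, dims, centre, radius):
        self.dims = dims        # indices of the observation dimensions used
        self.centre = centre    # cluster centre in those dimensions
        self.radius = radius    # distance threshold

    def __call__(self, observation):
        dist = math.sqrt(sum((observation[d] - c) ** 2
                             for d, c in zip(self.dims, self.centre)))
        return dist <= self.radius

Defining __call__ lets such a feature be used directly as the 'condition' callable of the decision node sketched in Section 2.3.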
2.5 Problems
Finding the optimal observation-space partitioning and the corresponding action policies for different environments is very hard, for several reasons. The agent needs to be able to cope with any number of dimensions and all possible ranges of values. The agent also needs to be able to cope with changing conditions.
How we perceive the world also depends strongly on what we do in the world: as the agent learns more elaborate ways to act or to perceive things, it needs to adjust its situation discrimination (and the policies for the situations), the regression tree in our case. Often, at the beginning, the agent is unable to reach all important regions of the environment state space because of its initially simple behavior.
Failure in exploration may lead to the development of lowly rewarded behavior. If the adapting agent misses general rules at the beginning of learning, adaptation may be significantly slowed down, or it may fail completely. If the regression tree is grown too fast, bisections of little usefulness may be adopted at the beginning of adaptation. On the other hand, slow situation discrimination and policy learning also lead to sub-optimal performance.
The list of competing behaviors enables the BAGIB agent to create one behavior, reveal important things while performing it (observation-space regions, i.e. defined features, and potentially also relations between situations and actions, i.e. a model), and then switch to a newly developed behavior that uses the acquired knowledge and is better.
Competition between similar structures and mechanisms seems to be a fundamental principle of intelligent mind operation. In BAGIB, primitive actions compete with each other, and the same holds for features (candidates for observation-space bisections) and for whole regression trees (i.e., behaviors).
Further research should tell whether it is more beneficial to create more complex structures and mechanisms for the reactive component, or whether it will be better to create and use more advanced higher-level methods that control the reactive parts.
3 AGENT PERFORMANCE
At the present time, BAGIB is being evaluated in six different environments: the classic RL environments Mountain car and Pole balancing, three variations of the Capture region environment, and Q2 (Quake 2) deathmatch. The Capture region environment, developed by the author, consists of a 2-D rectangular space divided into roughly 100 smaller regions that can be captured by moving agents. Variations of this environment have different numbers of actions and sensors. Reward is directly derived from the fraction of regions captured by the given agent. The environment was designed to investigate reward assignment under non-friendly conditions (another, human-created agent is maximizing the same reward function) and basic situation discrimination.
In the simple environments (Capture region), BAGIB is able to learn better behavior than hardwired agents tuned for best performance. In the classic RL environments it is comparable to other general RL learning agents. In the Pole balancing environment (Barto et al., 1983), BAGIB was tested against a learning neural-network agent tailor-made for this environment: learning
took approximately 10 times longer, and 20 percent of the runs were successful. BAGIB learned quickly to keep the pole up, but failures occurred due to moving out of the allowed area. Either improving the reactive layer or adding other, non-reactive structural components (model, planning) may help to achieve better results.
In the Quake 2 deathmatch environment, there are no other RL agents available for comparison yet. In comparison with existing complex hard-wired solutions (bots) or experienced human players, BAGIB performs poorly. (How poorly also depends on how complex the given "primitive" actions and sensors are: if the primitive actions available to the agent are in fact quite sophisticated in the environment, BAGIB may perform fairly well, especially if these primitive actions are already well-designed pieces of behavior, so that the learning agent is merely "tuning" their use in different situations.) This environment nevertheless provides many useful insights. One example: in Q2, the agent receives a list of perceived entities as part of the observation. In environments closer to the real world, the agent would be required to build such a list from only the partial information received. If our agent learns to use the entity list in Q2, we understand better how this agent would process its inner representations in more "difficult" environments.
As preliminary results show, BAGIB learns more slowly and somewhat less successfully than an adaptive solution custom-tailored and tuned for a given environment, but it is fully comparable to other general RL agents. BAGIB shows the potential to outperform human-coded agents, which are often fragile and receive great penalties in situations unforeseen by the designer. The BAGIB agent trades performance in particular single tasks for generality. Unlike most other agents developed, it is able to cope with many different environments; this is what we aim at in designing AGI agents.
4 CONCLUSIONS
The development of the BAGIB agent has been a success so far: the described reactive part of the BAGIB architecture is general and able to perform reasonably in the tested scenarios. It helps us understand what an adaptive reactive layer is capable of and what will be needed in additional higher-level mechanisms and layers (planning, memory, model, attention, etc.).
REFERENCES
Barto, A. G., Sutton, R. S. and Anderson, C. W., 1983. Neuronlike elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, 13: 835-846.

Benes, V., 2004. Intelligence means to be rewarded, Proceedings of the 5th International PhD Workshop on Systems and Control: A Young Generation Viewpoint. Balatonfüred, Hungary. (online: http://home.zcu.cz/~shodan/).

Benes, V., 2005. Defining Situations - Fundamental Level of Perception, Proceedings of the 6th International PhD Workshop on Systems and Control: Young Generation Viewpoint. Ljubljana: Jozef Stefan Institute. (online: http://home.zcu.cz/~shodan/).

Dreyfus, H. L., 2007. Why Heideggerian AI Failed and How Fixing it Would Require Making it More Heideggerian, Philosophical Psychology, Volume 20, Issue 2 (April).

Goertzel, B., 2008. AI and AGI: Past, Present and Future, AGI-08 Conference. University of Memphis. (online: http://www.acceleratingfuture.com/people-blog/?p=1814).

Legg, S. and Hutter, M., 2007. A collection of definitions of intelligence, Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms. NL: IOS Press. (online: http://www.vetta.org/definitions-of-intelligence/).

Minsky, M., 1981. Jokes and the Cognitive Unconscious, In Cognitive Constraints on Communication, Vaina and Hintikka (eds.). Reidel. (online: http://web.media.mit.edu/~minsky/papers/jokes.cognitive.txt).

Sutton, R. S. and Barto, A. G., 1998. Reinforcement Learning: An Introduction, MIT Press. (online: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html).