tions mainly implicit learning takes place and proce-
dural knowledge is acquired. The declarative knowl-
edge is formed afterwards. This indicates that the
bottom-up direction plays an important role. It is also
advantageous to continually verbalize to a certain ex-
tent what one has just learned and so speed up the
acquisition of declarative knowledge and thereby the
whole learning process (see figure 2).
3.2 A Self-learning System
The system I will now briefly introduce is intended for learning adequate behavior based on simple features it perceives in the environment. We combine two very different approaches from opposite ends of the scale of machine learning techniques: low-level learning is realized by reinforcement learning (RL), more specifically Q(λ)-learning (Sutton and Barto, 1998), while high-level learning is realized by techniques of belief revision (BR) (Spohn, 2009). In figure 3 the system's functionality is illustrated. Technical details are given
in (Leopold et al., 2008a). By the combination of RL
and BR techniques the system is able to adjust much
faster and more thoroughly to the environment and to
improve its learning capabilities considerably as com-
pared to a pure RL approach. In the following I will
address the realization of the single postulated quali-
ties in our system.
Hierarchy (Q1) is obviously implemented as ex-
plained before.
Emergence on All Levels (Q2) is given as well.
For the lower level it is given inherently by employ-
ing RL. But also on the higher level a world model
(in the form of if-then rules) emerges as the genera-
tion of rules is driven by the numerical representation
(in the form of values of state-action pairs) that arises
on the lower level. Also from the BR point of view
alone, this construction is interesting. One drawback
of BR techniques consists in the fact that it is often
difficult to decide which parts of existing rules (i.e.,
which parts of logical conjunctions (see Q4)) should
be given up when a new belief comes in, in such a way
that no inherent contradictions are introduced. Here, the rewards obtained by the RL component can be regarded as measures of the correctness of parts of the rules learned so far.
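As a rough illustration of this bottom-up emergence, the following Python sketch shows one way candidate rules could be derived from the values of state-action pairs; the Q-table layout, the dominance margin, and all identifiers are assumptions made for this example rather than the actual implementation of (Leopold et al., 2008a).

  from collections import defaultdict

  # Hypothetical numerical level: values of state-action pairs, where a state
  # is a tuple of (feature, value) literals such as (("A", "x"), ("B", "y")).
  Q = defaultdict(float)   # Q[(state, action)] -> learned value

  def extract_candidate_rule(state, actions, margin=0.5):
      # Propose an if-then rule for `state` only if one action clearly
      # dominates the others according to the learned values.
      values = {a: Q[(state, a)] for a in actions}
      best = max(values, key=values.get)
      others = [v for a, v in values.items() if a != best]
      if not others or values[best] - max(others) >= margin:
          # Return the rule as a conjunction: (A = x) ∧ (B = y) ∧ (Action = best)
          return tuple(state) + (("Action", best),)
      return None

  # Example: in state A=x, B=y the action "z" stands out, so a rule emerges.
  state = (("A", "x"), ("B", "y"))
  Q[(state, "z")], Q[(state, "w")] = 2.0, 0.4
  print(extract_candidate_rule(state, ["z", "w"]))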
Multi-directional Transfer (Q3) is given by
stages 3 and 7 of a learning step (see figure 3). At
stage 3 (top-down guidance) the system uses current
beliefs to restrict the search space of actions for the
low-level process. At stage 7 (bottom-up emergence)
feedback to an action in the form of a reward is used
to acquire specific knowledge from the most recent
experience by which the current symbolic knowledge
is revised. While the implementation of the top-down
guidance in our system is straightforward, the probably more important bottom-up process is the most delicate part of our architecture. The ultimate revision of the ranking function by new information is indeed realized using standard techniques of BR. The challenge, however, consists in formalizing the new information (here: which are the best actions in a given state from an RL point of view) in such a way that it can be utilized by BR techniques.
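A minimal sketch of the top-down guidance of stage 3, under the same illustrative rule representation as above (conjunctions of (feature, value) literals, see also Q4 below), might look as follows; the fallback to the unrestricted action set is an assumption made for this example.

  def restrict_actions(state, all_actions, rules):
      # Stage 3 (top-down guidance): let the current symbolic beliefs
      # narrow the set of actions offered to the low-level RL process.
      # Each rule is a conjunction such as (("A", "x"), ("B", "y"), ("Action", "z")).
      state_literals = set(state)
      allowed = set()
      for rule in rules:
          *condition, (_, action) = rule
          if set(condition) <= state_literals:   # the rule's condition matches the state
              allowed.add(action)
      # If no belief applies, fall back to unrestricted exploration.
      return sorted(allowed) if allowed else list(all_actions)

  rules = {(("A", "x"), ("B", "y"), ("Action", "z"))}
  print(restrict_actions((("A", "x"), ("B", "y"), ("C", "u")), ["w", "z"], rules))   # ['z']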
As revisions of the symbolic knowledge have a strong influence on the choice of future actions, they have to be handled carefully, i.e., the system should be quite sure about the correctness of a new rule before adding it to its beliefs. For this reason we chose
a probabilistic approach to assess the plausibility of
a new rule. We use several counters that record, for instance, how often an action has been a best action
in a specific state, before the symbolic representation
is adapted. Thus, stage 7 is not necessarily carried
out in each step of the learning process but only af-
ter enough evidence for a revision has been obtained
from the lower learning level.
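The gating of stage 7 could, for instance, be organized along the following lines; the counters, the thresholds, and the revise callback are illustrative placeholders, not the parameters of our system.

  from collections import Counter

  visit_counts = Counter()        # state -> number of visits
  best_action_counts = Counter()  # (state, action) -> how often the action was a best action

  def maybe_revise(state, best_action, beliefs, revise, min_visits=10, min_ratio=0.8):
      # Stage 7 (bottom-up emergence): pass new information to the
      # belief-revision component only once the lower level has gathered
      # enough evidence that `best_action` really is a best action in `state`.
      visit_counts[state] += 1
      best_action_counts[(state, best_action)] += 1
      ratio = best_action_counts[(state, best_action)] / visit_counts[state]
      if visit_counts[state] >= min_visits and ratio >= min_ratio:
          revise(beliefs, state, best_action)   # standard BR revision step (placeholder)

In this sketch, raising min_visits or min_ratio corresponds to demanding more evidence from the lower level before a revision is triggered.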
Generalization from Few Examples (Q4) is fa-
cilitated in our system by the introduction of the
BR component. In general, the possibilities to gen-
eralize from learned knowledge to unfamiliar situa-
tions are more diverse with BR than with RL techniques. In our approach rules take the form of conjunctions of several multi-valued literals. For example, a rule such as "If A = x and B = y then perform action z" would be represented by the conjunction (A = x) ∧ (B = y) ∧ (Action = z). This allows the definition of similarities between conjunctions, e.g.,
simply by counting how many literals with the same
values they share. Revisions of existing rules can
then be based on the similarities between conjunc-
tions. Thus generalization can easily occur by revis-
ing similar rules in one single learning step. In Q4 I
claimed that generalizations should be possible from
few examples only. In principle, it is possible in our
approach that a single experience (in the form of a reward
given for an action) could cause the introduction of a
new rule into the symbolic world model. In practice,
the necessary number of supporting examples can be
adjusted by tuning the relevant parameters of stage 7.
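The literal-counting similarity mentioned above can be stated very compactly; the sketch below reuses the (feature, value) representation from the previous examples and shows only one of several conceivable definitions.

  def similarity(rule_a, rule_b):
      # Number of (feature, value) literals that the two conjunctions share.
      return len(set(rule_a) & set(rule_b))

  r1 = (("A", "x"), ("B", "y"), ("Action", "z"))
  r2 = (("A", "x"), ("B", "v"), ("Action", "z"))
  assert similarity(r1, r2) == 2   # similar rules can then be revised together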
We address the topic of generalization in the context of BR in more detail in (Häming and Peters, 2010) and (Häming and Peters, 2011b). In (Häming and Peters, 2011c) we propose a method to exploit similarities in symbolic descriptions, especially in the case of high-dimensional spaces.
The quality of Exploration (Q5) in our approach
is implemented by the RL component, with a larger