4.3 Distinguishing Exemplars of the Same State
The main purpose of the internal representation is to obtain dynamically created states, so that each exemplar of the same original state is represented as a new state. In the L-table a new state is generated whenever a storage collision occurs while the table is being filled. For example, for the action east of state «9» two destination states are possible: «1» and «13». As soon as this collision appears, a new state is created and included in the table; in our case state «16» becomes a descendant of state «9». Moreover, the new state «16» inherits all the transitions of its parent.
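The collision handling described above can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation: the class name, the id scheme for new states, and the table layout are all assumptions made for the example.

```python
# Hypothetical sketch of L-table collision handling. The table maps
# (state, action) -> destination state; on a collision a descendant
# state is created that inherits the parent's transitions.

class LTable:
    def __init__(self):
        self.table = {}      # (state, action) -> destination state
        self.parent = {}     # new internal state -> the state it descends from
        self.next_id = 16    # ids for dynamically created states (assumption)

    def record(self, s, a, s_next):
        """Store a transition; on a collision, create a descendant state."""
        key = (s, a)
        if key not in self.table or self.table[key] == s_next:
            self.table[key] = s_next
            return s                      # no collision
        # Collision: (s, a) already leads elsewhere, so s is ambiguous.
        s_new = self.next_id
        self.next_id += 1
        self.parent[s_new] = s            # s_new is a descendant of s
        # The new state inherits all transitions recorded for s ...
        for (st, ac), dst in list(self.table.items()):
            if st == s:
                self.table[(s_new, ac)] = dst
        # ... except the conflicting one, which it takes over.
        self.table[(s_new, a)] = s_next
        return s_new
```

For the example from the text: after recording (9, east) → 1, recording (9, east) → 13 creates a new state (numbered 16 here by assumption) that descends from «9» and stores the conflicting transition.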
At the same time a problem arises: recognizing which internal state corresponds to an observed external state. The problem is solved by comparing the transition history over the last step with the L-table. If the L-table is empty or only partially filled, methods of random or directed selection are applied.
The algorithm under consideration is described below. It is based on the original Sarsa(λ) algorithm [Sutton]. The presentation is kept close to the original; bold type highlights the modifications made and the new elements introduced.
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s,a
Repeat (for each episode):
    Initialize s, a, s_i
    Repeat (for each step of episode):
        Take action a, observe r, s'
        s'_i = GetInternal(s_i, a, s')
        IF isTransitionFickle then
            Expand tables L, Q, e
            s'_i is new one
        Choose a' from s'_i using ε-greedy
        δ ← r + γ Q(s'_i, a') − Q(s_i, a)
        e(s_i, a) ← e(s_i, a) + 1
        For all s_i, a:
            Q(s_i, a) ← Q(s_i, a) + α δ e(s_i, a)
            e(s_i, a) ← γ λ e(s_i, a)
        s_i ← s'_i; a ← a'
        Update table L
    until s_i is terminal
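To make the update arithmetic of the listing concrete, here is a minimal Python sketch of one step of the modified update, assuming internal states are plain hashable ids and Q and e are dictionaries; GetInternal, the table expansion, and the environment are outside its scope. All names and constants are illustrative.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, LAMBDA, EPS = 0.1, 0.9, 0.8, 0.1       # illustrative constants
ACTIONS = ['north', 'east', 'south', 'west']

Q = defaultdict(float)   # Q[(s_i, a)] -- action values over internal states
E = defaultdict(float)   # e[(s_i, a)] -- eligibility traces

def choose(s_i):
    """ε-greedy action selection from internal state s_i."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s_i, a)])

def sarsa_lambda_step(s_i, a, r, s_next_i, a_next):
    """One update of the listing above: δ, trace bump, sweep over traces."""
    delta = r + GAMMA * Q[(s_next_i, a_next)] - Q[(s_i, a)]
    E[(s_i, a)] += 1.0                    # accumulating trace
    for key in list(E):                   # "for all s_i, a" with a trace
        Q[key] += ALPHA * delta * E[key]
        E[key] *= GAMMA * LAMBDA
    return s_next_i, a_next               # s_i ← s'_i; a ← a'
```

The sweep touches only pairs with a nonzero trace, which is equivalent to the "for all s_i, a" loop since untouched pairs have e = 0.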
The denotation of the variables is also kept original; only s_i and s'_i are the internal mappings of the corresponding states. As can be seen, the changes concern the mechanism of internal representation and the adaptation of the Q-table. The key procedure GetInternal() returns the internal state according to table L:
Input parameters: s_i, a, s'
Create the list of descendant states of s'
Search for matching transition entries
Possible cases:
    No entries: return the external state
    One entry: return it
    Several entries: apply methods of random or directed search
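A minimal sketch of this case analysis, under the assumption that the L-table can be queried as a set of (source, action, destination) triples and that a mapping `descendants` lists the internal copies of each external state; these structures and names are hypothetical, and the "several entries" branch here uses only random selection.

```python
import random

def get_internal(L, descendants, s_i, a, s_ext):
    """Map the observed external state s_ext to an internal state, using the
    last transition (s_i, a) against the L-table entries."""
    # Candidates: the external state itself plus its internal descendants.
    candidates = [s_ext] + descendants.get(s_ext, [])
    # Keep candidates whose recorded transition matches the step just taken.
    matches = [c for c in candidates if (s_i, a, c) in L]
    if not matches:
        return s_ext                 # no entries: fall back to external state
    if len(matches) == 1:
        return matches[0]            # one entry: unambiguous
    return random.choice(matches)    # several entries: random selection
```

For instance, with L = {(5, 'north', 9), (7, 'south', 16)} and descendants = {9: [16]}, arriving at external «9» after the transition (7, south) resolves to internal «16».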
In the case of several entries an incorrect internal state might be returned. This situation is similar to the trial of actions in reinforcement learning: the incorrect returns will disappear over time. Applying an algorithm like the bucket brigade might be helpful in this case.
5 CONCLUSIONS
The proposed modified Sarsa(λ) algorithm implements the idea of an internal representation of the environment. The modified algorithm is able to recognize ambiguous states. Nevertheless, it lacks recurrent mechanisms and therefore cannot cope with difficult mazes like Maze5, where the sequences of transitions are similar. Its success on simple mazes like Woods101, Maze7 and MazeT demonstrates the agent's ability to build an internal representation of the environment and use it in reinforcement learning in place of the original algorithm.
An interesting direction for further research is to upgrade the algorithm so that it can cope with complicated environments. Future research will also address the formalisation and generalisation of the algorithm discussed.
REFERENCES
Butz, M.V., Goldberg, D.E., Lanzi, P.L., 2005. Gradient Descent Methods in Learning Classifier Systems: Improving XCS Performance in Multistep Problems. IEEE Transactions on Evolutionary Computation, Vol. 9, Issue 5.
Sutton, R.S., Barto, A.G., 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Russell, S., Norvig, P., 2003. Artificial Intelligence: A Modern Approach, 2nd ed. Prentice Hall, New Jersey.
Padgham, L., Winikoff, M., 2004. Developing Intelligent Agent Systems: A Practical Guide. John Wiley & Sons.
Kwee, I., Hutter, M., Schmidhuber, J., 2001. Market-Based Reinforcement Learning in Partially Observable Worlds.
Lin, L.-J., 1993. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, Pittsburgh, CMU-CS-93-103.
APPLYING Q-LEARNING TO NON-MARKOVIAN ENVIRONMENTS