on the value of states to come but, unlike in Q-learning, such values, as well as the
values of immediate rewards, are defined as their probability of occurrence (of the
values themselves, not of the states). Moreover, generalization follows directly:
agents do not need to have previously experienced a state in order to value it. As
long as it shares elements with an ID or with a previously experienced state, it
inherits a value.
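To make this concrete, the following minimal sketch (in Python) shows how a never-visited state can inherit a value from elements it shares with previously experienced states. The names and the simple averaging rule are our own assumptions for illustration, not the exact IDQ formula.

# Hypothetical sketch: a compound state is a set of elements, and a novel
# state inherits value from whatever elements it shares with known states.
def state_value(state_elements, element_values):
    """Value of a compound state, here taken as the mean of its known elements."""
    known = [element_values[e] for e in state_elements if e in element_values]
    return sum(known) / len(known) if known else 0.0

element_values = {"X": 0.7, "A": 0.4}           # learned from earlier experience
print(state_value({"X", "B"}, element_values))  # 0.7, inherited via shared element X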
3 EXPERIMENTS
In the next sub-sections we show experimentally
how IDQ-learning converges to an optimal policy
and how it generalizes in a traditional Grid-world
domain.
The following parameters were set for the IDQ agent: σ = 0.2 (the sensitivity of the
similarity function), α = 0.1 (learning rate of CS), ε = 0.8 (choose the greedy
response in 80% of the cases), and γ = 0.9 (reward discounting).
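For reference, these settings can be collected into a small configuration object; the sketch below (field names are our own, not from the paper) simply records the values listed above.

from dataclasses import dataclass

@dataclass
class IDQConfig:
    sigma: float = 0.2    # sensitivity of the similarity function
    alpha: float = 0.1    # learning rate of CS
    epsilon: float = 0.8  # greedy response chosen in 80% of the cases
    gamma: float = 0.9    # reward discounting

config = IDQConfig()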
3.1 Convergence
Convergence is measured by recording the average absolute fluctuation (AAF) of the
expectance memory per episode. This is done by accumulating the absolute differences
diff_abs between the values of the expectance memory and their values after an update
has been carried out. Additionally, the number of steps to reach the goal,
episode_length, is recorded. The AAF is thus given by diff_abs / episode_length. For
IDQ there can be several updates of the expectance memory. This is because states are
treated as compounds, where each element of the compound enters into separate
associations from the others. It is therefore necessary to record the number of
updates per episode step, numerrors_episodestep, as well. The average absolute
fluctuation per episode for IDQ is then given by
(diff_abs / numerrors_episodestep) / episode_length.
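A minimal sketch of this bookkeeping is given below; the function and variable names are assumptions, and the expectance-memory update itself is left out.

def episode_aaf(update_diffs, num_updates_per_step, episode_length):
    """Average absolute fluctuation of the expectance memory over one episode.

    update_diffs: absolute differences |value_after - value_before| accumulated
        over all expectance-memory updates in the episode
    num_updates_per_step: number of updates of the expectance memory per episode step
    episode_length: number of steps taken to reach the goal
    """
    diff_abs = sum(update_diffs)
    return (diff_abs / num_updates_per_step) / episode_length

With a single update per step this reduces to the plain AAF, diff_abs / episode_length.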
The optimal policy is the shortest path from any
spatial location to the goal. Table 1 shows a
summary of results for convergence experiments,
where MPC stands for Maximum Policy Cost for
learning period and OPFE stands for Optimal Policy Found after n Epochs.
Table 1: Convergence: MPC and OPFE per Grid type.

Grid     MPC   OPFE
3×3       13     15
5×5       71     13
10×10    626    843
As expected, the algorithm converges to the optimal policy for each grid size.
3.2 Generalization
Generalization in the Grid-world refers to how the treatment of states as compounds
can help transfer learning between similar situations. When an element X at two
different locations signals the same outcome, there is said to be a sharing of
associations between the two locations. In Figure 1, element X enters into an
association with the outcome state G. This association is strengthened at two
locations. Additionally, elements A and B each enter into a separate association with
G. In this situation element X is said to be generalized from location (2,3) to
location (3,2) and vice versa. There is thus a sharing of element X between the
states at location (2,3) and at location (3,2). An algorithm which manages to gain a
savings effect from this type of sharing is said to be able to generalize. If the
algorithm successfully exploits the redundant association, this generalization and
sharing effect should manifest as faster convergence to the optimal policy.
Figure 1: Generalization in the Grid world by means of
sharing of elements across states.
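The sharing illustrated in Figure 1 can be sketched as follows: associations are keyed by element rather than by location, so strengthening X→G at location (2,3) also strengthens the association used at (3,2). The data structures and the fixed increment below are assumptions made for illustration, not the exact IDQ update.

associations = {}  # (element, outcome) -> association strength

def strengthen(elements, outcome, amount=0.1):
    # Each element of the compound state enters its own association with the outcome.
    for e in elements:
        key = (e, outcome)
        associations[key] = associations.get(key, 0.0) + amount

strengthen({"A", "X"}, "G")      # experience at location (2,3), compound {A, X}
strengthen({"B", "X"}, "G")      # experience at location (3,2), compound {B, X}
print(associations[("X", "G")])  # 0.2: X -> G strengthened at both locations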
Generalization experiments were designed in two
phases (see Figure 2). In Phase 1 the agent is trained
with an initial Grid-world layout, and in Phase 2 this
initial Grid-world layout is changed. The G in the
lower right corner of each Grid-world layout
represents the goal state and is the same for all
layouts, for all experiments and phases. Phase 1 has
the same elements as Phase 2 for all locations,
unless otherwise stated. All experiments aim to test whether Phase 2 converges faster
to the optimal policy through generalization from Phase 1. Two groups are employed
for each experiment. In Group 1, generalization from Phase 1 to Phase 2 is expected,
because the environment change leaves one aspect of the layout intact while changing
another. Group 2, on the other hand,