which is an MDP with known constant delay. We describe a CDMDP as a 5-tuple ⟨S, A, P, R, k⟩, where k is a non-negative integer representing the delay. We are also interested in the situation in which the delay is unknown to the agent, although its maximum value is known. We define an unknown constant delayed MDP (UCDMDP) as a 5-tuple ⟨S, A, P, R, k_max⟩, where k_max is a non-negative integer that bounds the delay. The actual delay k is not given to the agent, but is fixed and satisfies 0 ≤ k ≤ k_max.
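For concreteness, the two tuples can be written down directly as data containers. The sketch below is our own illustration; the field names are not part of the formal definition.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Illustrative containers for the two tuples. P maps (s, a) to a distribution
# over next states and R maps (s, a) to an immediate reward; field names are ours.

@dataclass
class CDMDP:
    states: Sequence      # S
    actions: Sequence     # A
    P: Callable           # P(s, a) -> dict {s': probability}
    R: Callable           # R(s, a) -> float
    k: int                # known constant delay, k >= 0

@dataclass
class UCDMDP:
    states: Sequence
    actions: Sequence
    P: Callable
    R: Callable
    k_max: int            # known upper bound; the true delay k with
                          # 0 <= k <= k_max is hidden from the agent
```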
3 dSARSA(λ)_k: ALGORITHM FOR KNOWN DELAY
Q-learning and Sarsa are popular on-line algorithms that directly estimate the Q-function Q(s, a), which measures the quality of a state-action combination. In order to accelerate convergence, eligibility traces are often combined with Q-learning and Sarsa (see, e.g., (Sutton and Barto, 1998)); the resulting algorithms are called Q(λ) and Sarsa(λ), respectively.
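As a reminder of the building block extended below, the following is a minimal sketch (our own illustration) of one tabular Sarsa(λ) update with replacing traces; the tables Q and e are assumed to be pre-initialized, e.g., as defaultdict(float).

```python
# Minimal sketch of one tabular Sarsa(lambda) update with replacing traces.
# Q and e are dicts keyed by (state, action); alpha, gamma, lam are scalars.
def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next, alpha, gamma, lam):
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    e[(s, a)] = 1.0                       # replacing trace for the visited pair
    for key in e:
        Q[key] += alpha * delta * e[key]  # credit all recently visited pairs
        e[key] *= gamma * lam             # decay every trace
```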
Several approaches using Q(λ) and Sarsa(λ) are possible for tackling the delay k. As shown in (Katsikopoulos and Engelbrecht, 2003), if the state space of the MDP is expanded with the actions taken in the past k steps, a CDMDP is reducible to the regular MDP ⟨S × A^k, A, P, R⟩. This implies that ordinary reinforcement learning techniques are applicable for small k. However, if k is large, the state space grows exponentially, so the learning time and memory requirements become impractical. If we treat ⟨S × A^k, A, P, R⟩ as if it were ⟨S, A, P, R⟩, the problem belongs to the class of Partially Observable MDPs (POMDPs). In (Loch and Singh, 1998), it was shown that Sarsa(λ) performs very well for POMDPs.
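To make the reduction concrete, the augmented state can be realized as the pair of the observed state and the last k actions; the helper below is our sketch under that reading, not code from the cited work.

```python
from collections import deque

# Sketch of the state-augmentation reduction: the agent's effective state is
# (observed state, last k actions), so the augmented space is S x A^k.
class AugmentedState:
    def __init__(self, k, initial_state, initial_actions):
        assert len(initial_actions) == k
        self.obs = initial_state
        self.recent = deque(initial_actions, maxlen=k)

    def key(self):
        # Hashable key usable as an index into a tabular Q-function.
        return (self.obs, tuple(self.recent))

    def step(self, new_obs, action_taken):
        # Shift the action window and record the newly observed (delayed) state.
        self.obs = new_obs
        self.recent.append(action_taken)
```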
In (Schuitema et al., 2010), the update rule of Q(s, a) was refined by taking the delay k explicitly into account:
Q(s_n, a_{n−k}) ← Q(s_n, a_{n−k}) + α · δ_n,

where α is the learning rate,

δ_n = r_{n+1} + γ · max_{a′∈A} Q(s_{n+1}, a′) − Q(s_n, a_{n−k})   for Q-learning,
δ_n = r_{n+1} + γ · Q(s_{n+1}, a_{n−k+1}) − Q(s_n, a_{n−k})        for Sarsa,
and γ is the discount factor. The resulting algorithms, called dQ, dSARSA, dQ(λ), and dSARSA(λ), were experimentally verified to perform well for a known and constant delay. Among them, dSARSA(λ) was reported to be the most important one. Unfortunately, however, it seems to us that little attention was paid to selecting the next action based on the currently observed state: the delay k was not explicitly used for prediction. As we will show in Section 5, if we explicitly predict a sequence of k states by taking the delay into account, the convergence of learning can be accelerated further.
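The two temporal-difference errors above translate directly into tabular updates. The following sketch is our own illustration (index names such as a_nmk, meaning a_{n−k}, are ours), not the authors' implementation.

```python
# Sketch of the delayed TD errors of dQ and dSARSA, tabular case.
# s_n, s_np1: states observed at steps n and n+1; a_nmk, a_nmk1: actions taken
# at steps n-k and n-k+1; r_np1: reward received at step n+1.

def delta_dq(Q, actions, s_n, a_nmk, r_np1, s_np1, gamma):
    # dQ: bootstrap with the greedy action in the newly observed state.
    best = max(Q[(s_np1, a)] for a in actions)
    return r_np1 + gamma * best - Q[(s_n, a_nmk)]

def delta_dsarsa(Q, s_n, a_nmk, r_np1, s_np1, a_nmk1, gamma):
    # dSARSA: bootstrap with the action actually taken at step n-k+1.
    return r_np1 + gamma * Q[(s_np1, a_nmk1)] - Q[(s_n, a_nmk)]

def delayed_update(Q, s_n, a_nmk, delta, alpha):
    # Q(s_n, a_{n-k}) <- Q(s_n, a_{n-k}) + alpha * delta
    Q[(s_n, a_nmk)] += alpha * delta
```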
We now describe our algorithm dSARSA(λ)_k in Algorithm 1. Its update rules for Q(s, a) and e(s, a) are based on dSARSA(λ). Note that if k = 0, our algorithm dSARSA(λ)_k becomes equivalent to standard Sarsa(λ) using replacing traces with the option of clearing the traces of non-selected actions (Sutton and Barto, 1998). Moreover, if Q(ŝ_{n+k}, a) on line 23 is changed to Q(s_n, a), then it is almost equivalent¹ to dSARSA(λ).
The essential improvement of the algorithm lies in lines 21–23. When the algorithm chooses the next action a ∈ A, it refers to Q(ŝ_{n+k}, a) instead of Q(s_n, a), where ŝ_{n+k} is the state predicted after k steps of "simulation" starting from the state s_n. In this simulation, we repeatedly choose the most likely next state at each step. We remark that the same idea has already appeared in the Model Based Simulation algorithm (Walsh et al., 2007).
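To illustrate the point, the action-selection step might look as follows; the ε-greedy choice and the function name are our assumptions for this sketch, since Algorithm 1 itself is not reproduced here.

```python
import random

# Sketch: choose the next action from the predicted state s_hat_{n+k}
# rather than from the delayed observation s_n. Epsilon-greedy is assumed
# purely for illustration.
def select_action(Q, actions, s_hat, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s_hat, a)])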
We implement it as follows. The procedure Memorize(s, a, s′) accumulates the number of occurrences of (s, a, s′), i.e., the experience that taking action a in state s yields state s′. Using these counts, we can simply estimate the probability that the next state becomes s′ when taking action a in state s as

P̂(s′ | s, a) = (the number of occurrences of (s, a, s′)) / (∑_{s″∈S} the number of occurrences of (s, a, s″)).
Then the next state ŝ from state s when taking action a is predicted by the maximum likelihood principle:

ŝ = argmax_{s′∈S} P̂(s′ | s, a).
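A straightforward count-based realization of Memorize and the maximum-likelihood prediction could look as follows; falling back to the current state when (s, a) has never been observed is our own assumption.

```python
from collections import defaultdict

# Count-based maximum-likelihood transition model.
# counts[(s, a)][s'] stores how often the triple (s, a, s') has occurred.
counts = defaultdict(lambda: defaultdict(int))

def memorize(s, a, s_next):
    counts[(s, a)][s_next] += 1

def p_hat(s_next, s, a):
    # Estimated probability P_hat(s' | s, a); zero if (s, a) was never seen.
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total > 0 else 0.0

def predict_next(s, a):
    # Maximum-likelihood next state: argmax over observed successors.
    successors = counts[(s, a)]
    return max(successors, key=successors.get) if successors else s
```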
The procedure Predict(s_n, {a_{n−k}, . . . , a_{n−1}}) returns a predicted state ŝ_{n+k} after k steps starting from s_n, obtained by evaluating the recursive formula

ŝ_{n+(i+1)} = argmax_{s′∈S} P̂(s′ | ŝ_{n+i}, a_{n−(k−i)})   for i = 0, . . . , k − 1,

with ŝ_n = s_n.
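Reusing the predict_next helper from the previous sketch, the k-step prediction can be rolled out as follows; the pending actions are assumed to be passed in chronological order, a_{n−k} first.

```python
# Sketch of Predict(s_n, {a_{n-k}, ..., a_{n-1}}): roll the one-step
# maximum-likelihood prediction forward k times, starting from s_hat_n = s_n.
def predict(s_n, pending_actions):
    s_hat = s_n
    for a in pending_actions:           # a_{n-k}, ..., a_{n-1}, in order
        s_hat = predict_next(s_hat, a)  # s_hat_{n+i+1} = argmax P_hat(. | s_hat_{n+i}, a)
    return s_hat
```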
4 dSARSA(λ)_X: ALGORITHM FOR UNKNOWN DELAY
In the previous section, we assumed that the delay was known to the learner. This section considers the case
¹ A subtle difference is the update rule of e_k(s, a) in lines 13–17, although we do not regard it as essential.