which refines line 5 of Algorithm 3. There, we assume that the trajectory has already been computed with a fixed time step ∆t = ∆τ, that is, at times t_j = j∆t, 0 ≤ j ≤ N + 1.
Algorithm 4 TD(λ): Update traces
1: for j = 0 to N do
2:   δ_j ← c(x_j, u_j) + γ V^i(x_{j+1}) − V^i(x_j)
3: end for
4: e(t_N) ← δ_N
5: for k = 1 to N do
6:   e(t_{N−k}) ← δ_{N−k} + (λγ) e(t_{N−k+1})
7: end for
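Assuming the notation above (the stage cost c, discount γ, trace parameter λ, and the current iterate V^i sampled along the trajectory), the two loops of Algorithm 4 can be sketched in Python as follows; the function name and array layout are illustrative, not part of the paper.

```python
import numpy as np

def td_lambda_traces(costs, values, gamma, lam):
    """Sketch of Algorithm 4: compute TD errors, then accumulate
    them backward into the traces e(t_j).

    costs[j]  ~ c(x_j, u_j)  for j = 0..N
    values[j] ~ V^i(x_j)     for j = 0..N+1
    Returns (delta, e), where e[j] plays the role of e(t_j).
    """
    N = len(costs) - 1
    delta = np.empty(N + 1)
    for j in range(N + 1):                        # lines 1-3
        delta[j] = costs[j] + gamma * values[j + 1] - values[j]
    e = np.empty(N + 1)
    e[N] = delta[N]                               # line 4
    for k in range(1, N + 1):                     # lines 5-7 (backward pass)
        e[N - k] = delta[N - k] + lam * gamma * e[N - k + 1]
    return delta, e
```

The backward pass makes the O(N) complexity visible: each trace e(t_j) is obtained from e(t_{j+1}) in constant time, so e(t_0) ends up holding the (λγ)-discounted sum of all local errors along the trajectory.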
Note that the computational complexity of the trace update is in O(N), where N + 2 is the number of computed points in the trajectory x. Thus the overall complexity of the algorithm depends only on the number of trajectories to be computed in order to obtain a good approximation of V*.
3.2.3 Qualitative interpretation of TD(λ)
In the previous sections, we started by presenting TD(λ) as an algorithm that computes a new estimate of the VF using a trajectory and earlier estimates. This allowed us to provide a continuous formulation of this algorithm and an intuition of why it should converge towards the true VF. Then, using a fixed time step and numerical approximations for implementation purposes, we derived equation (16) and Algorithm 4. These provide another intuition of how this algorithm behaves.
The TD error δ_j (17) can be seen as a local error at x(t_j) (in fact, it is a first-order approximation of the Hamiltonian H(x(t_j))). Thus, (16) means that the local error at x(t_j) affects the global error estimated by TD(λ) at x(t_0) with an attenuation factor equal to (γλ)^j.
The value of λ ranges from 0 to 1 (in the continuous formulation, s_λ ranges from 0 to ∞). When λ = 0, only local errors are considered, as in value iteration algorithms. When λ = 1, errors along the trajectory are fully propagated back to x_0. Intermediate values of λ are known to provide better results than these extreme values, but how to choose the best value of λ remains an open question in the general case. Our intuition, backed by experiments, is that higher values of λ often produce larger updates, resulting in faster convergence, at least at the beginning of the process, but also often return a coarser approximation of the VF, when they do not simply diverge. On the other hand, smaller values of λ result in slower convergence, but toward a more precise approximation. In the next section, we use this intuition to design a variant of TD(λ) that combines the qualities of high and low values of λ.
3.3 TD(∅)
3.3.1 Idea
The new TD algorithm that we propose is based on the intuition about local and global updates presented in the previous section. Global updates are those performed by TD(1), whereas local updates are those used by TD(0). The idea is that global updates should only be used when they are "relevant"; in other cases, local updates should be performed. To decide whether the global update is "relevant" or not, we use a monotonicity argument: from a trajectory x(·), we compute an over-approximation V̄(x(t)) of V*(x(t)), along with the TD error δ(x(t)). Then, if V̄(x(t)) is less than the current estimate of the value function V^i(x(t)), it is chosen as the new estimate to be used for the next update. Otherwise, V^i(x(t)) + δ(x(t)) is used instead.
Let us first remark that since c is bounded and s_γ > 0, then V*(x) ≤ V_max, ∀x ∈ X, where

V_max = ∫_0^∞ e^{−s_γ t} c̄ dt = c̄ / s_γ    (18)
This upper bound of V* represents the cost-to-go of a hypothetical worst-case trajectory, that is, a trajectory for which, at every moment, the state-input pair (x, u) has the worst cost c̄. Thus, V_max could be chosen as a trivial over-approximation of V*(x(t)). In this case, our algorithm would be equivalent to TD(0). But if we assume that we compute a trajectory x(·) on the interval [0, T], then a better over-approximation can be obtained:
V̄(x_0) = ∫_0^T e^{−s_γ t} c(x, u) dt + e^{−s_γ T} V_max    (19)
It is easy to see that (19) is indeed an over-approximation of V*(x(0)): it represents the cost of a trajectory that begins as the computed trajectory x(·) on [0, T], which is at best optimal on this finite interval, and then from T to ∞ behaves as the worst-case trajectory. Thus, V̄(x_0) ≥ V*(x_0).
3.3.2 Continuous Implementation of TD(∅)
In Section 3.2.2 we fixed a time step ∆t and gave a numerical scheme to compute the estimate Ṽ_λ. This was useful in particular to make the connection with discrete TD(λ). In this section, we give a continuous implementation of TD(∅) by showing that the computation of V̄ can be coupled with that of the trajectory x(·) by solving a single dynamical system.
Let x_V(t) = ∫_0^t e^{−s_γ r} c(x, u) dr. Then,

V̄(x_0) = x_V(T) + e^{−s_γ T} V_max
ICINCO 2005 - INTELLIGENT CONTROL SYSTEMS AND OPTIMIZATION