To improve the accuracy of the approximated Q-values and to find a (near) optimal policy, (X. Xu and Lu, 2007) have proposed Kernel-Based LSPI (KBLSPI), an example of offline approximate policy iteration that uses Mercer kernels to approximate Q-values (Vapnik, 1998). Moreover, kernel-based LSPI provides automatic feature selection through the kernel basis functions, since it uses the approximate linear dependency sparsification method described in (Y. Engel and Meir, 2004).
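To give a flavour of this approximate linear dependency (ALD) sparsification (described fully later in Section 2), the sketch below checks whether a new state-action feature vector is almost spanned, in the kernel-induced feature space, by the samples already kept in a dictionary. The Gaussian kernel, the feature encoding and the threshold nu are assumptions made for illustration, not the exact choices of (X. Xu and Lu, 2007).

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (Mercer) kernel between two feature vectors."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def ald_test(dictionary, x, nu=0.01, sigma=1.0):
    """Return True if x is approximately linearly dependent on the
    dictionary in the kernel-induced feature space, i.e. it would add
    no new basis function.  nu is the sparsification threshold."""
    if not dictionary:
        return False
    # Gram matrix of the dictionary samples (assumed invertible here).
    K = np.array([[rbf_kernel(xi, xj, sigma) for xj in dictionary]
                  for xi in dictionary])
    k = np.array([rbf_kernel(xi, x, sigma) for xi in dictionary])
    c = np.linalg.solve(K, k)                  # projection coefficients
    delta = rbf_kernel(x, x, sigma) - k @ c    # residual of the projection
    return delta <= nu
```

When the residual exceeds nu, the new sample is added to the dictionary and contributes a new basis function; otherwise it is discarded, which is what yields the automatic feature selection mentioned above.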
(L. Buşoniu and Babuška, 2010) have adapted LSPI, which learns offline, for online reinforcement learning; the result is called online LSPI. A good online learning algorithm must produce acceptable performance quickly, rather than only at the end of the learning process as in offline learning. In order to obtain good performance, an online algorithm has to find a proper balance between exploitation, i.e. using the collected information in the best possible way, and exploration, i.e. testing out the available alternatives (Sutton and Barto, 1998). Several exploration policies are available for that purpose and one of the most popular is ε-greedy exploration, which selects with probability 1 − ε the action with the highest estimated Q-value and with probability ε selects uniformly at random one of the actions available in the current state. To get good performance, the parameter ε has to be tuned for each problem. To get rid of parameter tuning and to increase the performance of online LSPI, (Yahyaa and Manderick, 2013) have proposed using the Knowledge Gradient (KG) policy (I.O. Ryzhov and Frazier, 2012) in online LSPI.
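For concreteness, the following is a minimal sketch of ε-greedy action selection over the estimated Q-values of the current state; the array layout and the random-number generator are assumptions of the example, and the KG policy that removes the tuning of ε is presented later in Section 2.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """Select an action index given the estimated Q-values of the
    current state: greedy with probability 1 - epsilon, uniformly
    random with probability epsilon."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```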
To improve the performance of online LSPI and to obtain automatic feature selection, we propose online kernel-based LSPI and use the knowledge gradient (KG) as exploration policy. The rest of the paper is organised as follows: in Section 2 we present Markov decision processes, LSPI, the knowledge gradient policy for online learning, kernel-based LSPI and the approximate linear dependency test. In Section 3 we present the knowledge gradient policy in online kernel-based LSPI. In Section 4 we describe the domains used in our experiments and our results. We conclude in Section 5.
2 PRELIMINARIES
In this section, we discuss Markov decision processes, online LSPI, the knowledge gradient exploration policy (KG), offline kernel-based LSPI (KBLSPI) and approximate linear dependency (ALD).
2.1 Markov Decision Process
A finite Markov decision process (MDP) is a 5-tuple $(S, A, P, R, \gamma)$, where the state space $S$ contains a finite number of states $s$ and the action space $A$ contains a finite number of actions $a$; the transition probabilities $P(s,a,s')$ give the conditional probabilities $p(s' \mid s,a)$ that the environment transits to state $s'$ when the agent takes action $a$ in state $s$; the reward distributions $R(s,a,s')$ give the expected immediate reward when the environment transits to state $s'$ after taking action $a$ in state $s$; and $\gamma \in [0,1)$ is the discount factor that determines the present value of future rewards (Puterman, 1994; Sutton and Barto, 1998).
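Purely as an illustration (the paper does not prescribe any particular representation), such a finite MDP can be stored as transition and reward arrays indexed by $(s, a, s')$; the toy numbers below are placeholders.

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9                                    # discount factor in [0, 1)

# P[s, a, s2] = p(s2 | s, a); every slice P[s, a, :] sums to one.
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)

# R[s, a, s2] = expected immediate reward for the transition s --a--> s2.
R = np.zeros((n_states, n_actions, n_states))
R[:, :, -1] = 1.0                              # e.g. a reward for reaching the last state
```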
A deterministic policy $\pi : S \to A$ determines which action $a$ the agent takes in each state $s$. For the MDPs considered, there is always a deterministic optimal policy and so we can restrict the search to such policies (Puterman, 1994; Sutton and Barto, 1998). By definition, the state-action value function $Q^\pi(s,a)$ for a policy $\pi$ gives the expected total discounted reward $E^\pi\!\left[\sum_{i=t}^{\infty} \gamma^{\,i-t} r_i\right]$ when the agent starts in state $s$ at time $t$, takes action $a$ and follows policy $\pi$ thereafter. The goal of the agent is to find the optimal policy $\pi^*$, i.e. the one that maximizes $Q^\pi$ for every state $s$ and action $a$: $\pi^*(s) = \arg\max_{a \in A} Q^*(s,a)$, where $Q^*(s,a) = \max_{\pi} Q^\pi(s,a)$ is the optimal state-action value function. For the MDPs considered, the Bellman equations for the state-action value function $Q^\pi$ are given by
\[
Q^\pi(s,a) = \sum_{s'} P(s,a,s') \left[ R(s,a,s') + \gamma \, Q^\pi(s',a') \right] \qquad (1)
\]
In Equation (1), the sum is taken over all states $s'$ that can be reached from state $s$ when action $a$ is taken, and the action $a'$ taken in the next state $s'$ is determined by the policy $\pi$, i.e. $a' = \pi(s')$. If the MDP is completely known, then algorithms such as value or policy iteration find the optimal policy $\pi^*$. Policy iteration starts with an initial policy $\pi_0$, e.g. randomly selected, and repeats the next two steps until no further improvement is found: 1) policy evaluation, where the current policy $\pi_i$ is evaluated using the Bellman equations (1) to calculate the corresponding value function $Q^{\pi_i}$, and 2) policy improvement, where this value function is used to find an improved new policy $\pi_{i+1}$ that is greedy in the previous one, i.e. $\pi_{i+1}(s) = \arg\max_{a \in A} Q^{\pi_i}(s,a)$ (Sutton and Barto, 1998).
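A minimal tabular sketch of this procedure is given below, under the assumption that the MDP is stored as the arrays P, R and gamma of the earlier sketch; it evaluates the current policy exactly by solving the linear system implied by the Bellman equations (1) and then improves greedily. This is plain policy iteration, not the approximate LSPI variant discussed later.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Exact policy iteration for a finite MDP given as arrays
    P[s, a, s2] = p(s2 | s, a) and R[s, a, s2]."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial policy
    while True:
        # Policy evaluation: with V(s) = Q(s, pi(s)), equation (1) becomes
        # the linear system (I - gamma * P_pi) V = r_pi.
        P_pi = P[np.arange(n_states), policy]         # transitions under pi, shape (S, S)
        r_pi = np.sum(P_pi * R[np.arange(n_states), policy], axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        Q = np.sum(P * (R + gamma * V), axis=2)       # back up Q(s, a) from V
        # Policy improvement: greedy with respect to the evaluated Q.
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, Q                          # no further improvement
        policy = new_policy
```

Calling policy, Q = policy_iteration(P, R, gamma) on the toy arrays above returns a policy that is greedy with respect to its own Q-function and is therefore optimal for that MDP.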
For finite MDPs, the action-value function $Q^\pi$ for a policy $\pi$ can be represented by a lookup table of size $|S| \times |A|$, one entry per state-action pair. However, when the state and/or action spaces are large, this approach becomes computationally infeasible due to the curse of dimensionality and one has to rely on function approximation instead. Moreover, the agent does