samples concentrate in the state space. The triangles
in these figures represent the samples the distributions
are computed from. Hence, in a single figure, there
are $\rho_{CE} \cdot N_{TS} \cdot N = 0.1 \cdot 500 \cdot 20 = 1000$ samples. A
triangle oriented to the left (resp. right) stands for
a −4 (resp. +4) action. Since we chose the naive
model of uncorrelated variables, all the ellipses repre-
sented have their principal directions along the state
space’s axes. A richer covariance model might allow
for a better sample set description at the cost of a more
complex distribution parameter update phase.
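As an illustration, the following minimal sketch (function names and placeholder data are hypothetical, not the code used in our experiments) shows what the naive uncorrelated model amounts to: per-dimension means and standard deviations fitted from a set of elite samples, new candidate states drawn from the resulting axis-aligned Gaussian, and actions drawn from $\{-4, +4\}$.

```python
import numpy as np

def fit_diagonal_gaussian(elite_states):
    """Naive uncorrelated model: one mean and one standard deviation per
    state dimension, so the ellipses stay aligned with the state axes."""
    mu = elite_states.mean(axis=0)
    sigma = elite_states.std(axis=0)          # independent per-axis spreads
    return mu, sigma

def sample_candidates(mu, sigma, n_samples, rng):
    """Draw new candidate states from the diagonal Gaussian; actions are
    drawn separately from {-4, +4} (an assumption for this sketch)."""
    states = rng.normal(mu, sigma, size=(n_samples, mu.shape[0]))
    actions = rng.choice([-4.0, 4.0], size=n_samples)
    return states, actions

rng = np.random.default_rng(0)
elite = rng.normal(0.0, 1.0, size=(50, 2))    # placeholder elite sample set
mu, sigma = fit_diagonal_gaussian(elite)
states, actions = sample_candidates(mu, sigma, 20, rng)
```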
The main conclusion one can draw from these re-
sults is that, by comparing the influence of different
sample sets on the optimal policy computed by FQI,
and by using a probability density-based optimiza-
tion method, we were able to identify a distribution
on the sampling scheme (Figure 5(i)) which induces
very good policies with as few as 20 samples. In contrast, the original paper on tree-based FQI (Ernst et al., 2005) suggests that, on average, tens of thousands of samples collected via random walk in the state space are necessary before an optimal policy can be found. Generalizing from this experimental result, instead of an ever-refining sample collection process, batch-mode RL algorithms can take advantage of sample set optimization, through OSS(N) algorithms, to reach optimal policies with a small number of samples.
6 DISCUSSION
6.1 Time and Space Efficiency
When the state-action space’s dimension becomes
large, processing times for batch-mode RL algorithms
increase dramatically, since the time complexity of these algorithms in the number of samples is often worse than linear (e.g., $O(N \log N)$ for tree-based FQI). At a certain point,
it might become preferable to run $N_{TS}$ sample collections and policy optimizations on small sample sets of size $N$, rather than one large computation on the equivalent pooled sample set of size $N_{TS} \cdot N$ (if that computation is feasible at all). More
formally, this is supported by the complexity estimate
of Section 4, which gives an $O(N_{TS}\, N \log N)$ time per OSS iteration; this compares favourably with the $O(N_{TS}\, N \log(N_{TS}\, N))$ time complexity of applying tree-based FQI to the full set of $N_{TS}\, N$ samples (recall that $N_{TS}$ is the large value here, $N$ being fixed to a small value by the user).
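As a rough, purely illustrative comparison that ignores constant factors, plugging in the values used in Section 5 ($N = 20$, $N_{TS} = 500$) gives
\[
\frac{N_{TS}\, N \log(N_{TS}\, N)}{N_{TS}\, N \log N}
  = \frac{\log(N_{TS}\, N)}{\log N}
  = \frac{\log 10000}{\log 20} \approx 3.1 ,
\]
so, under these asymptotic estimates alone, a single tree-based FQI pass over the pooled set costs roughly three times as much as one OSS iteration, before even considering the memory pressure of the pooled set discussed next.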
On the space complexity side, both approaches re-
quire $O(N_{TS}\, N)$ space to store the trees and the sample
sets. However, OSS can easily benefit from disk storage, since the sample sets are used independently and at different times. If disk storage is allowed, the space complexity of OSS drops to $O(N)$, since an iteration of OSS(N) only requires keeping and processing one set of size $N$ at a time.
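A minimal sketch of such a disk-backed scheme (hypothetical file layout, using NumPy's save/load, not the implementation used here): each $N$-sized sample set is written to its own file, and only the set currently being scored is loaded into memory.

```python
import os
import tempfile
import numpy as np

# Hypothetical layout: one .npy file per N-sized sample set.
workdir = tempfile.mkdtemp(prefix="oss_sets_")

def store_sample_set(index, samples):
    """Write one sample set to disk so it does not stay resident in memory."""
    np.save(os.path.join(workdir, f"set_{index:05d}.npy"), samples)

def score_sample_set(index, score_fn):
    """Load a single set (O(N) memory), score it, and let it be freed."""
    samples = np.load(os.path.join(workdir, f"set_{index:05d}.npy"))
    return score_fn(samples)
```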
Furthermore, it is worth pointing out that computing the score of a given $\theta$ is fully independent of any other score computation. Hence OSS methods can easily be adapted to distributed computation across several small machines, exploiting parallel architectures by splitting the computational burden into small, light-weight tasks.
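For instance, a minimal sketch using Python's multiprocessing could look as follows; a dummy quadratic stands in for the actual FQI-based score, which is not reproduced here.

```python
from multiprocessing import Pool
import numpy as np

def score(theta):
    """Placeholder score: in OSS this would draw an N-sized sample set from
    the distribution parameterized by theta, run FQI on it, and evaluate the
    resulting policy. A dummy quadratic stands in for that computation."""
    theta = np.asarray(theta)
    return -float(np.sum(theta ** 2))

def score_population(thetas, n_workers=4):
    # Each score(theta) is independent of every other one, so the
    # evaluations of one OSS iteration map directly onto a pool of workers.
    with Pool(n_workers) as pool:
        return pool.map(score, thetas)

if __name__ == "__main__":
    population = [np.random.randn(6) for _ in range(16)]
    print(score_population(population))
```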
6.2 Stochastic RL Algorithms
One possible caveat of using a forest of extremely randomized trees as the regressor in FQI is the variance in the results and the associated variance in policy quality. So far, RL algorithms have implicitly been assumed to be deterministic, i.e., given a fixed sample set as input, they always output the same result. This is not true for extremely randomized trees. Their use in the general case of FQI is
still relevant because the variance in the results tends
to zero when the number of samples grows. But in
our case, since we deliberately kept the number of samples very low, we observed a very large variance in the policies generated from a given set of 20 samples. To
reduce this variance, a simple option is to increase the number of trees in the forest, since the variance also tends to zero as the number of trees grows. Although the 200 trees used per Q-function in the previous experiment already constitute a large forest compared to the ones reported in (Ernst et al., 2005) (which had only 50 trees), we ran the OSS meta-algorithm
on FQI with even larger numbers of trees (up to 1000)
and observed the same behaviour as reported in Sec-
tion 5.2. Obviously, when using a non-deterministic
algorithm such as tree-based FQI, one can no longer guarantee that OSS will converge to a sample set providing the optimal policy every time; instead, it will lead to a training set $\theta^*$ that provides the optimal policy with high probability. When the variance in the results tends to zero (with a very large number of trees, or with a deterministic algorithm such as LSPI), this probability should tend to one.
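The following toy illustration (scikit-learn's ExtraTreesRegressor stands in for the extremely randomized trees inside tree-based FQI; the data and query point are placeholders) shows the effect on a 20-sample regression problem: the spread of predictions across independently refitted forests shrinks as the number of trees grows.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Toy 20-sample regression problem, deliberately kept as small as the
# sample sets used in our experiments.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=20)
x_query = np.array([[0.3, -0.2]])

for n_trees in (10, 50, 200, 1000):
    preds = []
    for seed in range(20):  # refit the forest with different random seeds
        model = ExtraTreesRegressor(n_estimators=n_trees, random_state=seed)
        model.fit(X, y)
        preds.append(model.predict(x_query)[0])
    # Standard deviation across refits shrinks as n_trees increases.
    print(n_trees, np.std(preds))
```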
It is also interesting to note that variance in the al-
gorithm’s output policies might actually be desirable,
since it extends the set of policies which are “reach-
able” from an $N$-sized sample set. By storing the best policies found along the way, good policies can be found early in the search process (as Figure 4 illustrates).
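A minimal sketch of that book-keeping (a hypothetical wrapper, not tied to any particular OSS implementation): wrap the score function so that the best candidate ever evaluated is retained, regardless of how the sampling distribution evolves afterwards.

```python
class BestSoFar:
    """Wrap a score function so the best (theta, score) pair ever evaluated
    is retained, letting good policies found early in the search be
    recovered even if later iterations drift away from them."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.best_theta = None
        self.best_score = float("-inf")

    def __call__(self, theta):
        s = self.score_fn(theta)
        if s > self.best_score:
            self.best_score, self.best_theta = s, theta
        return s

# Usage: scored = BestSoFar(score); pass `scored` wherever the OSS loop
# would call `score`, and read scored.best_theta at any time.
```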
These experiments highlighted an interesting (un-