cies. Untuned policies are either parameter-free policies or policies used with the default parameters suggested in the literature. Tuned policies are the baselines that were tuned using Algorithm 2, and learned policies comprise POWER-1 and POWER-2. For each policy, we compute the mean expected regret on two kinds of bandit problems: the 10000 testing problems drawn from D_P and another set of 10000 problems with a different kind of underlying reward distribution: Gaussians truncated to the interval [0, 1] (that is, if a drawn reward falls outside this interval, it is discarded and a new one is drawn). To sample one such problem, we select the mean and the standard deviation of the Gaussian distributions uniformly in the range [0, 1]. For all policies except the untuned ones, we used three different training horizon values: T = 10, T = 100, and T = 1000.
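As an illustration, the truncated-Gaussian test problems described above could be sampled as follows. This is a minimal sketch: the function names, the number of arms, and the rejection loop are our own illustrative choices, not code from the paper.

```python
import random

def sample_problem(num_arms, rng=random):
    """Sample one truncated-Gaussian bandit problem: for each arm,
    draw a mean and a standard deviation uniformly in [0, 1]."""
    return [(rng.random(), rng.random()) for _ in range(num_arms)]

def draw_reward(mu, sigma, rng=random):
    """Rejection sampling: if the Gaussian sample falls outside [0, 1],
    throw it away and draw a new one."""
    while True:
        r = rng.gauss(mu, sigma)
        if 0.0 <= r <= 1.0:
            return r
```

Rejection sampling keeps the shape of the Gaussian within [0, 1] rather than clipping mass onto the boundaries, which matches the description given above.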
As already pointed out in (Auer et al., 2002), UCB1-BERNOULLI is particularly well fitted to bandit problems with Bernoulli distributions. It also proves effective on bandit problems with Gaussian distributions, making it nearly always outperform the other untuned policies. By tuning UCB1, we outperform the UCB1-BERNOULLI policy (e.g., 4.91 instead of 5.43 on Bernoulli problems with T = 1000). The same sometimes happens with UCB-V. However, even though we used a careful tuning procedure, UCB2 and ε_n-GREEDY never outperform UCB1-BERNOULLI.
When using the same training and testing horizon T, POWER-1 and POWER-2 systematically outperform all the other policies (e.g., 1.82 against 2.05 when T = 100 and 3.95 against 4.91 when T = 1000).
Robustness w.r.t. the Horizon T. As expected, the learned policies give their best performance when the training and testing horizons are equal. When the testing horizon is larger than the training horizon, the quality of the policy may degrade quickly (e.g., when evaluating POWER-1 trained with T = 10 on a horizon T = 1000); the inverse effect is less marked.
Robustness w.r.t. the Kind of Distribution. Although truncated Gaussian distributions differ significantly from Bernoulli distributions, the learned policies generalize well to this new setting most of the time and still outperform all the other policies.
Robustness w.r.t. the Metric. Table 2 gives, for each policy, its regret and its percentage of wins against UCB1-BERNOULLI when trained with the same horizon as the test horizon. To compute the percentage of wins against UCB1-BERNOULLI, we evaluate the expected regret on each of the 10000 testing problems and count the number of problems for which the tested policy outperforms UCB1-BERNOULLI. We observe that by minimizing the expected regret, we also reach large percentages of wins: 84.6% for T = 100 and 91.3% for T = 1000. Note that, in our approach, it is easy to change the objective function. So, if the real applicative aim were to maximize the percentage of wins against UCB1-BERNOULLI, this criterion could be used directly in the policy optimization stage to reach even better scores.
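The percentage-of-wins metric described above amounts to a simple count over the per-problem expected regrets. A minimal sketch, with illustrative names not taken from the paper:

```python
def percentage_of_wins(regret_tested, regret_baseline):
    """Percentage of test problems on which the tested policy achieves a
    strictly lower expected regret than the baseline (e.g. UCB1-BERNOULLI).
    Both arguments are lists of per-problem expected regrets, aligned
    index-by-index over the same set of testing problems."""
    assert len(regret_tested) == len(regret_baseline)
    wins = sum(1 for rt, rb in zip(regret_tested, regret_baseline) if rt < rb)
    return 100.0 * wins / len(regret_tested)
```

For example, with per-problem regrets [1, 2, 3, 4] for the tested policy and [2, 2, 4, 3] for the baseline, the tested policy wins on two of the four problems, giving 50%.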
5 CONCLUSIONS
The approach proposed in this paper for exploiting a priori information to learn policies for K-armed bandit problems has been tested when the time horizon is known and the arm rewards are generated by Bernoulli distributions. The learned policies were found to significantly outperform other policies previously published in the literature, such as UCB1, UCB2, UCB-V, KL-UCB, and ε_n-GREEDY. The robustness of the learned policy with respect to wrong information was also highlighted.
There are, in our opinion, several research directions that could be investigated to further improve the algorithm for learning policies proposed in this paper. For example, we found that problems similar to the overfitting problem encountered in supervised learning could occur when considering a too large set of candidate policies. This naturally calls for studying whether our learning approach could be combined with regularization techniques. More sophisticated optimizers could also be considered for identifying, in the set of candidate policies, the one that is predicted to behave best.
The UCB1, UCB2, UCB-V, KL-UCB, and ε_n-GREEDY policies used for comparison can be shown (under certain conditions) to have interesting bounds on their expected regret under asymptotic conditions (very large T), while we did not provide such bounds for our learned policy. It would certainly be relevant to investigate whether similar bounds could be derived for our learned policies or, alternatively, to see how the approach could be adapted so that the learned policies have strong theoretical performance guarantees. For example, better bounds on the expected regret could perhaps be obtained by identifying, in a set of candidate policies, the one that gives the smallest maximal value of the expected regret over this set, rather than the one that gives the best average performance.
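The contrast between these two selection criteria can be sketched in a few lines. This is our own illustrative sketch (the dictionary structure and names are assumptions, not code from the paper): given per-problem expected regrets for each candidate policy, mean-regret selection picks the best average performer, while min-max selection picks the policy whose worst-case regret over the set is smallest.

```python
def select_policies(candidate_regrets):
    """candidate_regrets maps each candidate policy name to its list of
    expected regrets, one entry per test problem. Returns the min-max
    choice (smallest worst-case regret) and the mean-regret choice
    (smallest average regret), which may differ."""
    best_minimax = min(candidate_regrets,
                       key=lambda p: max(candidate_regrets[p]))
    best_mean = min(candidate_regrets,
                    key=lambda p: sum(candidate_regrets[p]) / len(candidate_regrets[p]))
    return best_minimax, best_mean
```

For instance, with regrets {'a': [0.0, 8.0], 'b': [4.0, 5.0]}, policy 'a' has the better mean (4.0 vs. 4.5) but policy 'b' has the better worst case (5.0 vs. 8.0), so the two criteria disagree.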
ICAART 2012  International Conference on Agents and Artificial Intelligence