cies. Untuned policies are either policies that are
parameter-free or policies used with default param-
eters suggested in the literature. Tuned policies are
the baselines that were tuned using Algorithm 2 and
learned policies comprise POWER-1 and POWER-2.
For each policy, we compute the mean expected regret on two kinds of bandit problems: the 10000 testing problems drawn from D_P and another set of 10000 problems with a different kind of underlying reward distribution: Gaussians truncated to the interval [0, 1] (i.e., if a drawn reward falls outside this interval, it is rejected and a new one is drawn). To sample one such problem, we select the mean and the standard deviation of the Gaussian distributions uniformly in the range [0, 1]. For all policies except the untuned ones, we used three different training horizon values: T = 10, T = 100 and T = 1000.
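For illustration, a minimal Python sketch of this sampling procedure could look as follows (the number of arms K and the function names are illustrative only, not code from this work):

import random

def sample_truncated_gaussian_problem(K):
    # One test problem: each of the K arms gets a mean and a standard
    # deviation drawn uniformly from [0, 1].
    return [(random.uniform(0.0, 1.0), random.uniform(0.0, 1.0))
            for _ in range(K)]

def draw_reward(mu, sigma):
    # Reward from a Gaussian truncated to [0, 1] by rejection:
    # redraw until the sample falls inside the interval.
    while True:
        r = random.gauss(mu, sigma)
        if 0.0 <= r <= 1.0:
            return r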
As already pointed out in (Auer et al., 2002), UCB1-BERNOULLI is particularly well fitted to bandit problems with Bernoulli distributions. It also proves effective on bandit problems with Gaussian distributions, so that it nearly always outperforms the other untuned policies. By tuning UCB1, we outperform the UCB1-BERNOULLI policy (e.g. 4.91 instead of 5.43 on Bernoulli problems with T = 1000). The same sometimes happens with UCB-V. However, even with a careful tuning procedure, UCB2 and ε_n-GREEDY never outperform UCB1-BERNOULLI.
When using the same training and testing hori-
zon T , POWER-1 and POWER-2 systematically out-
perform all the other policies (e.g. 1.82 against 2.05
when T = 100 and 3.95 against 4.91 when T = 1000).
Robustness w.r.t. the Horizon T . As expected, the
learned policies give their best performance when the
training and the testing horizons are equal. When the
testing horizon is larger than the training horizon, the
quality of the policy may quickly degrade (e.g. when
evaluating POWER-1 trained with T = 10 on a horizon T = 1000), the opposite mismatch being less marked.
Robustness w.r.t. the Kind of Distribution. Al-
though truncated Gaussian distributions are signif-
icantly different from Bernoulli distributions, the
learned policies most of the time generalize well to
this new setting and still outperform all the other poli-
cies.
Robustness w.r.t. the Metric. Table 2 gives, for each policy, its regret and its percentage of wins
against UCB1-BERNOULLI, when trained with the
same horizon as the test horizon. To compute the
percentage of wins against UCB1-BERNOULLI, we
evaluate the expected regret on each of the 10000
testing problems and count the number of prob-
lems for which the tested policy outperforms UCB1-
BERNOULLI. We observe that by minimizing the expected regret, we also reach high percentages of wins: 84.6% for T = 100 and 91.3%
for T = 1000. Note that, in our approach, it is easy
to change the objective function. So if the actual application goal were to maximize the percentage of wins against UCB1-BERNOULLI, this criterion could be used directly in the policy optimization stage to reach even better scores.
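For illustration, a minimal Python sketch of this metric could look as follows (assuming the per-problem expected regrets of the tested policy and of UCB1-BERNOULLI have already been estimated; the function name is illustrative only):

def percentage_of_wins(tested_regrets, reference_regrets):
    # Percentage of testing problems on which the tested policy achieves
    # a strictly lower expected regret than the reference policy
    # (UCB1-BERNOULLI in this comparison).
    wins = sum(1 for r_t, r_ref in zip(tested_regrets, reference_regrets)
               if r_t < r_ref)
    return 100.0 * wins / len(tested_regrets)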
5 CONCLUSIONS
The approach proposed in this paper for exploiting a priori information to learn policies for K-armed bandit problems has been tested in a setting where the time horizon is known and the arm rewards are generated by Bernoulli distributions. The learned policies were found to significantly outperform other policies previously published in the literature such as UCB1, UCB2, UCB-V, KL-UCB and ε_n-GREEDY. The robustness of the learned policies with respect to wrong information was also highlighted.
There are in our opinion several research directions that could be investigated to further improve the policy-learning algorithm proposed in this paper. For example, we found that problems similar to the overfitting problem met in supervised learning can occur when considering too large a set of candidate policies. This naturally calls for studying whether our learning approach could be combined with regularization techniques. More sophisticated optimizers could also be considered for identifying, within the set of candidate policies, the one that is predicted to behave best.
The UCB1, UCB2, UCB-V, KL-UCB and ε_n-GREEDY policies used for comparison can be shown
(under certain conditions) to have interesting bounds on their expected regret in the asymptotic regime (very large T), while we did not provide such bounds
for our learned policy. It would certainly be rele-
vant to investigate whether similar bounds could be
derived for our learned policies or, alternatively, to
see how the approach could be adapted so as to have
learned policies that have strong theoretical perfor-
mance guarantees. For example, better bounds on the expected regret could perhaps be obtained by identifying, in the set of candidate policies, the one that gives the smallest maximal value of the expected regret over the training problems rather than the one that gives the best average performance.
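For illustration, a minimal Python sketch contrasting the two selection criteria could look as follows (the names are illustrative only, and the sketch abstracts away the actual optimization over the policy space considered in the paper):

def select_policy(candidate_regrets, minimax=False):
    # candidate_regrets maps each candidate policy name to the list of
    # its expected regrets on the training problems.  With minimax=False
    # the candidate with the best average regret is returned; with
    # minimax=True the one with the smallest worst-case regret is chosen.
    if minimax:
        score = lambda regrets: max(regrets)                 # worst-case regret
    else:
        score = lambda regrets: sum(regrets) / len(regrets)  # average regret
    return min(candidate_regrets, key=lambda name: score(candidate_regrets[name]))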