a Bayesian viewpoint for the problem (although a
Bayesian framework is not used for theoretical anal-
yses). This algorithm, which maintains a list of arms
that are close enough to the best one (and which thus
must be played), is inspired by large deviations ideas
and relies on the availability of the rate function associated with the reward distribution.
The K_inf-based algorithm was analyzed by Maillard et al. (Maillard et al., 2011). It is inspired by the
ones studied in (Lai and Robbins, 1985; Burnetas and
Katehakis, 1996), taking also into account the full
empirical distribution of the observed rewards. The analysis covered Bernoulli distributions over the arms, and less explicit but still finite-time bounds were obtained in the case of finitely supported distributions (whose supports do not need to be known in advance).
These results improve on DMED, since finite-time bounds (implying their asymptotic results) are obtained, as well as on UCB1, UCB1-Tuned, and UCB-V.
Later, the KL-UCB algorithm and its variant KL-UCB+ were introduced by Garivier & Cappé (Garivier and Cappé, 2011). KL-UCB satisfies a uniformly better regret bound than UCB and its variants for arbitrary bounded rewards, and it reaches the lower bound of Lai and Robbins when Bernoulli rewards are considered. Besides, simple adaptations of the KL-UCB algorithm are also optimal for rewards generated from exponential families of distributions. Furthermore, a large-scale numerical study comparing KL-UCB with UCB, MOSS, UCB-Tuned, UCB-V and DMED was performed, showing that KL-UCB is remarkably efficient and stable, including for short time horizons.
New algorithms were proposed by Cappé et al. (Cappé et al., 2013), based on upper confidence bounds of the arm rewards computed using different divergence functions. The kl-UCB algorithm uses the Kullback-Leibler divergence, whereas the kl-poisson-
UCB and the kl-exp-UCB account for families of
Poisson and Exponential distributions, respectively.
A unified finite-time analysis of the regret of these
algorithms shows that they asymptotically match the
lower bounds of Lai and Robbins, and Burnetas and
Katehakis. Moreover, they provide significant im-
provements over the state-of-the-art when used with
general bounded rewards.
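To make the divergence-based index concrete, the following is a minimal sketch (under our own choices of threshold and tolerance, not the exact procedure of Cappé et al.) of how a kl-UCB index can be computed for Bernoulli rewards: the index is the largest mean q whose Bernoulli KL divergence from the empirical mean, scaled by the number of pulls, stays below an exploration threshold, found here by bisection.

import math

def bernoulli_kl(p, q, eps=1e-12):
    # KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q).
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean_hat, n_pulls, t, tol=1e-6):
    # Largest q >= mean_hat with n_pulls * d(mean_hat, q) <= log(t),
    # found by bisection; the threshold log(t) is an illustrative choice.
    threshold = math.log(max(t, 2)) / n_pulls
    lo, hi = mean_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

At each round the arm maximizing this index would be pulled; kl-poisson-UCB and kl-exp-UCB replace the Bernoulli divergence with the divergence of the Poisson or Exponential family, respectively.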
Finally, the best empirical sampled average
(BESA) algorithm was proposed by Baransi et al.
(Baransi et al., 2014). It is not based on the computation of empirical confidence bounds, nor can it
be classified as a KL-based algorithm. BESA is fully
non-parametric. As shown in (Baransi et al., 2014),
BESA outperforms TS (a Bayesian approach intro-
duced in the next section) and KL-UCB in several
scenarios with different types of reward distributions.
Stochastic bandit problems have been analyzed
from a Bayesian perspective, i.e., the parameter is assumed to be drawn from a prior distribution instead of being treated as a deterministic unknown quantity. The Bayesian per-
formance is then defined as the average performance
over all possible problem instances weighted by the
prior on the parameters.
The origin of this perspective is in the work by
Gittins (Gittins, 1979). Gittins' index-based policies are a family of Bayesian-optimal policies based on indices that fully characterize each arm given the current history of the game; at each time step the arm with the highest index is pulled.
Later, Gittins proposed the Bayes-optimal ap-
proach (Gittins, 1989) that directly maximizes ex-
pected cumulative rewards with respect to a given
prior distribution.
A lesser known family of algorithms to solve ban-
dit problems is the so-called probability matching or
Thompson sampling (TS). The idea of TS is to select each arm at random according to its probability of being optimal. In contrast to Gittins' index, TS can often be implemented efficiently (Chapelle and Li,
the-art results, and in some cases significantly outper-
formed other alternatives, like UCB methods.
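As an illustration of the probability-matching idea, the following is a minimal sketch under the standard Beta-Bernoulli model (the uniform Beta(1, 1) prior and the pull_arm interface are our own assumptions, not taken from the cited works): each arm keeps a Beta posterior over its success probability, one sample is drawn from each posterior, and the arm with the largest sample is pulled, which selects each arm with its posterior probability of being optimal.

import random

def thompson_sampling(pull_arm, n_arms, horizon):
    # Beta-Bernoulli Thompson sampling; pull_arm(i) must return a 0/1 reward.
    successes = [0] * n_arms  # arm i has posterior Beta(1 + successes[i], 1 + failures[i])
    failures = [0] * n_arms
    for _ in range(horizon):
        # Draw one sample per arm from its posterior and play the best sample.
        samples = [random.betavariate(1 + successes[i], 1 + failures[i])
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        if pull_arm(arm):
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures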
Finally, Bayes-UCB was proposed by Kaufmann
et al. (Kaufmann et al., 2012), inspired by the Bayesian interpretation of the problem but retaining
the simplicity of UCB-like algorithms. It constitutes
a unifying framework for several UCB variants ad-
dressing different bandit problems.
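For intuition, a Bayes-UCB-style index for Bernoulli rewards can be sketched as an upper quantile of each arm's Beta posterior; the uniform prior and the use of scipy below are our own assumptions for illustration, and the quantile level 1 - 1/t corresponds to the practical choice (c = 0) reported by Kaufmann et al.

from scipy.stats import beta

def bayes_ucb_index(successes, failures, t):
    # Index = upper quantile of the arm's Beta(1 + s, 1 + f) posterior;
    # an initialization round pulling each arm once avoids the degenerate level at t = 1.
    level = 1.0 - 1.0 / t
    return beta.ppf(level, 1 + successes, 1 + failures)

The arm with the highest index is pulled at each round, which keeps the UCB-like structure while the confidence level is taken from the posterior rather than from a concentration inequality.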
3 POSSIBILISTIC REWARD
METHOD
The allocation strategy we propose adopts the frequentist view, but it cannot be classified as either a UCB method or a Kullback-Leibler (KL)-based algorithm. The basic idea is as follows: the uncertainty about the arm expected rewards is first modelled by means of possibilistic reward distributions derived from an infinite set of nested confidence intervals around the expected value, on the basis of the Chernoff-Hoeffding inequality. Then, we follow the pignistic probability transformation from decision theory and the transferable belief model (Smets, 2000), which establishes that when we have a plausibility function, such as a possibility function, and no further information with which to make a decision, we can convert this function into a probability distribution following the in-