1.1 The Multi-Armed Bernoulli Bandit (MABB) Problem
The MABB problem is a classical optimization problem that explores the trade-off between exploitation and exploration in reinforcement learning. The problem consists of an agent that sequentially pulls one of multiple arms attached to a gambling machine, with each pull resulting either in a reward or a penalty¹. The sequence of rewards/penalties obtained from each arm i forms a Bernoulli process with an unknown reward probability $r_i$ and a penalty probability $1 - r_i$.
This leaves the agent with the following dilemma: Should the arm that so far seems to provide the highest chance of reward be pulled once more, or should the inferior arm be pulled in order to learn more about its reward probability? Sticking prematurely with the arm that is presently considered the best may lead to not discovering which arm is truly optimal. On the other hand, lingering unnecessarily with the inferior arm postpones the harvest that can be obtained from the optimal arm.
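To make this setting concrete, the following minimal Python sketch (with names of our own choosing; not part of the original formulation) models such a gambling machine, where each arm yields a reward of 1 with its unknown probability $r_i$ and a penalty of 0 otherwise:

```python
import random

class BernoulliArm:
    """One arm of the gambling machine: reward 1 with probability r_i, else penalty 0."""

    def __init__(self, reward_probability):
        # The agent never sees this value; it is only used to simulate pulls.
        self.reward_probability = reward_probability

    def pull(self):
        # Each pull is an independent Bernoulli trial.
        return 1 if random.random() < self.reward_probability else 0

# A two-armed machine with reward probabilities r_1 = 0.9 and r_2 = 0.6.
arms = [BernoulliArm(0.9), BernoulliArm(0.6)]
```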
With the above in mind, we intend to evaluate an agent's arm selection strategy in terms of the so-called Regret, and in terms of the probability of selecting the optimal arm². The Regret measure is non-trivial and, in all brevity, can be perceived as the difference between the sum of rewards expected after N successive arm pulls and what would have been obtained by only pulling the optimal arm. To clarify issues, assume that a reward amounts to the value (utility) of unity (i.e., 1), and that a penalty possesses the value 0. We then observe that the expected return for pulling Arm i is $r_i$. Thus, if the optimal arm is Arm 1, the Regret after N plays becomes:

$$r_1 N - \sum_{i=1}^{N} \hat{r}_i, \qquad (1)$$

with $\hat{r}_i$ being the expected reward at arm pull i, given the agent's arm-selection strategy. In other words, as will be clear in the following, we consider the case where rewards are undiscounted, as discussed in (Auer et al., 2002).
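As a numerical illustration of Eq. (1) (a sketch under our own naming, not taken from the original formulation), the function below accumulates the expected reward of whichever arm a given strategy selects at each pull and subtracts the total from $r_1 N$:

```python
import random

def regret(reward_probs, select_arm, N=1000):
    """Eq. (1): r_1 * N minus the summed expected rewards of the arms
    actually chosen by the strategy over N pulls."""
    r_best = max(reward_probs)
    chosen = 0.0
    for n in range(N):
        i = select_arm(n)          # arm index chosen at pull n
        chosen += reward_probs[i]  # expected reward of that choice
    return r_best * N - chosen

# Example: a uniformly random strategy on arms with r = (0.9, 0.6)
# incurs a Regret of roughly (0.9 - 0.75) * N = 0.15 * N.
print(regret((0.9, 0.6), lambda n: random.randrange(2)))
```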
Over the last decades, several computationally efficient algorithms for tackling the MABB problem have emerged. From a theoretical point of view, LA are known for their ε-optimality. From the field of Bandit Playing Algorithms, confidence interval based algorithms are known for their logarithmically growing Regret.

¹A penalty may also be perceived as the absence of a reward. However, we choose to use the term penalty, as is customary in the LA and RL literature.

²Using Regret as a performance measure is typical in the literature on Bandit Playing Algorithms, while using the "arm selection probability" is typical in the LA literature. In this paper, we will use both these concepts in the interest of comprehensiveness.
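To illustrate the confidence interval principle, the sketch below follows the standard UCB1 index of (Auer et al., 2002): each arm's empirical mean reward is inflated by a confidence radius, and the arm with the largest index is pulled. The variable names and structure are our own; the UCB-TUNED variant referred to below additionally refines the confidence radius with an empirical variance estimate.

```python
import math

def ucb1_select(successes, pulls):
    """Return the index of the arm maximizing empirical mean + confidence radius."""
    # Play every arm once so that the index is well defined.
    for i, n_i in enumerate(pulls):
        if n_i == 0:
            return i
    total = sum(pulls)

    def index(i):
        mean = successes[i] / pulls[i]
        radius = math.sqrt(2.0 * math.log(total) / pulls[i])
        return mean + radius

    return max(range(len(pulls)), key=index)
```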
1.2 Applications
Solution schemes for bandit problems have formed the basis for dealing with a number of applications. For instance, a UCB-TUNED scheme (Auer et al., 2002) is used for move exploration in MoGo, a top-level Computer-Go program on 9 × 9 Go boards (Gelly and Wang, 2006). Furthermore, the so-called UCB1 scheme has formed the basis for guiding Monte-Carlo planning, and for improving planning efficiency significantly in several domains (Kocsis and Szepesvari, 2006).

The applications of LA are many; in the interest of brevity, we list a few more recent ones. LA have been used to allocate polling resources optimally in web monitoring, and for allocating limited sampling resources in binomial estimation problems (Granmo et al., 2007). LA have also been applied to solving NP-complete SAT problems (Granmo and Bouhmala, 2007).
1.3 Contributions and Paper Organization
The contributions of this paper can be summarized as follows. In Sect. 2 we briefly review a selection of the main MABB solution approaches, including LA and confidence interval based schemes. Then, in Sect. 3 we present the Bayesian Learning Automaton (BLA). In contrast to the schemes reviewed there, the BLA is inherently Bayesian in nature, even though it relies only on simple counting and random sampling (a flavour of this principle is sketched at the end of this section). Thus, to the best of our knowledge, BLA is the first MABB algorithm that takes advantage of the Bayesian perspective in a computationally efficient manner. In Sect. 4 we provide extensive experimental results demonstrating that, in contrast to typical LA schemes as well as some Bandit Playing Algorithms, BLA does not rely on external learning speed/accuracy control. The BLA also outperforms established top performers from the field of Bandit Playing Algorithms³. Accordingly, from the above perspective, it is our belief that the BLA represents the current state of the art and a new avenue of research. Finally, in Sect. 5 we list open BLA-related research problems, in addition to providing concluding remarks.

³A comparison of Bandit Playing Algorithms can be found in (Vermorel and Mohri, 2005), with UCB-TUNED distinguishing itself in (Auer et al., 2002).
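To convey the flavour of this counting-and-sampling principle ahead of the formal treatment in Sect. 3, the sketch below selects arms by drawing one random sample from each arm's Beta conjugate posterior. It is an illustrative outline under our own naming and simplifications, not the full BLA specification given in Sect. 3.

```python
import random

def select_arm(successes, failures):
    """Draw one sample from each arm's Beta posterior (the conjugate prior of the
    Bernoulli reward distribution) and pull the arm whose sample is largest."""
    samples = [random.betavariate(s + 1, f + 1)  # Beta(s + 1, f + 1) posterior
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

def update(successes, failures, arm, reward):
    """Simple counting: record whether pulling `arm` gave a reward or a penalty."""
    if reward == 1:
        successes[arm] += 1
    else:
        failures[arm] += 1
```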