cess of the best arm $i^*$, and $p_i$ is the probability of success of the selected arm $i$ at time step $t$. To minimize the total regret, at each time step $t$ the agent has to trade off between selecting the optimal arm $i^*$ (exploitation) to minimize the regret² and selecting one of the non-optimal arms $i$ (exploration) to increase the confidence in the estimated probability of success $\hat{p}_i = \alpha_i/(\alpha_i + \beta_i)$ of arm $i$, where $\alpha_i$ is the number of successes (the number of times a reward of 1 is received) and $\beta_i$ is the number of failures (the number of times a reward of 0 is received) of arm $i$.

²At each time step $t$, the regret equals $p^* - p_i(t)$.
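For reference, the total regret referred to above can be written out explicitly; the horizon symbol $T$ below is ours, while the remaining quantities are as defined in the text:

```latex
% Per-step and cumulative regret over a horizon of T time steps,
% where i is the arm selected at time step t and p^* = max_{i in A} p_i:
\Delta(t) = p^{*} - p_i(t), \qquad
R_T = \sum_{t=1}^{T} \bigl( p^{*} - p_i(t) \bigr).
% Posterior-mean estimate of the probability of success of arm i:
\hat{p}_i = \frac{\alpha_i}{\alpha_i + \beta_i}.
```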
Thompson Sampling Policy. Thompson sampling (Thompson, 1933) assigns to each arm $i \in A$ a random probability of selection $P_i$ to trade off between exploration and exploitation. The probability of selection $P_i$ of each arm $i$ is drawn from a Beta distribution, i.e. $P_i \sim \operatorname{Beta}(\alpha_i, \beta_i)$, where $\alpha_i$ is the number of successes and $\beta_i$ is the number of failures of arm $i$. The random probability of selection $P_i$ of an arm $i$ reflects the performance of that arm, i.e. its unknown probability of success $p_i$: $P_i$ tends to be high when arm $i$ has a high probability of success $p_i$. With Bayesian priors on the Bernoulli probability of success $p_i$ of each arm $i$, Thompson sampling initially sets the number of successes $\alpha_i$ and the number of failures $\beta_i$ of each arm $i$ to 1. At each time step $t$, Thompson sampling samples the probability of selection $P_i$ for each arm $i \in A$ (the probability that arm $i$ is optimal) from the Beta distribution, i.e. $P_i \sim \operatorname{Beta}(\alpha_i, \beta_i)$. Because these samples are random, at time step $t$ the optimal arm $i^* = \operatorname{argmax}_{i \in A} p_i$ may have the highest probability of selection $P_{i^*}$, while at time step $t+1$ a suboptimal arm $j \in A$, $j \neq i^*$, may have the highest probability of selection $P_j$. Thompson sampling selects the arm $i^*_{TS}$ that has the maximum probability of selection $P_{i^*_{TS}}$, i.e. $i^*_{TS} = \operatorname{argmax}_{i \in A} P_i$, and observes the reward $r_{i^*_{TS}}$.
If $r_{i^*_{TS}} = 1$, then Thompson sampling updates the number of successes $\alpha_{i^*_{TS}} = \alpha_{i^*_{TS}} + 1$ for the arm $i^*_{TS}$. As a result, the estimated probability of success $\hat{p}_{i^*_{TS}}$ of the arm $i^*_{TS}$ increases. If $r_{i^*_{TS}} = 0$, then Thompson sampling updates the number of failures $\beta_{i^*_{TS}} = \beta_{i^*_{TS}} + 1$ for the arm $i^*_{TS}$. As a result, the estimated probability of success $\hat{p}_{i^*_{TS}}$ of the arm $i^*_{TS}$ decreases.
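The policy described above amounts to a few lines of code. The sketch below is an illustration for the Bernoulli-reward setting of this section, not the authors' implementation; the helper `pull` and the use of NumPy's Beta sampler are our assumptions.

```python
import numpy as np

def thompson_sampling(pull, n_arms, horizon, rng=np.random.default_rng()):
    """Minimal Thompson sampling sketch for Bernoulli rewards.

    `pull(i)` is assumed to return a reward of 0 or 1 for arm i.
    """
    alpha = np.ones(n_arms)   # prior: one success per arm
    beta = np.ones(n_arms)    # prior: one failure per arm
    for t in range(horizon):
        # Sample a probability of selection P_i ~ Beta(alpha_i, beta_i) per arm
        # and select the arm with the maximum sampled value.
        p_sel = rng.beta(alpha, beta)
        i_ts = int(np.argmax(p_sel))
        r = pull(i_ts)             # observe the Bernoulli reward (0 or 1)
        alpha[i_ts] += r           # success: alpha_i + 1
        beta[i_ts] += 1 - r        # failure: beta_i + 1
    return alpha / (alpha + beta)  # posterior-mean estimates of p_i
```

With a stationary Bernoulli `pull`, the returned posterior means concentrate on the true probabilities of success as the horizon grows.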
Since Thompson sampling is easy to implement, we use it to select the scalarized function $s \in S$. We assume that each scalarized function $s$ has an unknown probability of success $p_s$ and that, when we select $s$, we receive a reward of either 1 or 0. We call the algorithm that uses Thompson sampling to select the weight set “Adaptive Scalarized Multi-Objective Multi-Armed Bandit” (adaptive-SMOMAB). Note that adaptive-SMOMAB uses Thompson sampling to select the weight set, while the scalarized multi-objective multi-armed bandit (MOMAB) selects one of the weight sets $\mathbf{w}_s \in \mathbf{W}$ uniformly at random.
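The difference between the two selection rules can be sketched as follows; the function names are ours, and `alpha_s`, `beta_s` are assumed to hold the per-function success and failure counts:

```python
import numpy as np

rng = np.random.default_rng()

def select_weight_set_momab(n_functions):
    """MOMAB baseline: choose a scalarized function (weight set) uniformly at random."""
    return int(rng.integers(n_functions))

def select_weight_set_adaptive(alpha_s, beta_s):
    """adaptive-SMOMAB: sample P_s ~ Beta(alpha_s, beta_s) for every scalarized
    function and choose the one with the maximum sampled value."""
    return int(np.argmax(rng.beta(alpha_s, beta_s)))
```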
The Adaptive-SMOMAB Algorithm. As in the case of MABs, Thompson sampling draws random samples from the Beta distribution $\operatorname{Beta}(\alpha_s, \beta_s)$ to assign a probability of selection $P_s$ to each scalarized function $s$, where $\alpha_s$ is the number of successes and $\beta_s$ is the number of failures of the scalarized function $s$. We consider that each scalarized function $s$ has an unknown probability of success $p_s$, and by playing each scalarized function $s$ we can estimate the corresponding probability of success. At each time step $t$, we maintain a value $V_s(t)$ for each scalarized function $s$, where $V_s(t) = \max_{i \in A} f^s((\hat{\boldsymbol{\mu}}_i)^s)$ is the value of the optimal arm $i^* = \operatorname{argmax}_{i \in A} f^s((\hat{\boldsymbol{\mu}}_i)^s)$ under the scalarized function $s$, and $(\hat{\boldsymbol{\mu}}_i)^s$ is the estimated mean vector of arm $i$ under the scalarized function $s$. If we select the scalarized function $s$ at time step $t$ and its value $V_s(t)$ is greater than or equal to the value at the previous selection, $V_s(t) \geq V_s(t-1)$, then the scalarized function $s$ performs well, because it is able to select the same optimal arm or another optimal arm with a higher value. Otherwise, the scalarized function $s$ does not perform well.
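To make this feedback rule concrete, the sketch below shows one reading of an adaptive-SMOMAB iteration: Thompson sampling picks the scalarized function, the scalarized value picks the arm, and the comparison $V_s(t) \geq V_s(t-1)$ yields the binary success signal. The helper `pull`, the variable names, and the plain weighted-sum scalarization are our assumptions; the paper itself uses the LS1-KG scalarization described in the pseudocode below.

```python
import numpy as np

def adaptive_smomab(pull, weight_sets, n_arms, n_objectives, horizon,
                    rng=np.random.default_rng()):
    """Sketch of one possible adaptive-SMOMAB loop (not the paper's exact code).

    `pull(i)` is assumed to return a reward vector of length `n_objectives`
    for arm i. A plain weighted sum stands in for the LS1-KG scalarization.
    """
    weight_sets = np.asarray(weight_sets)               # shape (|S|, D)
    n_funcs = len(weight_sets)
    alpha = np.ones(n_funcs)                            # successes per function s
    beta = np.ones(n_funcs)                             # failures per function s
    mu_hat = np.zeros((n_funcs, n_arms, n_objectives))  # estimated mean vectors
    counts = np.zeros((n_funcs, n_arms))                # pulls of arm i under s
    v_prev = np.zeros(n_funcs)                          # previous value V_s

    def f(s, i):
        # Placeholder scalarization: weighted sum of the estimated mean vector.
        return float(weight_sets[s] @ mu_hat[s, i])

    def update(s, i):
        r = np.asarray(pull(i))                         # multi-objective reward
        counts[s, i] += 1
        mu_hat[s, i] += (r - mu_hat[s, i]) / counts[s, i]   # incremental mean

    # Initial plays: pull each arm once under each scalarized function.
    for s in range(n_funcs):
        for i in range(n_arms):
            update(s, i)
        v_prev[s] = max(f(s, j) for j in range(n_arms))

    for t in range(horizon):
        # Thompson sampling over the scalarized functions (weight sets).
        s = int(np.argmax(rng.beta(alpha, beta)))
        # Select the arm that maximizes the scalarized value under s.
        i = int(np.argmax([f(s, j) for j in range(n_arms)]))
        update(s, i)
        # Value of function s and its binary success signal.
        v_s = max(f(s, j) for j in range(n_arms))
        success = 1 if v_s >= v_prev[s] else 0
        alpha[s] += success
        beta[s] += 1 - success
        v_prev[s] = v_s
    return mu_hat
```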
The pseudocode of the adaptive-SMOMAB algorithm is given in Figure 2. The linear scalarized-KG across arms (LS1-KG) function $f$ is used to convert the multi-objective problem into a single-objective one. The number of scalarized functions is $|S| = D + 1$, where $D$ is the number of objectives. The horizon of an experiment is $L$ steps. The algorithm in Figure 2 first plays each arm $Initial$ times for each scalarized function $s$. The scalarized function set is $\mathbf{F} = (f^1, \cdots, f^{|S|})$, and each scalarized function $s$ has a corresponding predefined weight set $\mathbf{w}_s = (w_{1,s}, \cdots, w_{D,s})$. $N_s$ is the number of times the scalarized function $s$ is pulled, and $N_i^s$ is the number of times arm $i$ is pulled under the scalarized function $s$. $(\mathbf{r}_i)^s$ is the reward vector of the pulled arm $i$ under the scalarized function $s$, which is drawn from a normal distribution $N(\boldsymbol{\mu}, \boldsymbol{\sigma}_r^2)$, where $\boldsymbol{\mu}$ is the true mean vector and $\boldsymbol{\sigma}_r^2$ is the true variance vector of the reward. $(\hat{\boldsymbol{\mu}}_i)^s$ and $(\hat{\boldsymbol{\sigma}}_i)^s$ are the estimated mean and standard deviation vectors of arm $i$ under the scalarized function $s$, respectively. $V_s = \max_{i \in A} f^s((\hat{\boldsymbol{\mu}}_i)^s)$ is the value of each scalarized function $s$ after playing each arm $i$ for $Initial$ steps, where $f^s((\hat{\boldsymbol{\mu}}_i)^s)$ is the value of LS-KG for arm $i$ under the scalarized function $s$. The number of successes $\alpha_s$ and the number of failures $\beta_s$ of each scalarized function $s$ are set to 1, as in (Thompson, 1933); therefore, the estimated probability of success $\hat{p}_s = \alpha_s/(\alpha_s + \beta_s)$ is 0.5. The prob-