this paper significantly reduces the regret between the
reward obtained and the optimal reward when the
number of samples available at each step is very limited.
2 PROBLEM FORMULATION
Multi-armed bandits (MAB) are a systematic way of
modeling sequential decision problems. The goal is to
balance the exploration of uncertain options against the
exploitation of options already known to perform well.
The colorful name of the problem comes from a story:
a gambler enters a casino and sits in front of a slot
machine with multiple arms. When he chooses an arm
to pull, the machine generates a random reward that
follows a certain distribution. Because the
distribution of each arm is unknown to the gambler in
advance, he can only use experience to estimate the
true distribution. This leads to a dilemma: spending
rounds pulling the arms that have produced good returns
in the past is likely to yield a good reward, but spending
rounds pulling the arms that have not yet been fully
explored might reveal an even better arm and thus a higher
reward. How to balance exploration and
exploitation is the key to the MAB problem.
More formally, the Multi-Armed Bandit (MAB) framework
addresses the exploration-exploitation dilemma in sequential
decision problems. Let K = {1, . . . , K} be the set of
possible arms/actions to choose from and T = {1, 2, . . . , n}
be the time instants. At every time step t ∈ T the agent
has to select one of the K arms. Each time an action i is
taken, a random reward r_i is obtained from the unknown
distribution D_i. The goal is to maximize the total reward
accumulated over the n rounds.
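A minimal sketch of this interaction loop is shown below; the Bernoulli arm probabilities and the uniformly random placeholder policy are illustrative assumptions, not part of the paper's setup.

import numpy as np

rng = np.random.default_rng(0)

K = 5                      # number of arms
n = 1000                   # number of rounds
p = rng.uniform(size=K)    # hypothetical success probability of each arm (unknown to the agent)

total_reward = 0
for t in range(n):
    a_t = rng.integers(K)          # placeholder policy: pick an arm uniformly at random
    r_t = rng.binomial(1, p[a_t])  # reward drawn from the unknown distribution of the chosen arm
    total_reward += r_t

print("total reward over", n, "rounds:", total_reward)

Any bandit policy replaces the random choice of a_t; the rest of the loop stays the same.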
The movie recommendation system can be seen as
an MAB problem. Each "arm" in the MAB problem
corresponds to a movie option available for
recommendation, with different arms representing
different genres of movies. When a user interacts with
the recommendation system by selecting a movie, it is
akin to pulling an arm in the MAB problem. The
reward obtained from the chosen movie reflects the
user's feedback, namely the rating given to that genre of
movie. However, the movie recommendation system
differs from the traditional MAB problem in an
important way: the reward distributions of the
different genres vary from year to year; each year's
distribution is related to the previous years' but is not
identical to them.
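A toy simulation of this non-stationary setting could look as follows; the genre names, the yearly drift model, and the 1-5 rating scale are illustrative assumptions rather than the paper's data.

import numpy as np

rng = np.random.default_rng(1)

genres = ["action", "comedy", "drama", "sci-fi"]          # hypothetical arms (movie genres)
mean_rating = rng.uniform(2.0, 4.5, size=len(genres))     # mean rating of each genre on a 1-5 scale

def next_year(means, drift=0.3):
    # This year's means depend on last year's but are shifted by a small random drift.
    return np.clip(means + rng.normal(0, drift, size=means.shape), 1.0, 5.0)

def pull(arm, means, noise=0.5):
    # A user watches a movie of the chosen genre and returns a noisy rating as the reward.
    return float(np.clip(rng.normal(means[arm], noise), 1.0, 5.0))

for year in range(3):
    mean_rating = next_year(mean_rating)
    print(f"year {year}:", dict(zip(genres, mean_rating.round(2))))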
3 ALGORITHM
The Thompson Sampling (TS) algorithm was first
proposed in 1933 to address the two-armed bandit
problem of how to allocate experimental effort in
clinical trials. It has since been applied in
financial investment, advertising placement, and
other fields.
Algorithm 1: Thompson Sampling.
Initialize α = (α_1, …, α_K) and β = (β_1, …, β_K);
for time index t ← 1 to n do
    Sample θ_i ∼ Beta(α_i, β_i) for each arm i = 1, …, K
    a_t = argmax_i(θ_i)
    Apply action a_t and observe reward r_t
    if r_t == 1 then
        α_at = α_at + 1
    else
        β_at = β_at + 1
    end if
end for
In Algorithm 1, the α and β vectors are first initialized.
These vectors represent the number of successes and
failures, respectively, of each arm in the bandit
problem. Then, for each round t from 1 to n, where n
is the total number of rounds, and for each arm i from 1 to
K, where K is the total number of arms, a probability
θ_i is sampled from a Beta distribution with
parameters α_i and β_i. The Beta distribution is
parameterized by α and β, where α represents the
number of successes and β represents the number of
failures. The action a_t that maximizes the sampled
probabilities θ_i is then chosen and applied, and the
resulting reward r_t is observed. In the traditional TS
setting this is a binary reward: 1 for success and 0 for
failure. Finally, the parameters α and β are updated based
on the observed reward r_t: if the reward is 1, which means
success, the corresponding α value is incremented;
otherwise, the corresponding β value is incremented. In
summary, traditional Thompson Sampling is designed
for a Bernoulli MAB problem and places a Beta prior
distribution on the probability of each action
succeeding.
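A minimal Python sketch of this Bernoulli Thompson Sampling loop is given below; the simulated arm probabilities and the Beta(1, 1) uniform prior are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(42)

K = 5                              # number of arms
n = 10_000                         # number of rounds
true_p = rng.uniform(size=K)       # hypothetical Bernoulli success probabilities (unknown to the agent)

alpha = np.ones(K)                 # prior successes for each arm
beta = np.ones(K)                  # prior failures for each arm

total_reward = 0
for t in range(n):
    theta = rng.beta(alpha, beta)      # sample theta_i ~ Beta(alpha_i, beta_i) for each arm
    a_t = int(np.argmax(theta))        # choose the arm with the largest sampled probability
    r_t = rng.binomial(1, true_p[a_t]) # apply the action and observe the binary reward
    if r_t == 1:
        alpha[a_t] += 1                # success: increment alpha of the chosen arm
    else:
        beta[a_t] += 1                 # failure: increment beta of the chosen arm
    total_reward += r_t

print("estimated best arm:", int(np.argmax(alpha / (alpha + beta))))
print("total reward:", total_reward)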
Traditional Thompson Sampling does not fit
the movie recommendation system very well. Since
each genre of movie has a different level of popularity
each year, Thompson Sampling would have to be
resampled whenever the data set changes rapidly. When
the number of samples that can be drawn each time is very
limited, this undoubtedly causes significant
waste. Moreover, a movie rating is not a Bernoulli outcome but