opment of algorithms for MAB problems involving heavy-tailed data has provided a much more realistic framework leading to higher performance. These works (e.g. (Lee et al., 2020)) investigated best arm identification in MABs under a general assumption on the p-th moments of the stochastic rewards, analyzed tail probabilities of the empirical average, and proposed different bandit algorithms, such as deterministic sequencing of exploration and exploitation (Liu and Zhao, 2011) and the truncated empirical average method (Yu et al., 2018). Instead of tail probabilities, (Dubey and Pentland, 2019) proposed an algorithm based on the symmetric α-stable distribution and demonstrated its success with accurate assumptions and a normalized iterative process.
α-stable distributions are a family of distributions with power-law heavy tails, which can provide a better model of reward distributions (Dubey and Pentland, 2019) and can be applied to the exploration of complex systems. This family of distributions stands out among rival non-Gaussian models (Chen et al., 2016) since it satisfies the generalised central limit theorem. α-stable distributions have become state-of-the-art models for various types of real data, such as financial data (Embrechts et al., 2003), sensor noise (Nguyen et al., 2019), radiation from Poisson fields of point sources (Win et al., 2009), astronomical data (Herranz et al., 2004), and electric disturbances on power lines (Karakuş et al., 2020).
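As a concrete illustration of this family, heavy-tailed and skewed samples can be drawn as in the minimal sketch below; the use of SciPy's levy_stable and the particular parameter values (α = 1.5, β = −0.5) are assumptions of this illustration, not choices taken from the cited works.

import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
alpha, beta = 1.5, -0.5      # tail index alpha < 2 (heavy tails), beta < 0 (left skew)

# Draw 10,000 illustrative rewards from an asymmetric alpha-stable law.
samples = levy_stable.rvs(alpha, beta, loc=0.0, scale=1.0,
                          size=10_000, random_state=rng)

# For alpha < 2 the variance is infinite, so the sample mean converges slowly
# and extreme values dominate -- the regime targeted by heavy-tailed bandits.
print(samples.mean(), np.median(samples))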
Motivated by the presence of asymmetric characteristics in various real-life data (Kuruoglu, 2003) and by the success that introducing asymmetry has brought to reinforcement learning and other directions (Baisero and Amato, 2021), in this work we propose a statistical model in which the reward distribution is both heavy-tailed and asymmetric, together with the corresponding algorithm, named the asymmetric α-Thompson sampling algorithm.
2 BACKGROUND INFORMATION
The multi-armed bandit (Auer et al., 2002) is a theoretical model that has been widely used in machine learning and optimization, and various algorithms have been proposed for its optimal solution when the reward distributions are Gaussian or exponential-family distributions (Korda et al., 2013). However, these reward distributions do not hold for complex systems with impulsive data. For example, when we model stock prices or deal with behaviour in social networks, the interaction data often exhibit heavy tails and negative skewness (Oja, 1981).
2.1 Multi-Armed Bandit Problem
Suppose that several slot machines are available to an agent, who in each round selects one to pull and records the reward. Since the slot machines are not all identical, after multiple rounds of operation we can extract some statistical information about each machine and then select the slot machine with the highest expected reward.
Learning is carried out in rounds indexed by t ∈ [T]. The total number of rounds, called the time horizon T, is known in advance. The problem is iterative: in each round t ∈ [T], the agent picks an arm a_t ∈ [N] and then observes the reward r_{a_t}(t) from that arm. For each arm n ∈ [N], rewards are drawn independently from a distribution D_n with mean µ_n = E_{D_n}[r]. The largest expected reward is denoted by µ* = max_{n∈[N]} µ_n, and the corresponding arm(s) is (are) denoted as the optimal arm(s) n*.
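To make this protocol concrete, the following minimal sketch simulates the interaction; the Gaussian reward distributions D_n and the uniformly random arm choice are illustrative placeholders assumed here, not the algorithm studied in this paper.

import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 1_000                        # number of arms and time horizon
mu = rng.normal(0.0, 1.0, size=N)      # unknown means mu_n of the distributions D_n

chosen = []                            # arms a_t picked by the agent
for t in range(T):
    a_t = rng.integers(N)              # placeholder policy: pick an arm uniformly at random
    r_t = rng.normal(mu[a_t], 1.0)     # observe reward r_{a_t}(t) drawn from D_{a_t}
    chosen.append(a_t)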
To quantify performance, the regret R(T) is used, which is the difference between the ideal total reward the agent could achieve and the total reward the agent actually obtains:

R(T) = µ*T − ∑_{t=1}^{T} µ_{a_t}.    (1)
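Continuing the sketch above, the regret of such a run follows directly from Eq. (1); note that it is computed from the (in practice unknown) arm means µ_n rather than from the noisy observed rewards.

# Regret of the simulated run, following Eq. (1): R(T) = mu* T - sum_t mu_{a_t}.
mu_star = mu.max()
regret = mu_star * T - mu[np.array(chosen)].sum()
print(f"R(T) = {regret:.2f}")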
2.2 Thompson Sampling Algorithm for
Multi-Armed Bandit Problem
There are various exploration algorithms, including the ε-greedy algorithm, the UCB algorithm, and Thompson sampling. The ε-greedy algorithm (Korte and Lovász, 1984) combines exploitation, which takes advantage of prior knowledge, with exploration, which looks for new options, while the UCB algorithm (Cappé et al., 2013) simply pulls the arm with the highest empirical reward estimate up to that point plus a confidence term that shrinks as the arm is played more often. A sketch of both selection rules is given below.
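In this sketch, the √(2 ln t / n) bonus is one standard UCB1 choice and is an assumption of the illustration rather than the exact confidence term used in the cited works.

import numpy as np

def epsilon_greedy(mean_hat, eps, rng):
    """With probability eps explore a uniformly random arm; otherwise exploit
    the arm with the highest empirical mean reward."""
    if rng.random() < eps:
        return int(rng.integers(len(mean_hat)))
    return int(np.argmax(mean_hat))

def ucb1(mean_hat, counts, t):
    """Pull the arm maximizing empirical mean + confidence bonus; the bonus
    shrinks as the arm is played more often."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):                      # play every arm once first
        return int(np.argmin(counts))
    bonus = np.sqrt(2.0 * np.log(t) / counts)
    return int(np.argmax(np.asarray(mean_hat) + bonus))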
Assume that for each arm n ∈ [N], the reward distribution D_n is parameterized by θ_n ∈ Θ (µ_n may not be an appropriate parameter) and that the parameter has a prior probability distribution p(θ_n). The Thompson sampling algorithm updates the prior distribution of θ_n based on the observed rewards for arm n, and then selects the arm based on the derived posterior probability of the reward under arm n.
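The following minimal sketch illustrates this procedure with a conjugate Gaussian prior and likelihood for each θ_n; this choice, and the unit observation noise, are made purely for illustration and differ from the α-stable setting considered in this paper.

import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 1_000
mu = rng.normal(0.0, 1.0, size=N)              # unknown arm means (simulation only)
post_mean, post_var = np.zeros(N), np.ones(N)  # Gaussian prior p(theta_n) = N(0, 1)

for t in range(T):
    theta = rng.normal(post_mean, np.sqrt(post_var))  # sample theta_n from each posterior
    a = int(np.argmax(theta))                         # play the arm with the largest sample
    r = rng.normal(mu[a], 1.0)                        # observe a reward from D_a
    # Conjugate Gaussian posterior update for arm a (unit observation noise assumed).
    prec = 1.0 / post_var[a] + 1.0
    post_mean[a] = (post_mean[a] / post_var[a] + r) / prec
    post_var[a] = 1.0 / prec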
According to Bayes' rule,

p(θ|x) = p(x|θ)p(θ) / p(x) = p(x|θ)p(θ) / ∫ p(x|θ′)p(θ′)dθ′,

where θ is the model parameter and x is the observation; p(θ|x) is the posterior distribution, p(x|θ) is the likelihood function, p(θ) is the prior distribution, and p(x) is the evidence.
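As a concrete instance of this update (a standard conjugate example, not specific to the α-stable setting of this paper), consider Bernoulli rewards r ∈ {0, 1} with a Beta prior p(θ_n) = Beta(a, b). After observing one reward r from arm n, the posterior is

p(θ_n | r) = Beta(a + r, b + 1 − r),

so the Bayes rule above reduces to a simple counter update, which is what makes Thompson sampling computationally convenient in conjugate settings.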