opment of algorithms for MAB problems involving heavy-tailed data has provided a much more realistic framework leading to higher performance. These works (e.g. (Lee et al., 2020)) investigated best arm identification in MABs under a general assumption on the p-th moments of the stochastic rewards, analyzed tail probabilities of the empirical average, and proposed different bandit algorithms, such as deterministic sequencing of exploration and exploitation (Liu and Zhao, 2011) and the truncated empirical average method (Yu et al., 2018). Instead of tail probabilities, (Dubey and Pentland, 2019) proposed an algorithm based on the symmetric α-stable distribution and demonstrated its success with accurate assumptions and a normalized iterative process.
α-stable distributions are a family of distributions with power-law heavy tails, which can provide a better model of reward distributions (Dubey and Pentland, 2019) and can be applied to the exploration of complex systems. This family of distributions stands out among rival non-Gaussian models (Chen et al., 2016) since it satisfies the generalised central limit theorem. α-stable distributions have become state-of-the-art models for various types of real data, such as financial data (Embrechts et al., 2003), sensor noise (Nguyen et al., 2019), radiation from Poisson fields of point sources (Win et al., 2009), astronomical data (Herranz et al., 2004), and electric disturbances on power lines (Karakuş et al., 2020).
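As a concrete illustration of this family, heavy-tailed and skewed samples can be drawn as in the minimal sketch below; the use of SciPy's levy_stable and the particular parameter values (α = 1.5, β = −0.5) are assumptions of this illustration, not choices taken from the cited works.

import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
alpha, beta = 1.5, -0.5      # tail index alpha < 2 (heavy tails), beta < 0 (left skew)

# Draw 10,000 illustrative rewards from an asymmetric alpha-stable law.
samples = levy_stable.rvs(alpha, beta, loc=0.0, scale=1.0,
                          size=10_000, random_state=rng)

# For alpha < 2 the variance is infinite, so the sample mean converges slowly
# and extreme values dominate -- the regime targeted by heavy-tailed bandits.
print(samples.mean(), np.median(samples))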
Motivated by the presence of asymmetric characteristics in various real-life data (Kuruoglu, 2003) and by the success that introducing asymmetry has brought to reinforcement learning and other directions (Baisero and Amato, 2021), in this work we propose a statistical model in which the reward distribution is both heavy-tailed and asymmetric, together with the corresponding algorithm, named the asymmetric α-Thompson sampling algorithm.
2 BACKGROUND INFORMATION
The multi-armed bandit (Auer et al., 2002) is a theoretical model that has been widely used in machine learning and optimization, and various algorithms have been proposed for its optimal solution when the reward distributions are Gaussian or exponential-family distributions (Korda et al., 2013). However, these reward distributions do not hold for complex systems with impulsive data. For example, when we model stock prices or deal with behaviour in social networks, the interaction data often exhibit heavy tails and negative skewness (Oja, 1981).
2.1 Multi-Armed Bandit Problem
Suppose that several slot machines are available to an agent, who in each round selects one to pull and records the reward. Since the slot machines are not all identical, after multiple rounds of operation we can extract some statistical information about each machine and then select the slot machine with the highest expected reward.
Learning is carried out in rounds indexed by t ∈ [T]. The total number of rounds, called the time horizon T, is known in advance. The problem is iterative: in each round t ∈ [T], the agent picks an arm a_t ∈ [N] and then observes the reward r_{a_t}(t) from that arm. For each arm n ∈ [N], rewards are drawn independently from a distribution D_n with mean µ_n = E_{D_n}[r]. The largest expected reward is denoted by µ* = max_{n∈[N]} µ_n, and the corresponding arm(s) is (are) denoted as the optimal arm(s) n*.
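To make this protocol concrete, the following minimal sketch simulates the interaction; the Gaussian reward distributions D_n and the uniformly random arm choice are illustrative placeholders assumed here, not the algorithm studied in this paper.

import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 1_000                        # number of arms and time horizon
mu = rng.normal(0.0, 1.0, size=N)      # unknown means mu_n of the distributions D_n

chosen = []                            # arms a_t picked by the agent
for t in range(T):
    a_t = rng.integers(N)              # placeholder policy: pick an arm uniformly at random
    r_t = rng.normal(mu[a_t], 1.0)     # observe reward r_{a_t}(t) drawn from D_{a_t}
    chosen.append(a_t)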
To quantify performance, the regret R(T) is used, which is the difference between the ideal total reward the agent could achieve and the total reward the agent actually obtains:

R(T) = µ*T − ∑_{t=1}^{T} µ_{a_t}.    (1)
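Continuing the sketch above, the regret of such a run follows directly from Eq. (1); note that it is computed from the (in practice unknown) arm means µ_n rather than from the noisy observed rewards.

# Regret of the simulated run, following Eq. (1): R(T) = mu* T - sum_t mu_{a_t}.
mu_star = mu.max()
regret = mu_star * T - mu[np.array(chosen)].sum()
print(f"R(T) = {regret:.2f}")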
2.2 Thompson Sampling Algorithm for
Multi-Armed Bandit Problem
There are various exploration algorithms, including the ε-greedy algorithm, the UCB algorithm, and Thompson sampling. The ε-greedy algorithm (Korte and Lovász, 1984) combines exploitation, which takes advantage of prior knowledge, with exploration, which looks for new options, while the UCB algorithm (Cappé et al., 2013) simply pulls the arm with the highest empirical reward estimate up to that point plus a confidence term that shrinks as the arm is played more often. A sketch of both selection rules is given below.
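In this sketch, the √(2 ln t / n) bonus is one standard UCB1 choice and is an assumption of the illustration rather than the exact confidence term used in the cited works.

import numpy as np

def epsilon_greedy(mean_hat, eps, rng):
    """With probability eps explore a uniformly random arm; otherwise exploit
    the arm with the highest empirical mean reward."""
    if rng.random() < eps:
        return int(rng.integers(len(mean_hat)))
    return int(np.argmax(mean_hat))

def ucb1(mean_hat, counts, t):
    """Pull the arm maximizing empirical mean + confidence bonus; the bonus
    shrinks as the arm is played more often."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):                      # play every arm once first
        return int(np.argmin(counts))
    bonus = np.sqrt(2.0 * np.log(t) / counts)
    return int(np.argmax(np.asarray(mean_hat) + bonus))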
Assume that for each arm n ∈ [N], the reward distribution D_n is parameterized by θ_n ∈ Θ (µ_n may not be an appropriate parameter) and that the parameter has a prior probability distribution p(θ_n). The Thompson sampling algorithm updates the prior distribution of θ_n based on the observed rewards for arm n, and then selects the arm based on the derived posterior probability of the reward under arm n.
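The following minimal sketch illustrates this procedure with a conjugate Gaussian prior and likelihood for each θ_n; this choice, and the unit observation noise, are made purely for illustration and differ from the α-stable setting considered in this paper.

import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 1_000
mu = rng.normal(0.0, 1.0, size=N)              # unknown arm means (simulation only)
post_mean, post_var = np.zeros(N), np.ones(N)  # Gaussian prior p(theta_n) = N(0, 1)

for t in range(T):
    theta = rng.normal(post_mean, np.sqrt(post_var))  # sample theta_n from each posterior
    a = int(np.argmax(theta))                         # play the arm with the largest sample
    r = rng.normal(mu[a], 1.0)                        # observe a reward from D_a
    # Conjugate Gaussian posterior update for arm a (unit observation noise assumed).
    prec = 1.0 / post_var[a] + 1.0
    post_mean[a] = (post_mean[a] / post_var[a] + r) / prec
    post_var[a] = 1.0 / prec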
According to Bayes' rule,

p(θ|x) = p(x|θ)p(θ) / p(x) = p(x|θ)p(θ) / ∫ p(x|θ′)p(θ′)dθ′,

where θ is the model parameter and x is the observation; p(θ|x) is the posterior distribution, p(x|θ) is the likelihood function, p(θ) is the prior distribution, and p(x) is the evidence.
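As a concrete instance of this update (a standard conjugate example, not specific to the α-stable setting of this paper), consider Bernoulli rewards r ∈ {0, 1} with a Beta prior p(θ_n) = Beta(a, b). After observing one reward r from arm n, the posterior is

p(θ_n | r) = Beta(a + r, b + 1 − r),

so the Bayes rule above reduces to a simple counter update, which is what makes Thompson sampling computationally convenient in conjugate settings.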