Figure 2: Test result for K = 10, δ = 0.1.
Figure 3: Test result for K = 100, δ = 0.1.
Figure 4: Test result for K = 100, δ = 0.01.
Based on the test results, under the settings K = 100, δ = 0.1; K = 10, δ = 0.1; and K = 10, δ = 0.01, Thompson Sampling performs markedly better than the UCB algorithm. For UCB-type index policies, Agrawal (1995) proved that regret grows on a logarithmic scale, and Figure 2 shows the observed regret following such a logarithmic trend. It is also worth noting that under the K = 100, δ = 0.01 environment class, Thompson Sampling performs about the same as the Upper Confidence Bound algorithm.
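For concreteness, the comparison can be reproduced with a short simulation along the following lines. This is a minimal sketch rather than the exact experimental code: it assumes a Bernoulli environment in which one arm has mean 0.5 and the remaining K - 1 arms have mean 0.5 - δ, a Beta(1, 1) prior for Thompson Sampling, and the UCB1 index of Auer et al. (2002); the function name run_bandit, the horizon, and the reward means are illustrative assumptions.

import numpy as np

def run_bandit(K=10, delta=0.1, horizon=10000, algo="ts", seed=0):
    # Simulate one run on a Bernoulli bandit and return cumulative pseudo-regret.
    # Assumed environment: one optimal arm with mean 0.5, K-1 arms with mean 0.5 - delta.
    rng = np.random.default_rng(seed)
    means = np.full(K, 0.5 - delta)
    means[0] = 0.5
    counts = np.zeros(K)      # number of pulls per arm
    sums = np.zeros(K)        # sum of observed rewards per arm
    alpha = np.ones(K)        # Beta posterior parameters (prior Beta(1, 1))
    beta = np.ones(K)
    regret = np.zeros(horizon)
    for t in range(horizon):
        if algo == "ts":
            # Thompson Sampling: sample a mean from each posterior, play the best sample.
            arm = int(np.argmax(rng.beta(alpha, beta)))
        else:
            # UCB1: play each arm once, then play the arm with the highest index.
            if t < K:
                arm = t
            else:
                ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
                arm = int(np.argmax(ucb))
        reward = float(rng.random() < means[arm])
        counts[arm] += 1
        sums[arm] += reward
        alpha[arm] += reward          # posterior update: one more success
        beta[arm] += 1.0 - reward     # posterior update: one more failure
        regret[t] = means.max() - means[arm]
    return np.cumsum(regret)

ts_regret = run_bandit(K=10, delta=0.1, algo="ts")
ucb_regret = run_bandit(K=10, delta=0.1, algo="ucb")
print(ts_regret[-1], ucb_regret[-1])

Plotting the returned cumulative regret curves for the settings listed above produces figures of the kind shown in Figures 2-4.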
5 CONCLUSIONS
Thompson Sampling outperforms the UCB algorithm under every setting except K = 100, δ = 0.01. Part of the reason for Thompson Sampling's smaller regret is the posterior update: after each action, the posterior distribution moves closer to the true reward distribution, so subsequent actions incur less regret.
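As a concrete illustration of this mechanism, the following sketch shows the standard Beta-Bernoulli posterior update assumed here (illustrative only; the true_mean value and the number of observations are arbitrary choices, not values from the experiments).

import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 1.0, 1.0            # Beta(1, 1) = uniform prior over the arm's mean
true_mean = 0.7                   # unknown to the learner

for _ in range(1000):
    reward = float(rng.random() < true_mean)   # Bernoulli reward
    alpha += reward                            # one more observed success
    beta += 1.0 - reward                       # one more observed failure

# The posterior mean alpha / (alpha + beta) concentrates around the true mean,
# so posterior samples used to select actions become increasingly accurate.
print(alpha / (alpha + beta))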
It is worth noting that although Thompson Sampling generally performs well, the K = 100, δ = 0.01 setting is an exception: both algorithms achieve very similar cumulative regret, and Thompson Sampling is in fact slightly worse than the UCB algorithm. A question for future work is why the two algorithms perform so similarly when the number of arms is large and the difference between reward probabilities is small.
REFERENCES
Agrawal, R. (1995). Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4), 1054-1078.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235-256.
Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of
stochastic and nonstochastic multi-armed bandit
problems. Foundations and Trends® in Machine
Learning, 5(1), 1-122.
Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24.
Lattimore, T., & Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
Mahajan, A., & Teneketzis, D. (2008). Multi-armed bandit
problems. In Foundations and applications of sensor
management (pp. 121-151). Boston, MA: Springer US.
Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1), 1-96.