The Possibilistic Reward Method and a Dynamic Extension for the Multi-armed Bandit Problem: A Numerical Study
Miguel Martin, Antonio Jiménez-Martín, Alfonso Mateos
2017
Abstract
Different allocation strategies can be found in the literature to deal with the multi-armed bandit problem, either under a frequentist view or from a Bayesian perspective. In this paper, we propose a novel allocation strategy, the possibilistic reward method. First, possibilistic reward distributions are used to model the uncertainty about the expected reward of each arm; these are then converted into probability distributions using a pignistic probability transformation. Finally, a simulation experiment is carried out to identify the arm with the highest expected reward, which is then pulled. A parametric probability transformation of the proposed possibilistic reward distributions is then introduced, together with a dynamic optimization, so that neither prior knowledge nor a simulation of the arm distributions is required. A numerical study shows that the proposed method outperforms other policies in the literature in five scenarios: Bernoulli distributions with very low success probabilities, with success probabilities close to 0.5, and with success probabilities close to 0.5 and Gaussian rewards; and Poisson and exponential distributions truncated in [0, 10].
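The abstract describes an algorithmic loop: build a possibility distribution over each arm's unknown expected reward, turn it into a probability distribution with the pignistic transformation, simulate one candidate expected reward per arm, and pull the arm with the largest value. The sketch below only illustrates that loop under explicit assumptions, not the paper's exact construction: the possibility degrees are taken from a Hoeffding-type bound, the mean-reward domain [0, 1] is discretized on a grid, and the arms are Bernoulli; the names pignistic_from_possibility, pr_bandit and grid_size are hypothetical, and the parametric transformation and dynamic extension studied in the paper are not reproduced here.

```python
import numpy as np


def pignistic_from_possibility(poss):
    """Pignistic transform of a discretized possibility distribution.

    With the possibility degrees sorted decreasingly, pi_(1) >= ... >= pi_(m)
    and pi_(m+1) = 0, the value ranked i receives
    BetP(x_(i)) = sum_{j >= i} (pi_(j) - pi_(j+1)) / j.
    """
    order = np.argsort(-poss)                       # grid indices by decreasing possibility
    pi_sorted = poss[order]
    diffs = pi_sorted - np.append(pi_sorted[1:], 0.0)
    ranks = np.arange(1, poss.size + 1)
    betp_sorted = np.cumsum((diffs / ranks)[::-1])[::-1]
    betp = np.empty_like(poss)
    betp[order] = betp_sorted
    return betp / betp.sum()                        # guard against rounding error


def pr_bandit(arm_means, horizon, grid_size=101, seed=None):
    """One run of a possibilistic-reward-style policy on Bernoulli arms.

    Assumption of this sketch: the possibility of a candidate mean mu for an
    arm with n pulls and sample mean m is exp(-2 * n * (mu - m)^2), a
    Hoeffding-type degree; the paper's possibilistic reward distributions may differ.
    """
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    grid = np.linspace(0.0, 1.0, grid_size)         # discretized domain of expected rewards
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    rewards = []

    for t in range(horizon):
        if t < k:                                   # initialisation: pull each arm once
            choice = t
        else:
            samples = np.empty(k)
            for i in range(k):
                mean = sums[i] / counts[i]
                poss = np.exp(-2.0 * counts[i] * (grid - mean) ** 2)
                probs = pignistic_from_possibility(poss)
                samples[i] = rng.choice(grid, p=probs)   # simulated expected reward
            choice = int(np.argmax(samples))        # pull the arm with the best sample
        reward = float(rng.random() < arm_means[choice])
        counts[choice] += 1
        sums[choice] += reward
        rewards.append(reward)
    return np.array(rewards)


if __name__ == "__main__":
    # Toy experiment: three Bernoulli arms with low success probabilities.
    rewards = pr_bandit([0.05, 0.08, 0.12], horizon=5000, seed=0)
    print("average reward:", rewards.mean())
```

Sampling from the pignistic distribution plays the role of the simulation experiment mentioned in the abstract; in this toy policy, exploration fades naturally as the possibility distributions concentrate around the sample means with more pulls.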
References
- Agrawal, R. (1995). Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054-1078.
- Audibert, J.-Y. and Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785-2836.
- Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 410:1876-1902.
- Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256.
- Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55-65.
- Baransi, A., Maillard, O., and Mannor, S. (2014). Sub-sampling for multi-armed bandits. In Proceedings of the European Conference on Machine Learning, page 13.
- Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London.
- Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122-142.
- Cappé, O., Garivier, A., Maillard, O., Munos, R., and Stoltz, G. (2013). Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41:1516-1541.
- Chapelle, O. and Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249-2257.
- Dubois, D., Foulloy, L., Mauris, G., and Prade, H. (2004). Probability-possibility transformations, triangular fuzzy sets, and probabilistic inequalities. Reliable Computing, 10:273-297.
- Dupont, P. (1978). Laplace and the indifference principle in the 'Essai philosophique des probabilités'. Rendiconti del Seminario Matematico Università e Politecnico di Torino, 36:125-137.
- Garivier, A. and Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. Technical report, arXiv preprint arXiv:1102.2490.
- Gittins, J. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41:148-177.
- Gittins, J. (1989). Multi-armed Bandit Allocation Indices. Wiley Interscience Series in Systems and Optimization. John Wiley and Sons Inc., New York, USA.
- Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30.
- Holland, J. (1992). Adaptation in Natural and Artificial Systems. MIT Press/Bradford Books, Cambridge, MA, USA.
- Honda, J. and Takemura, A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of the 23rd Annual Conference on Learning Theory, pages 67-79.
- Kaufmann, E., Cappé, O., and Garivier, A. (2012). On Bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 592-600.
- Lai, T. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22.
- Maillard, O., Munos, R., and Stoltz, G. (2011). Finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Proceedings of the 24th Annual Conference on Learning Theory, pages 497-514.
- Smets, P. (2000). Data fusion in the transferable belief model. In Proceedings of the Third International Conference on Information Fusion, volume 1, pages 21-33.
- Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.
Paper Citation
in Harvard Style
Martin M., Jiménez-Martín A. and Mateos A. (2017). The Possibilistic Reward Method and a Dynamic Extension for the Multi-armed Bandit Problem: A Numerical Study. In Proceedings of the 6th International Conference on Operations Research and Enterprise Systems - Volume 1: ICORES, ISBN 978-989-758-218-9, pages 75-84. DOI: 10.5220/0006186400750084
in BibTeX Style
@conference{icores17,
author={Miguel Martin and Antonio Jiménez-Martín and Alfonso Mateos},
title={The Possibilistic Reward Method and a Dynamic Extension for the Multi-armed Bandit Problem: A Numerical Study},
booktitle={Proceedings of the 6th International Conference on Operations Research and Enterprise Systems - Volume 1: ICORES},
year={2017},
pages={75-84},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006186400750084},
isbn={978-989-758-218-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 6th International Conference on Operations Research and Enterprise Systems - Volume 1: ICORES
TI - The Possibilistic Reward Method and a Dynamic Extension for the Multi-armed Bandit Problem: A Numerical Study
SN - 978-989-758-218-9
AU - Martin M.
AU - Jiménez-Martín A.
AU - Mateos A.
PY - 2017
SP - 75
EP - 84
DO - 10.5220/0006186400750084