Improving Online Marketing Experiments with Drifting Multi-armed Bandits
Giuseppe Burtini, Jason Loeppky, Ramon Lawrence
2015
Abstract
Restless bandits model the exploration vs. exploitation trade-off in a changing (non-stationary) world. They have been studied both in the context of continuously-changing (drifting) and change-point (sudden) restlessness. In this work, we study specific classes of drifting restless bandits selected for their relevance to modelling an online website optimization process. The contribution of this work is a simple, feasible weighted least squares technique capable of utilizing contextual arm parameters while treating the parameter space as drifting (non-stationary) within reasonable bounds. We produce a reference implementation, then evaluate and compare its performance in several different true world states, finding experimentally that performance is robust to time-drifting factors similar to those seen in many real-world cases.
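To make the weighted least squares idea concrete, the sketch below shows one plausible reading of the abstract: per-arm ridge-regularized least squares whose past observations are exponentially discounted, combined with Thompson-style sampling for arm selection. This is a hedged illustration under stated assumptions, not the paper's reference implementation; the class name, the hyperparameters (discount, ridge, noise) and the choice of Thompson-style selection are illustrative.

```python
# A minimal sketch of a discounted weighted-least-squares contextual bandit
# with Thompson-style arm selection.  This is NOT the authors' reference
# implementation: the exponential discount, ridge penalty, noise scale and
# all names below are illustrative assumptions.
import numpy as np


class WeightedLSBandit:
    def __init__(self, n_arms, n_features, discount=0.99, ridge=1.0, noise=1.0):
        self.n_arms = n_arms
        self.d = n_features
        self.discount = discount  # exponentially down-weights old observations
        self.ridge = ridge        # Tikhonov term keeps each solve well-posed
        self.noise = noise        # assumed reward-noise scale for sampling
        # Per-arm sufficient statistics of the weighted ridge regression:
        #   A = sum_t w_t x_t x_t^T + ridge * I,   b = sum_t w_t r_t x_t
        self.A = [ridge * np.eye(self.d) for _ in range(n_arms)]
        self.b = [np.zeros(self.d) for _ in range(n_arms)]

    def select(self, context):
        """Sample one coefficient vector per arm from an approximate posterior
        and play the arm whose sampled mean reward is largest."""
        scores = []
        for a in range(self.n_arms):
            A_inv = np.linalg.inv(self.A[a])
            beta_hat = A_inv @ self.b[a]          # weighted least squares estimate
            beta = np.random.multivariate_normal(beta_hat, self.noise ** 2 * A_inv)
            scores.append(context @ beta)
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        """Discount every arm's statistics (so estimates can track drift even
        for arms not played this round), then add the new observation."""
        for a in range(self.n_arms):
            self.A[a] = self.discount * self.A[a] + (1 - self.discount) * self.ridge * np.eye(self.d)
            self.b[a] = self.discount * self.b[a]
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context


# Illustrative usage against a synthetic, slowly drifting environment.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bandit = WeightedLSBandit(n_arms=3, n_features=4)
    true_betas = rng.normal(size=(3, 4))
    for t in range(1000):
        true_betas += 0.01 * rng.normal(size=true_betas.shape)  # parameter drift
        x = rng.normal(size=4)
        arm = bandit.select(x)
        reward = x @ true_betas[arm] + rng.normal(scale=0.5)
        bandit.update(arm, x, reward)
```

One design note on this sketch: discounting all arms every round, rather than only the arm just played, lets a stale estimate lose weight while its arm goes unplayed, which is one natural way to keep estimates tracking a bounded drift.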
Paper Citation
in Harvard Style
Burtini G., Loeppky J. and Lawrence R. (2015). Improving Online Marketing Experiments with Drifting Multi-armed Bandits. In Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-096-3, pages 630-636. DOI: 10.5220/0005458706300636
in Bibtex Style
@conference{iceis15,
author={Giuseppe Burtini and Jason Loeppky and Ramon Lawrence},
title={Improving Online Marketing Experiments with Drifting Multi-armed Bandits},
booktitle={Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2015},
pages={630-636},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005458706300636},
isbn={978-989-758-096-3},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Improving Online Marketing Experiments with Drifting Multi-armed Bandits
SN - 978-989-758-096-3
AU - Burtini G.
AU - Loeppky J.
AU - Lawrence R.
PY - 2015
SP - 630
EP - 636
DO - 10.5220/0005458706300636