Discounted UCB and others, as they were strictly outperformed by UCB-Tuned or SW-UCB for all parameter choices tested. We also show only four representative drifting worlds due to space constraints. Across all true worlds, we find in general that a detrending term congruent with the true drift form (e.g., a linear detrend in the linear-drift quadrant of Figure 2) outperforms all other strategies in the long run, producing a zero-regret strategy (Vermorel and Mohri, 2005) for restless bandits where the functional form of restlessness is known. Similarly, we find that using a weighting function which closely approximates the true drift performs well in most cases. Surprisingly, we find that linear detrending is an effective technique for handling the random walk, a result that is robust to variations in the step type and scale of the random walk. Counterintuitively, WLS techniques also perform strongly even when there is no drift.
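To make the detrending idea concrete, the sketch below is a minimal illustration rather than the paper's exact estimator (the function name and toy data are ours): it fits a per-arm linear trend to the observed rewards and reads the fit off at the current time step, so a drifting arm's value is estimated "now" instead of as a historical average.

    import numpy as np

    def linear_detrend_estimate(times, rewards, t_now):
        # Fit reward = a + b*t by ordinary least squares for one arm,
        # then evaluate the fitted line at the current time t_now.
        X = np.column_stack([np.ones_like(times, dtype=float), times])
        coef, *_ = np.linalg.lstsq(X, rewards, rcond=None)  # coef = [a, b]
        return coef[0] + coef[1] * t_now

    # Toy example: an arm whose mean reward drifts linearly upward.
    rng = np.random.default_rng(0)
    t = np.arange(1, 51, dtype=float)
    r = 0.2 + 0.01 * t + rng.normal(0.0, 0.05, size=t.size)
    print(linear_detrend_estimate(t, r, t_now=51.0))  # close to the current mean, ~0.71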
In these experiments, we find no convincing evidence for a general role for detrending with polynomials of degree greater than one, or for autoregression of any order, in our model. Both autoregression and higher-degree polynomials strictly reduce regret if the true world trend is autoregressive or determined, even partially, by the chosen form. We find the linear weighted least squares technique (weights set to the inverse of t) to be the most robust technique over all experiments, suggesting it is the strongest choice when there is no a priori information on the form of drift: it achieves the lowest mean total regret (20.8), the lowest standard deviation across all drift types (11.8), and the lowest worst-case (75th percentile) regret (26.6). Due to space constraints and difficulties reproducing some challenge results, we do not present the PASCAL challenge data here; however, our preliminary results show similar promise for the weighted least squares technique.
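As a hedged illustration of this weighting, the sketch below solves the weighted normal equations directly. The exact indexing convention for the inverse-of-t weights is not restated here, so the sketch assumes each observation is weighted by the inverse of its age (the newest reward receives weight one); that convention, like the function name, is an assumption made for illustration only.

    import numpy as np

    def wls_drift_estimate(times, rewards, t_now):
        # Weighted least squares fit of reward = a + b*t for one arm.
        # Assumed weighting: inverse of the observation's age, so older
        # rewards are monotonically discounted (this convention is an
        # assumption, not a detail taken from the text).
        age = t_now - times + 1.0        # 1.0 for the newest observation
        w = 1.0 / age
        X = np.column_stack([np.ones_like(times, dtype=float), times])
        XtW = X.T * w                    # same as X.T @ diag(w)
        beta = np.linalg.solve(XtW @ X, XtW @ rewards)
        return beta[0] + beta[1] * t_now

The same weights could equally be passed to an off-the-shelf weighted regression routine; the point is only that monotonic discounting reduces to a choice of w.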
6 CONCLUSION
In this work, we have implemented and experimented with integrating time series techniques and weighted least squares with the highly successful Thompson sampling technique to extend the state of the art in handling restless regression bandits. We present evidence that weighted least squares techniques provide a strong solution for ongoing multi-armed bandit optimization in a world of uncertain stationarity, even without an a priori understanding of the form of the drift. The technique presented allows bandits with context to handle drift of diverse forms and operationalizes monotonic discounting in a simple, easy-to-implement regression framework. This provides a viable solution to the ongoing online marketing experimentation problem. Future work will explore how contextual factors improve results for web optimization, perform real-world experiments on online marketing optimization, and derive formal bounds for the interaction between weighted least squares and optimistic Thompson sampling.
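The integration with Thompson sampling is not restated in detail here; purely as a sketch, reusing the hypothetical wls_drift_estimate above and treating each arm's current mean as Gaussian with variance shrinking in the number of pulls, arm selection might look like the following. One common reading of the "optimistic" variant is that a draw is never allowed to fall below its posterior mean; that convention is assumed here.

    import numpy as np

    def choose_arm(histories, t_now, optimistic=True, rng=None):
        # Illustrative sketch: histories holds one (times, rewards) pair
        # per arm.  Each arm's current mean comes from the drift-aware
        # fit; Thompson sampling plays the arm with the largest draw.
        rng = rng or np.random.default_rng()
        draws = []
        for times, rewards in histories:
            mu = wls_drift_estimate(times, rewards, t_now)
            sigma = np.std(rewards) / np.sqrt(len(rewards)) + 1e-6
            s = rng.normal(mu, sigma)
            draws.append(max(s, mu) if optimistic else s)
        return int(np.argmax(draws))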
REFERENCES
Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. CoRR, abs/1111.1797.
Auer, P. (2003). Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research, 3:397–422.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256.
Chapelle, O. and Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257.
Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In COLT, pages 355–366.
Durand, A. and Gagné, C. (2014). Thompson sampling for combinatorial bandits and its application to online feature selection. In AI Workshops.
Eckles, D. and Kaptein, M. (2014). Thompson sampling with the online bootstrap. arXiv preprint arXiv:1410.4009.
Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. (2010). Parametric bandits: the generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594.
Garivier, A. and Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415.
Granmo, O.-C. and Glimsdal, S. (2013). Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game. Applied Intelligence, 38(4):479–488.
Hartland, C., Gelly, S., Baskiotis, N., Teytaud, O., Sebag, M., et al. (2006). Multi-armed bandit, dynamic environments and meta-bandits. In NIPS.
Hussain, Z., Auer, P., Cesa-Bianchi, N., Newnham, L., and Shawe-Taylor, J. (2006). Exploration vs. exploitation PASCAL challenge.
Kaufmann, E., Korda, N., and Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199–213. Springer.
Kocsis, L. and Szepesvári, C. (2006). Discounted UCB. 2nd PASCAL Challenges Workshop.
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.