
5.3 Off-Policy Evaluation
There is significant research on off-policy evaluation with offline datasets, covering doubly robust policy evaluation (Dudik et al., 2011), distributionally robust policy gradients (Yang et al., 2023), and variance-minimizing augmentation logging (Tucker and Joachims, 2023). The literature also includes work on neural contextual bandits with UCB-based exploration (Zhou et al., 2020) and PAC-Bayesian approaches (Sakhi et al., 2022). However, a comprehensive examination of how different behavior policies directly influence the efficiency and effectiveness of offline policy learning algorithms remains underexplored.
6 CONCLUSIONS
Our study demonstrates the critical impact of dataset characteristics on offline policy learning for Contextual Multi-Armed Bandits, offering key insights for their practical application. The Neural Greedy algorithm requires datasets with a substantial degree of exploration for effective learning. In practical scenarios, however, it is unreasonable to expect deployed policies to consistently make highly exploratory decisions, e.g., by choosing actions uniformly at random. We therefore advise employing the NeuraLCB method, as it can learn effectively from datasets collected by behavior policies that leverage problem-specific knowledge; its performance improves as the actions in the dataset become closer to optimal. Nonetheless, we show that NeuraLCB still benefits from some exploratory actions, and we recommend ensuring that each action is chosen multiple times in the dataset.
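As a concrete illustration of this last recommendation, the following is a minimal sketch of a per-action coverage check that could be run on a logged bandit dataset before offline training. The function name, the array layout, and the min_count threshold are illustrative assumptions and are not taken from our experimental code.

import numpy as np

def check_action_coverage(logged_actions, n_actions, min_count=10):
    """Check that every arm appears at least min_count times in the logged data.

    logged_actions: 1-D array of arm indices chosen by the behavior policy.
    min_count: illustrative threshold; the paper does not prescribe a specific value.
    """
    counts = np.bincount(np.asarray(logged_actions), minlength=n_actions)
    under_explored = np.flatnonzero(counts < min_count)
    if under_explored.size > 0:
        print(f"Arms with fewer than {min_count} samples: {under_explored.tolist()}")
        return False
    return True

# Example: a 5-arm dataset whose behavior policy never selects arm 4.
rng = np.random.default_rng(0)
actions = rng.integers(0, 4, size=1000)  # arms 0-3 only
check_action_coverage(actions, n_actions=5)  # flags arm 4 as under-represented

If such a check fails, one option in the spirit of the recommendation above is to log additional data for the under-represented actions, e.g., with occasional uniformly random choices, before training the offline learner.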
Future work could tune the proportion of exploratory to near-optimal actions in the dataset for the best performance. The experiments with learned behavior policies in Section 4.3 could also be extended by dropping early, uninformed decisions and examining how this influences the performance of the offline methods.
Our investigation adds a new dimension to the body of knowledge on offline policy learning. While algorithms undoubtedly form the learning engine, our research underscores the importance of fuel quality, that is, the offline dataset, for the journey toward efficient offline policy learning and decision-making. We hope to contribute to the ongoing dialogue on improving the implementation of offline policy learning in real-world scenarios.
REFERENCES
Badanidiyuru, A., Kleinberg, R., and Slivkins, A. (2013). Bandits with Knapsacks. IEEE Computer Society.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight Uncertainty in Neural Networks.
Brandfonbrener, D., Whitney, W. F., Ranganath, R., and Bruna, J. (2021). Offline Contextual Bandits with Overparameterized Models.
Chen, J. and Jiang, N. (2019). Information-Theoretic Considerations in Batch Reinforcement Learning. PMLR.
Dudik, M., Langford, J., and Li, L. (2011). Doubly Robust Policy Evaluation and Learning.
Dutta, P., Cheuk, M. K., Kim, J. S., and Mascaro, M. (2019). AutoML for contextual bandits. CoRR, abs/1909.03212.
Fujimoto, S., Meger, D., and Precup, D. (2019). Off-Policy Deep Reinforcement Learning without Exploration. PMLR.
Hu, B., Xiao, Y., Zhang, S., and Liu, B. (2023). A Data-Driven Solution for Energy Management Strategy of Hybrid Electric Vehicles Based on Uncertainty-Aware Model-Based Offline Reinforcement Learning.
Joachims, T., Swaminathan, A., and de Rijke, M. (2018). Deep learning with logged bandit feedback. OpenReview.net.
Kelly, M., Longjohn, R., and Nottingham, K. The UCI Machine Learning Repository.
Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., and Faisal, A. A. (2018). The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care.
Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative Q-Learning for Offline Reinforcement Learning. Curran Associates, Inc.
Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems.
Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation.
Nguyen-Tang, T., Gupta, S., Nguyen, A. T., and Venkatesh, S. (2022). Offline neural contextual bandits: Pessimism, optimization and generalization.
Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2022). Bridging offline reinforcement learning and imitation learning: A tale of pessimism.
Sakhi, O., Chopin, N., and Alquier, P. (2022). PAC-Bayesian offline contextual bandits with guarantees.
Tucker, A. D. and Joachims, T. (2023). Variance-minimizing augmentation logging for counterfactual evaluation in contextual bandits.
Vannella, F., Jeong, J., and Proutière, A. (2023). Off-policy learning in contextual bandits for remote electrical tilt optimization.