6 CONCLUSION
We show that the algorithms proposed by Nadjahi et al. (2019) are not provably safe and propose a new version that is. We also adapt their ideas to derive a heuristic algorithm that achieves, among the entire SPIBB class on two different benchmarks, both the best mean performance and the best 1%-CVaR performance, the latter being important for safety-critical applications. Furthermore, it is competitive in mean performance with other state-of-the-art uncertainty-incorporating algorithms and, in particular, outperforms them in 1%-CVaR performance. Additionally, the theoretically supported Adv-Approx-Soft-SPIBB performs almost as well as its predecessor Approx-Soft-SPIBB, falling only slightly behind in mean performance.
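As a point of reference, the 1%-CVaR reported here can be estimated empirically as the mean of the worst 1% of returns across evaluation runs. The following minimal Python sketch illustrates this computation; the function name and the NumPy-based setup are ours for illustration and are not part of the paper's implementation:

    import numpy as np

    def empirical_cvar(returns, alpha=0.01):
        """Estimate the alpha-CVaR: the mean of the worst alpha-fraction of returns."""
        returns = np.sort(np.asarray(returns))          # ascending: worst returns first
        k = max(1, int(np.ceil(alpha * len(returns))))  # size of the alpha-tail
        return returns[:k].mean()                       # average over the worst runs

    # Illustrative usage: 10,000 evaluation runs of a trained policy (synthetic data)
    rng = np.random.default_rng(0)
    perf = rng.normal(loc=1.0, scale=0.3, size=10_000)
    print(empirical_cvar(perf, alpha=0.01))             # mean of the 100 worst runs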
The experiments also demonstrate different properties of the two classes of SPI algorithms shown in Figure 1: algorithms that penalize the action-value function tend to perform better in the mean but fall short in the 1%-CVaR, especially when the available data is scarce.
Perhaps the most relevant direction for future work is the application of this framework to continuous MDPs, which Nadjahi et al. (2019) have so far explored only without theoretical safety guarantees. Beyond the theory, we hope that our observations on the two classes of SPI algorithms can inform the choice of algorithms in the continuous case.
REFERENCES
Brafman, R. I. and Tennenholtz, M. (2003). R-MAX - a
general polynomial time algorithm for near-optimal
reinforcement learning. Journal of Machine Learning
Research, 3.
Chow, Y., Tamar, A., Mannor, S., and Pavone, M. (2015).
Risk-sensitive and robust decision-making: a CVaR
optimization approach. In Proceedings of the 28th
International Conference on Neural Information Pro-
cessing Systems.
Dantzig, G. B. (1963). Linear Programming and Exten-
sions. RAND Corporation, Santa Monica, CA.
Fujimoto, S., Meger, D., and Precup, D. (2019). Off-policy
deep reinforcement learning without exploration. In
Proc. of the 36th International Conference on Ma-
chine Learning.
García, J. and Fernández, F. (2015). A Comprehensive Survey on Safe Reinforcement Learning. Journal of Machine Learning Research, 16.
Hans, A., Duell, S., and Udluft, S. (2011). Agent self-
assessment: Determining policy quality without ex-
ecution. In IEEE Symposium on Adaptive Dynamic
Programming and Reinforcement Learning.
Hans, A. and Udluft, S. (2009). Efficient Uncertainty Propa-
gation for Reinforcement Learning with Limited Data.
In Artificial Neural Networks – ICANN, volume 5768.
Lange, S., Gabel, T., and Riedmiller, M. (2012). Batch
Reinforcement Learning. In Reinforcement Learning:
State-of-the-Art, Adaptation, Learning, and Optimiza-
tion. Springer, Berlin, Heidelberg.
Laroche, R., Trichelair, P., and Tachet des Combes, R.
(2019). Safe policy improvement with baseline boot-
strapping. In Proc. of the 36th International Confer-
ence on Machine Learning.
Leurent, E. (2020). Safe and Efficient Reinforcement Learning for Behavioural Planning in Autonomous Driving. PhD thesis, Université de Lille.
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline
reinforcement learning: Tutorial, review, and perspec-
tives on open problems. CoRR, abs/2005.01643.
Nadjahi, K., Laroche, R., and Tachet des Combes, R.
(2019). Safe policy improvement with soft baseline
bootstrapping. In Proc. of the 2019 European Confer-
ence on Machine Learning and Principles and Prac-
tice of Knowledge Discovery in Databases.
Nilim, A. and El Ghaoui, L. (2003). Robustness in Markov
decision problems with uncertain transition matrices.
In Proc. of the 16th International Conference on Neu-
ral Information Processing Systems.
Petrik, M., Ghavamzadeh, M., and Chow, Y. (2016). Safe
policy improvement by minimizing robust baseline re-
gret. In Proceedings of the 30th International Con-
ference on Neural Information Processing Systems,
NIPS’16, Red Hook, NY, USA. Curran Associates
Inc.
Schaefer, A. M., Schneegass, D., Sterzing, V., and Udluft, S.
(2007). A neural reinforcement learning approach to
gas turbine control. In International Joint Conference
on Neural Networks.
Schneegass, D., Hans, A., and Udluft, S. (2010). Uncer-
tainty in Reinforcement Learning - Awareness, Quan-
tisation, and Control. In Robot Learning. Sciyo.
Scholl, P. (2021). Evaluation of safe policy improvement
with soft baseline bootstrapping. Master’s thesis,
Technical University of Munich.
Simão, T. D., Laroche, R., and Tachet des Combes, R.
(2020). Safe Policy Improvement with an Estimated
Baseline Policy. In Proc. of the 19th International
Conference on Autonomous Agents and MultiAgent
Systems.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-
ing: An introduction. MIT press.
Thomas, P. S. (2015). Safe Reinforcement Learning. Doctoral dissertation, University of Massachusetts.
Wang, R., Foster, D., and Kakade, S. M. (2021). What
are the statistical limits of offline RL with linear func-
tion approximation? In International Conference on
Learning Representations.