Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. (2017). OpenAI Baselines. https://github.com/openai/baselines.
García, J. and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480.
Hamid, O. and Braun, J. (2019). Reinforcement Learning and Attractor Neural Network Models of Associative Learning, pages 327–349.
Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. (2017). Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298.
Howard, R. A. and Matheson, J. E. (1972). Risk-sensitive Markov decision processes. Management Science, 18(7):356–369.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Koenker, R. and Hallock, K. F. (2001). Quantile regression. Journal of Economic Perspectives, 15(4):143–156.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., and Legg, S. (2017). AI safety gridworlds. arXiv preprint arXiv:1711.09883.
Lesser, K. and Abate, A. (2017). Multi-objective optimal control with safety as a priority. In 2017 ACM/IEEE 8th International Conference on Cyber-Physical Systems (ICCPS), pages 25–36.
Macek, K. (2010). Predictive control via lazy learning and stochastic optimization. In Doktorandské dny 2010 - Sborník doktorandů FJFI, pages 115–122.
Majumdar, A. and Pavone, M. (2017). How should a robot assess risk? Towards an axiomatic theory of risk in robotics. arXiv preprint arXiv:1710.11040.
Miller, C. W. and Yang, I. (2017). Optimal control of conditional value-at-risk in continuous time. SIAM Journal on Control and Optimization, 55(2):856–884.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.
Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. (2012). Parametric return density estimation for reinforcement learning. arXiv preprint arXiv:1203.3497.
Obayashi, M., Uto, S., Kuremoto, T., Mabu, S., and Kobayashi, K. (2015). An extended Q-learning system with emotion state to make up an agent with individuality. 2015 7th International Joint Conference on Computational Intelligence (IJCCI), 3:70–78.
Pflug, G. C. and Pichler, A. (2016). Time-consistent decisions and temporal decomposition of coherent risk functionals. Mathematics of Operations Research, 41(2):682–699.
Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. (2017). Parameter space noise for exploration. arXiv preprint arXiv:1706.01905.
Prashanth, L. (2014). Policy gradients for CVaR-constrained MDPs. In International Conference on Algorithmic Learning Theory, pages 155–169. Springer.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.
Rockafellar, R. T. and Uryasev, S. (2000). Optimization of conditional value-at-risk. Journal of Risk, 2:21–42.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676):354.
Sobel, M. J. (1982). The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge.
Tamar, A., Chow, Y., Ghavamzadeh, M., and Mannor, S. (2017). Sequential decision making with coherent risk. IEEE Transactions on Automatic Control, 62(7):3323–3338.
Tamar, A., Glassner, Y., and Mannor, S. (2015). Optimizing the CVaR via sampling. In AAAI, pages 2993–2999.
Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., and De Freitas, N. (2015). Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581.
Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292.