
named ViGIS-P and ViGIS-L. ViGIS-P requires prior
knowledge of the transition function but does not need
to simulate constraint violations, whereas ViGIS-L
needs violations to be performed in silico, but does
not require prior knowledge of the transition function. Since ViGIS-L already requires simulated violations, one might ask why both phases are not simply performed in simulation. This is certainly possible, but separating safety learning from reward learning allows the agent to learn the reward on-site, where the reward signal is more accurate.
Both ViGIS approaches were tested in two discrete
environments and ViGIS-L was also tested in a con-
tinuous environment.
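To make the two-phase separation concrete, the sketch below shows one way such a scheme could be structured for a small finite MDP: phase 1 identifies a set of safe state-action pairs (here via a deliberately simplified one-step check through a known transition function, in the spirit of ViGIS-P), and phase 2 runs ordinary Q-learning on the real environment restricted to that set. The names and interfaces (transition, is_violation, env.reset, env.step) are assumptions for illustration, not the paper's actual implementation.

import random
from collections import defaultdict

def phase1_safe_pairs(states, actions, transition, is_violation):
    """Phase 1 (ViGIS-P flavour): with a known transition function, keep a
    state-action pair only if it cannot reach a violating state, so no
    violation ever has to be executed or simulated."""
    safe = set()
    for s in states:
        for a in actions:
            successors = transition(s, a)  # assumed: dict of next_state -> probability
            if not any(p > 0 and is_violation(s2) for s2, p in successors.items()):
                safe.add((s, a))
    return safe

def phase2_reward_learning(env, safe_pairs, actions, episodes=500,
                           alpha=0.1, gamma=0.99, eps=0.1):
    """Phase 2: plain Q-learning on the real environment, exploring only
    actions that phase 1 judged safe (falling back to all actions when a
    state has no safe action, i.e. a safe-policy dead end)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            allowed = [a for a in actions if (s, a) in safe_pairs] or list(actions)
            if random.random() < eps:
                a = random.choice(allowed)            # exploration within the safe set
            else:
                a = max(allowed, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)             # assumed (state, reward, done) interface
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q

A ViGIS-L-style phase 1 would instead estimate which pairs are unsafe from sampled transitions, which is why its violations must occur in silico, and would need no access to transition at all.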
Both ViGIS-P and ViGIS-L showed fewer con-
straint violations than the regular and β-pessimistic
Q-Learning agents across all environments. This improved safety adherence often led the ViGIS agents to achieve a lower average reward. ViGIS does encounter the problem of safe-policy dead ends, but several methods of handling these dead ends were assessed, and choosing the safest action was found to significantly reduce the number of violations committed. In some environments and configurations it is also possible for ViGIS to avoid violations entirely, as was shown for the No Action ViGIS agents in the bank robber and taxi world environments. ViGIS-L committed 28% fewer constraint violations than the regular DDQN agent in the cart pole environment, demonstrating its potential in the realm of deep RL.
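The dead-end fallback summarised above can be illustrated with a short sketch. Here greedy_action and violation_estimate are hypothetical stand-ins for the agent's usual policy over a candidate set and for a learned per-action violation measure such as ViGIS-L's $\hat{V}_C$; neither name is taken from the paper.

def choose_action(state, actions, safe_pairs, greedy_action, violation_estimate):
    """Pick among safe actions when possible; in a safe-policy dead end,
    fall back to the action with the lowest estimated violation measure."""
    safe = [a for a in actions if (state, a) in safe_pairs]
    if safe:
        return greedy_action(state, safe)  # normal case: act within the safe set
    # Dead end: no action is judged safe, so take the least-violating one.
    return min(actions, key=lambda a: violation_estimate(state, a))

Compared with refusing to act or picking an arbitrary action, this "safest action" fallback is the variant that the experiments above found to significantly reduce the number of violations committed.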
There are several avenues along which the research in this paper can be extended in future work:
• Scalability: investigate methods to optimize
ViGIS-P for larger state spaces, potentially
through approximation or parallelism.
• Deep RL: assess the performance of ViGIS with a
broader range of deep RL architectures.
• Normalising $\hat{V}_C$: develop analytical methods for normalising ViGIS-L's violation measure.
• Hybrid Approaches: explore combinations of
ViGIS with other safe RL techniques to further
enhance safety and performance.
• Environments: assess the performance of ViGIS in more environments.
• Benchmarking: compare ViGIS to additional state-of-the-art safe RL algorithms.