
Altman, E. (1999). Constrained Markov Decision Processes, volume 7. Routledge.
Amodei, D., Olah, C., Steinhardt, J., Christiano, P. F., Schulman, J., and Mané, D. (2016). Concrete problems in AI safety. CoRR, abs/1606.06565.
Asada, M., Noda, S., Tawaratsumida, S., and Hosoda, K. (1996). Purposive behavior acquisition for a real robot by vision-based reinforcement learning. Mach. Learn., 23(2-3):279–303.
Basu, A., Bhattacharyya, T., and Borkar, V. S. (2008). A learning algorithm for risk-sensitive cost. Math. Oper. Res., 33(4):880–898.
Efroni, Y., Mannor, S., and Pirotta, M. (2020). Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189.
Eimer, T., Biedenkapp, A., Hutter, F., and Lindauer, M. (2021). Self-paced context evaluation for contextual reinforcement learning. In ICML, pages 2948–2958. PMLR.
Eysenbach, B., Gu, S., Ibarz, J., and Levine, S. (2018). Leave no trace: Learning to reset for safe and autonomous reinforcement learning. In ICLR (Poster). OpenReview.net.
Florensa, C., Held, D., Geng, X., and Abbeel, P. (2018). Automatic goal generation for reinforcement learning agents. In ICML, pages 1514–1523. PMLR.
Foglino, F., Christakou, C. C., and Leonetti, M. (2019a). An optimization framework for task sequencing in curriculum learning. In ICDL-EPIROB, pages 207–214. IEEE.
Foglino, F., Leonetti, M., Sagratella, S., and Seccia, R. (2019b). A gray-box approach for curriculum learning. In WCGO, pages 720–729. Springer.
Kadota, Y., Kurano, M., and Yasuda, M. (2006). Discounted Markov decision processes with utility constraints. Comput. Math. Appl., 51(2):279–284.
Klink, P., Abdulsamad, H., Belousov, B., D'Eramo, C., Peters, J., and Pajarinen, J. (2021). A probabilistic interpretation of self-paced learning with applications to reinforcement learning. J. Mach. Learn. Res., 22:182:1–182:52.
Klink, P., Abdulsamad, H., Belousov, B., and Peters, J. (2019). Self-paced contextual reinforcement learning. In CoRL, pages 513–529. PMLR.
Müller, A., Alatur, P., Cevher, V., Ramponi, G., and He, N. (2024). Truly no-regret learning in constrained MDPs. In ICML. OpenReview.net.
Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M. E., and Stone, P. (2020). Curriculum learning for reinforcement learning domains: A framework and survey. J. Mach. Learn. Res., 21:181:1–181:50.
Narvekar, S., Sinapov, J., Leonetti, M., and Stone, P. (2016). Source task creation for curriculum learning. In AAMAS, pages 566–574. ACM.
Narvekar, S. and Stone, P. (2019). Learning curriculum policies for reinforcement learning. In AAMAS, pages 25–33. International Foundation for Autonomous Agents and Multiagent Systems.
Peng, B., MacGlashan, J., Loftin, R. T., Littman, M. L., Roberts, D. L., and Taylor, M. E. (2018). Curriculum design for machine learners in sequential decision tasks. IEEE Trans. Emerg. Top. Comput. Intell., 2(4):268–277.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley.
Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking safe exploration in deep reinforcement learning. Technical report, OpenAI.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. CoRR, abs/1707.06347.
Simão, T. D., Jansen, N., and Spaan, M. T. J. (2021). AlwaysSafe: Reinforcement learning without safety constraint violations during training. In AAMAS, pages 1226–1235. ACM.
Smith, A. E., Coit, D. W., Baeck, T., Fogel, D., and Michalewicz, Z. (1997). Penalty functions. Handbook of Evolutionary Computation, 97(1).
Sun, H., Xu, Z., Fang, M., Peng, Z., Guo, J., Dai, B., and Zhou, B. (2021). Safe exploration by solving early terminated MDP. CoRR, abs/2107.04200.
Tamar, A., Xu, H., and Mannor, S. (2013). Scaling up robust MDPs by reinforcement learning. CoRR, abs/1306.6189.
Tang, J., Singh, A., Goehausen, N., and Abbeel, P. (2010). Parameterized maneuver learning for autonomous helicopter flight. In ICRA, pages 1142–1148. IEEE.
Turchetta, M., Kolobov, A., Shah, S., Krause, A., and Agarwal, A. (2020). Safe reinforcement learning via curriculum induction. In NeurIPS.
Wachi, A., Shen, X., and Sui, Y. (2024). A survey of constraint formulations in safe reinforcement learning. In IJCAI, pages 8262–8271. ijcai.org.
Wu, Y. and Tian, Y. (2017). Training agent for first-person shooter game with actor-critic curriculum learning. In ICLR (Poster). OpenReview.net.
Xu, T., Liang, Y., and Lan, G. (2021). CRPO: A new approach for safe reinforcement learning with convergence guarantee. In ICML, pages 11480–11491. PMLR.
Yang, L., Ji, J., Dai, J., Zhang, L., Zhou, B., Li, P., Yang, Y., and Pan, G. (2022). Constrained update projection approach to safe policy optimization. In NeurIPS.
Yang, Q., Simão, T. D., Jansen, N., Tindemans, S. H., and Spaan, M. T. J. (2023). Reinforcement learning by guided safe exploration. In ECAI, pages 2858–2865. IOS Press.
Yang, T., Rosca, J., Narasimhan, K., and Ramadge, P. J. (2020). Projection-based constrained policy optimization. In ICLR. OpenReview.net.
Zhang, L., Shen, L., Yang, L., Chen, S., Wang, X., Yuan, B., and Tao, D. (2022). Penalized proximal policy optimization for safe reinforcement learning. In IJCAI, pages 3744–3750. ijcai.org.
Zhang, Y., Vuong, Q., and Ross, K. W. (2020). First order constrained optimization in policy space. In NeurIPS.