
dynamic strategy to the SPPO variants is the subject
of future work.
In Fig. 1, a single run of the hopper baseline models is displayed for all three policy optimization methods. The mismatch between distance and speed is quite apparent. Although the two metrics are somewhat better aligned for the SGLD method, the difference between the best-distance and the best-speed checkpoint is still substantial.
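This metric-specific checkpoint selection can be made explicit with a small helper. Below is a minimal sketch in Python, assuming per-checkpoint evaluation logs that record mean distance and mean speed; the class and function names, as well as the numbers in the usage example, are illustrative and not taken from our experiments.

from dataclasses import dataclass

@dataclass
class Checkpoint:
    step: int        # training iteration at which the snapshot was taken
    distance: float  # mean distance travelled before falling (stability proxy)
    speed: float     # mean forward velocity (speed proxy)

def select_best(checkpoints: list[Checkpoint]) -> tuple[Checkpoint, Checkpoint]:
    """Pick the best checkpoint under each metric independently."""
    best_distance = max(checkpoints, key=lambda c: c.distance)
    best_speed = max(checkpoints, key=lambda c: c.speed)
    return best_distance, best_speed

# Illustrative log: the two selections need not (and here do not) coincide.
log = [Checkpoint(100, 310.0, 1.2),
       Checkpoint(200, 280.0, 1.9),
       Checkpoint(300, 150.0, 2.4)]
best_d, best_s = select_best(log)
print(f"best distance at step {best_d.step}, best speed at step {best_s.step}")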
6 CONCLUSIONS
The robustness of RL models is receiving increasing attention in the literature. In our study, we highlighted a new shortcoming of RL policies: speed and stability are not optimized simultaneously, and in order to maximize speed, the model already begins to lose stability in the early stages of training. This phenomenon was present in all three environments and for all three learning algorithms evaluated here. The limitation is effectively addressed by modifying the data collection strategy, which was a key step in increasing the stability of the TRPO algorithm. The modified data collection strategy included a targeted adjustment of the batch size and iteration bounds to ensure that the compute requirements remain comparable to those of the baseline method. Our approach also made better use of the available resources than the baseline. Regardless of the data collection method used, model selection is crucial for maximizing the model score and compensating for the high variance of RL policies.
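A minimal sketch of the compute-matching constraint mentioned above is given below, in Python. It assumes that the sampling budget is measured as batch size times iteration count; all concrete numbers and names are illustrative rather than the values used in our experiments.

def match_compute(baseline_batch: int, baseline_iters: int, new_batch: int) -> int:
    """Choose an iteration bound so that the total number of sampled
    environment steps (batch size x iterations) does not exceed the
    baseline budget."""
    total_steps = baseline_batch * baseline_iters
    # Round down so the modified run never uses more samples than the baseline.
    return total_steps // new_batch

# Illustrative baseline: 2048 steps per iteration for 500 iterations.
iters = match_compute(baseline_batch=2048, baseline_iters=500, new_batch=4096)
print(f"modified run: 4096 steps/iteration for {iters} iterations")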
ACKNOWLEDGEMENTS
On behalf of the SZTE adversarial robustness experiments project, we are grateful for the opportunity to use HUN-REN Cloud (see (Héder et al., 2022); https://science-cloud.hu/), which helped us achieve the results published in this paper.
REFERENCES
Durrant-Whyte, H., Roy, N., and Abbeel, P. (2012). Infinite-
horizon model predictive control for periodic tasks
with contacts. In Robotics: Science and Systems VII,
pages 73–80.
Héder, M., Rigó, E., Medgyesi, D., Lovas, R., Tenczer, S., Török, F., Farkas, A., Emődi, M., Kadlecsik, J., Mező, G., Pintér, Á., and Kacsuk, P. (2022). The past, present and future of the ELKH cloud. Információs Társadalom, 22(2):128.
Kingma, D. P. and Ba, J. (2017). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
Liang, Y., Sun, Y., Zheng, R., and Huang, F. (2022). Ef-
ficient adversarial training without attacking: Worst-
case-aware robust reinforcement learning. In Koyejo,
S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K.,
and Oh, A., editors, Advances in Neural Information
Processing Systems, volume 35, pages 22547–22561.
Curran Associates, Inc.
Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2017a). Trust region policy optimization. CoRR, abs/1502.05477.
Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and
Abbeel, P. (2016). High-dimensional continuous con-
trol using generalized advantage estimation. In Ben-
gio, Y. and LeCun, Y., editors, 4th International Con-
ference on Learning Representations, ICLR 2016, San
Juan, Puerto Rico, May 2-4, 2016, Conference Track
Proceedings.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017b). Proximal policy optimization al-
gorithms. CoRR, abs/1707.06347.
Sun, C.-E., Gao, S., and Weng, T.-W. (2024). Breaking the barrier: Enhanced utility and robustness in smoothed DRL agents. In Proceedings of the 41st International Conference on Machine Learning.
Tassa, Y., Erez, T., and Todorov, E. (2012). Synthesis and
stabilization of complex behaviors through online tra-
jectory optimization. In 2012 IEEE/RSJ International
Conference on Intelligent Robots and Systems, pages
4906–4913.
Todorov, E., Erez, T., and Tassa, Y. (2012). Mujoco:
A physics engine for model-based control. In 2012
IEEE/RSJ International Conference on Intelligent
Robots and Systems, pages 5026–5033.
Zhang, H., Chen, H., Xiao, C., Li, B., Liu, M., Boning, D.,
and Hsieh, C.-J. (2020). Robust deep reinforcement
learning against adversarial perturbations on state ob-
servations. In Larochelle, H., Ranzato, M., Hadsell,
R., Balcan, M., and Lin, H., editors, Advances in Neu-
ral Information Processing Systems, volume 33, pages
21024–21037. Curran Associates, Inc.