dynamic strategy to the SPPO variants is the subject
of future work.
In Fig. 1, a single run of the hopper baseline models
is displayed for all three policy optimization methods.
The mismatch between distance and speed is quite
apparent. Although the two metrics are somewhat
better aligned for the SGLD method, the difference
between the best distance and best speed checkpoint
is substantial.
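The gap between the two selection criteria can be quantified per run with a small helper. The following is a minimal sketch, not the evaluation code used in the paper: the `Checkpoint` record, the function names, and the numeric values are hypothetical and serve only to illustrate how the best-distance and best-speed checkpoints of a single run may be compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical record of one evaluated checkpoint of a training run."""
    iteration: int
    distance: float  # evaluation distance at this checkpoint
    speed: float     # evaluation speed at this checkpoint


def best_by(checkpoints, metric):
    """Return the checkpoint with the highest value of the given metric."""
    return max(checkpoints, key=lambda c: getattr(c, metric))


def selection_gap(checkpoints):
    """Compare the best-distance and best-speed checkpoints of one run."""
    best_d = best_by(checkpoints, "distance")
    best_s = best_by(checkpoints, "speed")
    return {
        "best_distance_iteration": best_d.iteration,
        "best_speed_iteration": best_s.iteration,
        # Distance given up when the checkpoint is selected for speed.
        "distance_lost_at_best_speed": best_d.distance - best_s.distance,
        # Speed given up when the checkpoint is selected for distance.
        "speed_lost_at_best_distance": best_s.speed - best_d.speed,
    }


# Made-up values for illustration only; they are not results from the paper.
run = [
    Checkpoint(100, 40.0, 1.2),
    Checkpoint(200, 55.0, 1.9),
    Checkpoint(300, 48.0, 2.6),
    Checkpoint(400, 30.0, 3.1),
]
gap = selection_gap(run)
```

In this toy run the two criteria pick different checkpoints, which is the kind of mismatch visible in Fig. 1. Reporting both gaps makes explicit how much of one metric is sacrificed when the other drives model selection.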
The robustness of RL models is receiving more at-
tention in the literature. In our study, we highlighted
a new shortcoming of RL policies: speed and stability
are not optimized simultaneously, and as the model
maximizes speed, its stability already starts to degrade
in the initial stages of training. This phenomenon was
present for all three environments and all three
learning algorithms that we
evaluated here. The limitation is effectively addressed
by modifying the data collection strategy, which was a
key step in increasing the stability of the TRPO
algorithm. The modification of the data collection
strategy included a targeted adjustment of the batch
size and iteration bounds to ensure that the compute
requirements are similar to those of the baseline
method. Our approach also made better use of
available resources than
the baseline method. Regardless of the data collection
method used, model selection is crucial to maximize
the model score and compensate for the high variance
of RL policies.
On behalf of the SZTE adversarial robustness
experiments project we are grateful for the possibility
to use HUN-REN Cloud (see Héder et al., 2022;
https://science-cloud.hu/) which helped us achieve the
results published in this paper.
