
APPENDIX
Additional Results
Symmetry Breaking. Figure 10 shows the parameter update deviations from the constraints on gradients induced by translation symmetry, along with the associated performance after applying the symmetry-breaking regularization.
We observe trends similar to those for the Leaky ReLU and batch normalization optimizees. Specifically, L2O breaks the constraints by a large margin early in the training run.
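For concreteness, the following is a minimal sketch, not the code behind Figure 10, of how such a deviation could be measured. It assumes a last-layer weight matrix W feeding into a softmax, for which translation symmetry (adding the same vector to every column of W leaves the loss unchanged) forces the gradient, and hence any plain gradient step, to sum to zero over the class dimension; the l2o_step below is a hypothetical stand-in for an update produced by a learned optimizer.

# Sketch (assumption, not the paper's code): measure how far a parameter
# update deviates from the gradient constraint induced by translation
# symmetry. For last-layer weights W entering a softmax, the loss is
# invariant under W -> W + v 1^T, so the gradient summed over the class
# dimension vanishes.
import torch
import torch.nn.functional as F

def translation_symmetry_deviation(update: torch.Tensor) -> torch.Tensor:
    """Norm of the update summed over the class (output) dimension.
    Zero for any update proportional to the gradient; possibly nonzero
    for an update produced by a learned optimizer."""
    return update.sum(dim=1).norm()

torch.manual_seed(0)
W = torch.randn(32, 10, requires_grad=True)       # last-layer weights
x = torch.randn(8, 32)                            # batch of features
y = torch.randint(0, 10, (8,))                    # class labels
F.cross_entropy(x @ W, y).backward()

sgd_step = -0.1 * W.grad                          # plain gradient step
l2o_step = sgd_step + 0.01 * torch.randn_like(W)  # hypothetical stand-in for an L2O update

print(translation_symmetry_deviation(sgd_step))   # ~0, up to float error
print(translation_symmetry_deviation(l2o_step))   # clearly nonzero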