
Figure 6: Layer-wise training of Llama2-20M for vanilla FO (yellow) versus vanilla ZO (blue). Training the full-size model with vanilla FO is shown in purple.
is performed, the trained layer is frozen, and the next layer is unfrozen and reinitialized. The first block is the embedding layer, which contains 10M parameters and is trained with vanilla FO for two epochs. The second block is trained for one epoch with vanilla FO versus vanilla ZO, and so on. Figure 5 shows that ZO can match FO in this smaller-parameter setting. Moreover, this result suggests ways to improve ZO for pre-training, for instance by adding a momentum term suited to ZO. A minimal sketch of this blockwise schedule is given below.
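The following is a minimal PyTorch sketch of such a blockwise schedule, showing only the ZO path (the FO path would simply backpropagate through the same unfrozen block). The names model_blocks, data_loader, loss_fn and the step sizes lr and eps are hypothetical placeholders, not the exact setup used in the experiments.

# Minimal sketch of blockwise training with a two-point (SPSA-style) ZO update.
# model_blocks, data_loader, loss_fn, lr and eps are illustrative placeholders.
import torch

def zo_spsa_step(params, closure, lr=1e-4, eps=1e-3):
    # One two-point ZO update on the currently trainable parameters.
    with torch.no_grad():
        zs = [torch.randn_like(p) for p in params]    # one random direction
        for p, z in zip(params, zs):                  # evaluate at theta + eps*z
            p.add_(z, alpha=eps)
        loss_plus = closure()
        for p, z in zip(params, zs):                  # evaluate at theta - eps*z
            p.add_(z, alpha=-2.0 * eps)
        loss_minus = closure()
        for p, z in zip(params, zs):                  # restore theta
            p.add_(z, alpha=eps)
        g = (loss_plus - loss_minus) / (2.0 * eps)    # scalar directional estimate
        for p, z in zip(params, zs):                  # theta <- theta - lr * g * z
            p.add_(z, alpha=-lr * g)

def train_blockwise_zo(model_blocks, data_loader, loss_fn, epochs_per_block=1):
    for i, block in enumerate(model_blocks):
        for b in model_blocks:                        # freeze every block,
            b.requires_grad_(False)
        block.requires_grad_(True)                    # unfreeze the current one
        for m in block.modules():                     # and reinitialize it
            if hasattr(m, "reset_parameters"):
                m.reset_parameters()
        params = list(block.parameters())
        for _ in range(epochs_per_block):
            for inputs, targets in data_loader:
                def closure():
                    out = inputs
                    for b in model_blocks[: i + 1]:   # forward through the already
                        out = b(out)                  # trained blocks and the current one
                    return loss_fn(out, targets).item()
                zo_spsa_step(params, closure)

In this two-point form each step needs only two forward passes and stores no activations, which is where the memory savings of ZO come from.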
5 DISCUSSION AND FUTURE WORK
In this work, we explored several new avenues for pre-training LMs with ZO, such as variance reduction and working on a reduced-dimension problem. Low memory cost is a key aspect of ZO training, which makes it particularly relevant for on-device training and other memory-limited settings. By reducing memory requirements, ZO training allows larger models to be trained on the same hardware, at the cost of longer training times. Future investigations include scaling up blockwise ZO pre-training to larger models, operating on small chunks of parameters with a moderate query budget q; a sketch of the corresponding q-query gradient estimate is given below. Training will take longer than with FO, but depending on the goal, the memory savings may compensate. Finally, this work focused on untuned ZO training; features such as momentum could improve its efficiency and robustness to scaling.
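As an illustration of this query budget, the sketch below computes a standard q-query two-point ZO gradient estimate; loss_at, q and eps are hypothetical placeholders rather than names from this work.

# Sketch of a q-query two-point ZO gradient estimate (average of q directional
# finite differences). loss_at is a hypothetical callable returning the loss at a
# given flat parameter vector; q and eps are illustrative values.
import torch

def zo_gradient(theta, loss_at, q=8, eps=1e-3):
    grad = torch.zeros_like(theta)
    for _ in range(q):
        z = torch.randn_like(theta)                   # random direction
        delta = loss_at(theta + eps * z) - loss_at(theta - eps * z)
        grad += (delta / (2.0 * eps)) * z             # directional estimate along z
    return grad / q                                   # average over q queries

Each update then costs 2q forward passes and stores no activations.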
6 CONCLUSION
This exploratory work unveiled some previously unknown behaviour of ZO optimization for pre-training LMs. We established the connection between SO and FO by studying the optimal learning rate. We also provided a recipe for a successful application of ZO to pre-training. First, reducing the dimension of the problem is key to the success of ZO for pre-training LMs, in the sense that vanilla ZO converges to vanilla FO. Second, the high variance of ZO is not the disadvantage it is often thought to be in the community, but rather an asset during pre-training; as a consequence, artificially reducing the variance leads to a higher loss value. We therefore proposed to pre-train LMs using ZO optimization on a reduced-dimension space, such as blocks of parameters, since this is the key to successful pre-training.