Authors: Nathan Allaire (1); Mahsa Ghazvini Nejad (2); Sébastien Le Digabel (1) and Vahid Partovi Nia (2)
Affiliations: (1) GERAD, Polytechnique Montréal, Montréal, Canada; (2) Noah’s Ark Lab, Montréal, Canada
Keyword(s):
Backpropagation, Deep Learning, Language Models, Stochastic Gradient Descent, Transformer Architecture, Pretraining.
Abstract:
The physical memory required for training Large Language Models (LLMs) grows with the model size and is limited by the GPU memory. In particular, back-propagation, which requires the computation of first-order derivatives, adds to this memory overhead. Training extremely large language models with memory-efficient algorithms remains a challenge with both theoretical and practical implications. Back-propagation-free training algorithms, also known as zeroth-order methods, have recently been examined to address this challenge. Their usefulness has been demonstrated in the fine-tuning of language models. However, so far there has been no study of language model pretraining using zeroth-order optimization, where the memory constraint manifests itself more severely. We theoretically build the connection between second-order, first-order, and zeroth-order methods. We then apply zeroth-order optimization to the pretraining of light-weight language models and discuss why it cannot be readily applied. In particular, we show that the curse of dimensionality is the main obstacle, and we pave the way towards modifications of zeroth-order methods for pretraining such models.
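To make the back-propagation-free idea concrete, the following is a minimal sketch (not the authors' method) of a two-point, SPSA-style zeroth-order gradient estimate: the loss is probed twice along a random direction, so no computation graph or stored activations are needed. The names `loss_fn`, `params`, and the smoothing scale `mu` are illustrative assumptions.

```python
# Hypothetical sketch of a two-point zeroth-order gradient estimator;
# not the paper's implementation, only the generic technique it studies.
import numpy as np

def zo_gradient_estimate(loss_fn, params, mu=1e-3, rng=None):
    """Estimate the gradient of loss_fn at params from two forward evaluations."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(params.shape)        # random perturbation direction
    loss_plus = loss_fn(params + mu * z)         # forward pass at theta + mu*z
    loss_minus = loss_fn(params - mu * z)        # forward pass at theta - mu*z
    return (loss_plus - loss_minus) / (2.0 * mu) * z

# Toy usage: one zeroth-order SGD step on a quadratic loss.
if __name__ == "__main__":
    loss = lambda theta: float(np.sum(theta ** 2))
    theta = np.ones(10)
    grad_hat = zo_gradient_estimate(loss, theta)
    theta -= 0.1 * grad_hat                      # plain SGD update with the estimate
```

The variance of such an estimator grows with the number of parameters being perturbed, which is one way the curse of dimensionality discussed in the abstract shows up in practice.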