
very competitive performance for multi-layer perceptrons, graph neural networks, encoder-only, and encoder-decoder architectures when compared to 16-bit and 32-bit precision. Specifically, we find that b1.58 training works well beyond language models, demonstrating competitive performance in bag-of-words MLPs for text classification and graph neural networks for node classification. Median quantization appears on par with or better than mean quantization in most real-world scenarios. We further analyze the “BitNet Scaling Law” for encoder-only models, showing that 1.58-bit models match the training performance of standard-precision models when the hidden size is twice as large, in line with similar observations for decoder-only models. For encoder-decoder models, we find that no such scaling law applies, as b1.58 consistently performs worse than 16-bit. We encourage future work to investigate the challenges of b1.58 quantization in encoder-decoder architectures in more depth. Finally, 1.58-bit quantization-aware training appears to have a regularization effect that aids generalization; more research is needed to investigate this effect further.
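To make the comparison between the two scaling choices concrete, the following minimal NumPy sketch quantizes a weight tensor to {-1, 0, +1} using either the mean or the median of the absolute weights as the per-tensor scale. The helper name ternary_quantize, the epsilon value, and the random example weights are illustrative choices only and do not reproduce the exact quantization-aware training setup used in our experiments.

import numpy as np

def ternary_quantize(w: np.ndarray, scale_fn=np.mean) -> np.ndarray:
    """Illustrative b1.58 (ternary) weight quantization.

    scale_fn selects the per-tensor scale: np.mean corresponds to
    absmean-style scaling, np.median to the median variant.
    """
    eps = 1e-8
    gamma = scale_fn(np.abs(w)) + eps            # per-tensor scale
    w_q = np.clip(np.round(w / gamma), -1, 1)    # ternary values in {-1, 0, +1}
    return w_q * gamma                           # rescaled for the forward pass

# Example: compare the two scaling choices on random weights.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
w_mean = ternary_quantize(w, scale_fn=np.mean)      # mean-based scale
w_median = ternary_quantize(w, scale_fn=np.median)  # median-based scale

For roughly Gaussian weights, the median of the absolute values is smaller than their mean, so the median-based scale rounds fewer weights to zero; this offers one way to reason about the behavioural differences between the two variants.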
ACKNOWLEDGEMENTS
We are grateful to the Danish Foundation Models (DFM) project for access to data for low-resource language modelling, and to SDU UCloud and the EuroHPC Leonardo Booster for providing computational resources.
REFERENCES
Ashkboos, S., Croci, M. L., do Nascimento, M. G., Hoefler, T., and Hensman, J. (2024). SliceGPT: Compress large language models by deleting rows and columns.

Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

Bal, M., Jiang, Y., and Sengupta, A. (2024). Exploring extreme quantization in spiking language models. arXiv preprint arXiv:2405.02543.

Bengio, Y., Léonard, N., and Courville, A. C. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R. B., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N. S., Chen, A. S., Creel, K., Davis, J. Q., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L. E., Goel, K., Goodman, N. D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D. E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., Kuditipudi, R., and et al. (2021). On the opportunities and risks of foundation models. CoRR, abs/2108.07258.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems 33.

Chen, A., Shwartz-Ziv, R., Cho, K., Leavitt, M. L., and Saphra, N. (2024). Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in MLMs. In The Twelfth International Conference on Learning Representations.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. (2023). GPTQ: Accurate post-training quantization for generative pre-trained transformers.

Galke, L. and Scherp, A. (2022). Bag-of-words vs. graph vs. sequence in text classification: Questioning the necessity of text-graphs and the surprising strength of a wide MLP. In Muresan, S., Nakov, P., and Villavicencio, A., editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4038–4051, Dublin, Ireland. Association for Computational Linguistics.

Geiping, J. and Goldstein, T. (2023). Cramming: Training a language model on a single GPU in one day. In International Conference on Machine Learning, pages 11117–11143. PMLR.

Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K. R., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J. D., Muennighoff, N., Naik, A., Nam, C., Peters, M. E., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Strubell, E., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Zettlemoyer, L., Dodge, J., Lo, K., Soldaini, L., Smith, N. A., and Hajishirzi, H. (2024). OLMo: Accelerating the science of language models. arXiv preprint.

He, S., Sun, G., Shen, Z., and Li, A. (2024). What matters in transformers? Not all attention is needed.
Kingma, D. P. and Ba, J. (2015). Adam: A method for