
Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 30318–30332. Curran Associates, Inc.
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2024a). QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36.
Dettmers, T., Svirschevski, R. A., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. (2024b). SpQR: A sparse-quantized representation for near-lossless LLM weight compression. In The Twelfth International Conference on Learning Representations.
Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., and Alistarh, D. (2024). Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118.
Frantar, E. and Alistarh, D. (2022). Optimal brain compression: A framework for accurate post-training quantization and pruning. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 4475–4488. Curran Associates, Inc.
Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. (2023). OPTQ: Accurate quantization for generative pre-trained transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. (2020). The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N., et al. (2021). A framework for few-shot language model evaluation. Version v0.0.1.
Ghaffari, A., Tahaei, M. S., Tayaranian, M., Asgharian, M., and Partovi Nia, V. (2022). Is integer arithmetic enough for deep learning training? Advances in Neural Information Processing Systems, 35:27402–27413.
Hassibi, B. and Stork, D. (1992). Second order derivatives for network pruning: Optimal brain surgeon. In Hanson, S., Cowan, J., and Giles, C., editors, Advances in Neural Information Processing Systems, volume 5. Morgan-Kaufmann.
Hubara, I., Nahshan, Y., Hanani, Y., Banner, R., and Soudry, D. (2021). Accurate post training quantization with small calibration sets. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 4466–4475. PMLR.
Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S. (2021). BRECQ: Pushing the limit of post-training quantization by block reconstruction. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. (2023). AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978.
Lin, Y., Tang, H., Yang, S., Zhang, Z., Xiao, G., Gan, C., and Han, S. (2024). QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. arXiv preprint arXiv:2405.04532.
Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. (2023). LLM-QAT: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888.
Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
Nagel, M., Amjad, R. A., Van Baalen, M., Louizos, C., and Blankevoort, T. (2020). Up or down? Adaptive rounding for post-training quantization. In Daumé III, H. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 7197–7206. PMLR.
Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. (2023). The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al. (2023). Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. (2021). WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023a). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023b). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. (2023). SmoothQuant: Accurate and efficient post-training quantization for large language models. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J., editors, Proceedings of the 40th International Conference on Machine Learning,