
REFERENCES
Badri, H. and Shaji, A. (2023). Half-quadratic quantization of large machine learning models.
Baji, T. (2018). Evolution of the GPU Device widely used in AI and Massive Parallel Processing. In 2018 IEEE 2nd Electron Devices Technology and Manufacturing Conference (EDTM), pages 7–9.
Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer, L. (2022). 8-bit Optimizers via Block-wise Quantization. arXiv:2110.02861 [cs].
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 [cs].
Dettmers, T. and Zettlemoyer, L. (2023). The case for 4-bit precision: k-bit Inference Scaling Laws. In Proceedings of the 40th International Conference on Machine Learning, pages 7750–7774. PMLR. ISSN: 2640-3498.
Ding, N., Qin, Y., Yang, G., Wei, F., Yang, Z., Su, Y., Hu, S., Chen, Y., Chan, C.-M., Chen, W., Yi, J., Zhao, W., Wang, X., Liu, Z., Zheng, H.-T., Chen, J., Liu, Y., Tang, J., Li, J., and Sun, M. (2023). Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235. Publisher: Nature Publishing Group.
Frantar, E. and Alistarh, D. (2022). Optimal brain compression: A framework for accurate post-training quantization and pruning. Advances in Neural Information Processing Systems, 35:4475–4488.
Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323 [cs].
Guo, H., Greengard, P., Xing, E. P., and Kim, Y. (2024). LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning. arXiv:2311.12023 [cs].
Gurobi Optimization, LLC (2023). Gurobi Optimizer Reference Manual.
Huangfu, Q. and Hall, J. A. J. (2018). Parallelizing the dual revised simplex method. Mathematical Programming Computation, 10(1):119–142.
huggingface (2024). Perplexity of fixed-length models.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs, stat].
Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978 [cs].
Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer Sentinel Mixture Models. arXiv:1609.07843 [cs].
Nagel, M., Baalen, M. v., Blankevoort, T., and Welling, M. (2019). Data-Free Quantization Through Weight Equalization and Bias Correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1325–1334.
Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., van Baalen, M., and Blankevoort, T. (2021). A White Paper on Neural Network Quantization. arXiv:2106.08295 [cs].
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67.
Reddi, V. J., Cheng, C., Kanter, D., Mattson, P., et al. (2020). MLPerf Inference Benchmark. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 446–459.
Touvron, H., Martin, L., Stone, K., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs].
APPENDIX
Experimental Procedure
The results presented in the previous sections were produced by experiments on models of the Llama family. These experiments comprise quantization and perplexity evaluation.
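Perplexity for such experiments is commonly measured with the sliding-window scheme described in the Hugging Face guide on perplexity of fixed-length models (huggingface, 2024). The sketch below illustrates that scheme; the model identifier, evaluation corpus (WikiText-2), window length, and stride are example choices, not necessarily the exact settings of our runs.

```python
# Illustrative sliding-window perplexity evaluation (after the Hugging Face
# "Perplexity of fixed-length models" guide). Model, corpus, window length,
# and stride are example choices, not our exact experiment settings.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"          # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.eval()

# Concatenate the WikiText-2 test split into one long token stream.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length, stride = 2048, 512                 # example window and stride
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                   # score only unseen tokens
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100            # mask the context tokens
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)                # approximate summed NLL
    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
print(f"Perplexity: {ppl.item():.2f}")
```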
A specialized harness tool named lm-quant-toolkit facilitates quantizing and evaluating language and vision transformer models with the proposed MXQ method as well as with baselines such as HQQ, AWQ, and GPTQ. The harness is designed for long-running experiments: it tracks experiment status and automatically resumes an interrupted experiment from the last failed point; it records experiment duration, GPU memory consumption, and key metrics such as perplexity and Open LLM Leaderboard scores; and it consolidates the outputs of sub-tasks into an experiment dataset in .csv format for further analysis and reporting.
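As a rough illustration of this workflow (placeholder names throughout; this is not the actual lm-quant-toolkit API), a resumable experiment driver can skip tasks whose result files already exist and then merge the per-task results into a single .csv dataset:

```python
# Minimal sketch of a resumable experiment harness: run a grid of
# quantization/evaluation tasks, skip tasks that already produced a result
# file (so an interrupted run resumes from the last failed point), and
# consolidate per-task results into one .csv dataset. All function, file,
# and directory names are hypothetical.
import csv
import json
import time
from pathlib import Path

RESULTS_DIR = Path("results")                   # assumed output layout
RESULTS_DIR.mkdir(exist_ok=True)

def run_task(model_id: str, method: str, bits: float) -> dict:
    """Placeholder for one quantize-then-evaluate task."""
    start = time.time()
    # ... quantize `model_id` with `method` at `bits` bits, then evaluate ...
    perplexity = float("nan")                   # stand-in for the real metric
    return {"model": model_id, "method": method, "bits": bits,
            "ppl": perplexity, "duration_s": round(time.time() - start, 1)}

def run_all(models, methods, bit_budgets):
    for model_id in models:
        for method in methods:
            for bits in bit_budgets:
                name = f"{model_id.split('/')[-1]}_{method}_{bits}.json"
                out = RESULTS_DIR / name
                if out.exists():                # already finished: resume past it
                    continue
                out.write_text(json.dumps(run_task(model_id, method, bits)))

def consolidate(csv_path: str = "experiments.csv"):
    """Merge all per-task result files into one experiment dataset."""
    rows = [json.loads(p.read_text()) for p in sorted(RESULTS_DIR.glob("*.json"))]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

run_all(["meta-llama/Llama-2-7b-hf"], ["mxq", "hqq", "awq", "gptq"], [4.25, 3.25])
consolidate()
```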
The quantization and evaluation of LLMs require substantial computational and storage resources. The hardware and software configurations of our experiment environment are presented in Table 3.
The patched software packages need to be cloned from GitHub, as noted under Table 3, and installed with pip's editable install mode, since the patches have not yet been merged upstream. Additional Python libraries are installed automatically when lm-quant-toolkit is installed. For the complete list of dependencies, please review the pyproject.toml file of the lm-quant-toolkit project.
The three Llama models employed in this experiment, published on Hugging Face by Meta, are Llama-2-7B (meta-llama/Llama-2-7b-hf), Llama-2-13B (meta-llama/Llama-2-13b-hf) and