Abstract
Large language models (LLMs) have demonstrated remarkable performance across a diverse range of tasks, including natural language understanding and text generation. However, their vast number of parameters and high computational demands impede widespread deployment. Quantization is one of the most effective approaches to reducing the inference and deployment costs of LLMs. Existing W8A8 quantization methods suffer significant performance degradation after quantization due to the emergence of outliers in activations. In this paper, we put forward an Automatic Layer-by-layer Mixed-precision Quantization (ALMP) algorithm. We carefully construct a dataset based on the distribution characteristics of activations and then train a high-precision prediction model. Based on this model, ALMP determines whether a specific layer should be quantized, thereby achieving mixed-precision quantization with INT8 and FP16. Comprehensive experimental results demonstrate that ALMP yields speedups of 1.4× and 1.2× in the prefill and decoding phases, respectively. Notably, for models with over 70B parameters, the quantization-induced accuracy degradation is less than 1%.
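The per-layer decision described in the abstract can be illustrated with a minimal sketch: summarize each layer's calibration activations with simple statistics, train a small classifier offline on layers whose post-quantization accuracy loss is known, then use it to assign INT8 or FP16 to each layer of a new model. The feature set (max-to-mean ratio and kurtosis), the logistic-regression predictor, the toy data, and all function names below are illustrative assumptions, not the features or model used in the paper.

```python
# Minimal sketch of a layer-wise precision predictor (assumptions, not the authors' code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def activation_features(act: np.ndarray) -> np.ndarray:
    """Summarize one layer's calibration activations (tokens x channels)."""
    abs_act = np.abs(act)
    outlier_ratio = abs_act.max() / (abs_act.mean() + 1e-8)      # severity of activation outliers
    centered = act - act.mean()
    kurtosis = (centered ** 4).mean() / (centered.var() ** 2 + 1e-8)
    return np.array([outlier_ratio, kurtosis])

def make_toy_acts(has_outliers: bool, rng) -> np.ndarray:
    """Toy calibration activations; a few channels carry large outliers."""
    act = rng.normal(size=(256, 128))
    if has_outliers:
        act[:, :4] *= 50.0
    return act

# --- offline: train the predictor on layers with known quantization sensitivity ---
rng = np.random.default_rng(0)
layers = [make_toy_acts(has_outliers=(i % 3 == 0), rng=rng) for i in range(30)]
X = np.stack([activation_features(a) for a in layers])
y = np.array([0 if i % 3 == 0 else 1 for i in range(30)])        # 1 = INT8 is safe, 0 = keep FP16
predictor = LogisticRegression(max_iter=1000).fit(X, y)

# --- online: assign a precision to each layer of a new model ---
def assign_precisions(layer_acts):
    feats = np.stack([activation_features(a) for a in layer_acts])
    return ["INT8" if safe else "FP16" for safe in predictor.predict(feats)]

print(assign_precisions(layers[:6]))   # e.g. ['FP16', 'INT8', 'INT8', 'FP16', 'INT8', 'INT8']
```

In this sketch, layers whose activations show strong outliers are kept in FP16, while the rest are quantized to INT8, mirroring the mixed-precision scheme the abstract describes at a high level.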
Acknowledgments
This work was supported by the National Key R&D Program of China (No. 2021ZD0112904).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, J., Liu, H., Li, R., Feng, D., Ding, J., Ding, B. (2025). ALMP: Automatic Layer-By-Layer Mixed-Precision Quantization for Large Language Models. In: Huang, DS., Li, B., Chen, H., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2025. Lecture Notes in Computer Science, vol. 15864. Springer, Singapore. https://doi.org/10.1007/978-981-95-0014-7_13
DOI: https://doi.org/10.1007/978-981-95-0014-7_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-95-0013-0
Online ISBN: 978-981-95-0014-7
eBook Packages: Computer Science, Computer Science (R0)