Abstract
Large language models (LLMs) have demonstrated remarkable performance across a diverse range of tasks, including natural language understanding and text generation. However, their vast number of parameters and high computational demands impede widespread deployment. Quantization is one of the most effective approaches to reducing the inference and deployment costs of LLMs. Existing W8A8 quantization methods suffer significant performance degradation after quantization due to the emergence of outliers in activations. In this paper, we put forward an Automatic Layer-by-layer Mixed-precision Quantization (ALMP) algorithm. We carefully construct a dataset based on the distribution characteristics of activations and then train a high-precision prediction model. Based on this model, ALMP determines whether a specific layer should be quantized, thereby achieving mixed-precision quantization with INT8 and FP16. Comprehensive experimental results demonstrate that ALMP yields speedups of 1.4× and 1.2× in the prefill and decoding phases, respectively. Notably, for models with over 70B parameters, the quantization-induced accuracy degradation is less than 1%.
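The per-layer decision described in the abstract can be illustrated with a minimal sketch: summarize each layer's calibration activations with simple statistics, train a small classifier offline on layers whose post-quantization accuracy loss is known, then use it to assign INT8 or FP16 to each layer of a new model. The feature set (max-to-mean ratio and kurtosis), the logistic-regression predictor, the toy data, and all function names below are illustrative assumptions, not the features or model used in the paper.

```python
# Minimal sketch of a layer-wise precision predictor (assumptions, not the authors' code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def activation_features(act: np.ndarray) -> np.ndarray:
    """Summarize one layer's calibration activations (tokens x channels)."""
    abs_act = np.abs(act)
    outlier_ratio = abs_act.max() / (abs_act.mean() + 1e-8)      # severity of activation outliers
    centered = act - act.mean()
    kurtosis = (centered ** 4).mean() / (centered.var() ** 2 + 1e-8)
    return np.array([outlier_ratio, kurtosis])

def make_toy_acts(has_outliers: bool, rng) -> np.ndarray:
    """Toy calibration activations; a few channels carry large outliers."""
    act = rng.normal(size=(256, 128))
    if has_outliers:
        act[:, :4] *= 50.0
    return act

# --- offline: train the predictor on layers with known quantization sensitivity ---
rng = np.random.default_rng(0)
layers = [make_toy_acts(has_outliers=(i % 3 == 0), rng=rng) for i in range(30)]
X = np.stack([activation_features(a) for a in layers])
y = np.array([0 if i % 3 == 0 else 1 for i in range(30)])        # 1 = INT8 is safe, 0 = keep FP16
predictor = LogisticRegression(max_iter=1000).fit(X, y)

# --- online: assign a precision to each layer of a new model ---
def assign_precisions(layer_acts):
    feats = np.stack([activation_features(a) for a in layer_acts])
    return ["INT8" if safe else "FP16" for safe in predictor.predict(feats)]

print(assign_precisions(layers[:6]))   # e.g. ['FP16', 'INT8', 'INT8', 'FP16', 'INT8', 'INT8']
```

In this sketch, layers whose activations show strong outliers are kept in FP16, while the rest are quantized to INT8, mirroring the mixed-precision scheme the abstract describes at a high level.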
Acknowledgments
This work was supported by the National Key R&D Program of China (No. 2021ZD0112904).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, J., Liu, H., Li, R., Feng, D., Ding, J., Ding, B. (2025). ALMP: Automatic Layer-By-Layer Mixed-Precision Quantization for Large Language Models. In: Huang, DS., Li, B., Chen, H., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2025. Lecture Notes in Computer Science, vol. 15864. Springer, Singapore. https://doi.org/10.1007/978-981-95-0014-7_13
DOI: https://doi.org/10.1007/978-981-95-0014-7_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-95-0013-0
Online ISBN: 978-981-95-0014-7
eBook Packages: Computer Science, Computer Science (R0)