
ALMP: Automatic Layer-By-Layer Mixed-Precision Quantization for Large Language Models

  • Conference paper
Advanced Intelligent Computing Technology and Applications (ICIC 2025)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15864)


Abstract

Large language models (LLMs) have demonstrated remarkable performance across a diverse range of tasks, including natural language understanding and text generation. However, their vast number of parameters and high computational demands impede widespread deployment. Quantization is one of the most effective approaches to reducing the inference and deployment costs of LLMs. Existing W8A8 quantization methods suffer significant performance degradation after quantization due to the emergence of outliers in activations. In this paper, we propose an Automatic Layer-by-layer Mixed-Precision quantization (ALMP) algorithm. We carefully construct a dataset based on the distribution characteristics of activations and use it to train a high-precision prediction model. Based on this model, ALMP determines whether a specific layer should be quantized, thereby achieving mixed-precision quantization with INT8 and FP16. Comprehensive experiments show that ALMP yields speedups of 1.4× and 1.2× in the prefill and decoding phases, respectively. Notably, for models with over 70B parameters, the quantization-induced accuracy degradation is less than 1%.
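
To make the pipeline concrete, below is a minimal sketch of the layer-by-layer decision procedure the abstract describes. All specifics are illustrative assumptions rather than the authors' implementation: the activation features (outlier ratio, tail heaviness, dispersion), the hand-set threshold standing in for the trained prediction model, and the symmetric per-tensor INT8 scheme are all hypothetical; the paper's actual code is available in the repository linked in the Notes.

```python
# Minimal sketch of ALMP-style layer-by-layer mixed-precision selection.
# Everything here is an illustrative assumption (the activation features,
# the threshold stand-in for the trained prediction model, the per-tensor
# INT8 scheme), not the authors' implementation.
import numpy as np

def activation_features(acts: np.ndarray) -> np.ndarray:
    """Summarize a layer's calibration activations with outlier-focused
    statistics, since activation outliers are what break W8A8."""
    a = np.abs(acts)
    return np.array([
        a.max() / (a.mean() + 1e-8),                           # outlier ratio
        np.quantile(a, 0.999) / (np.quantile(a, 0.5) + 1e-8),  # tail heaviness
        a.std() / (a.mean() + 1e-8),                           # dispersion
    ])

def predict_quantizable(feats: np.ndarray) -> bool:
    """Stand-in for ALMP's trained prediction model: a hand-set threshold
    on the outlier ratio, purely for illustration."""
    return feats[0] < 50.0

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization of a weight matrix."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy model: four linear layers plus calibration activations; one layer
# gets artificial activation outliers so the decision actually varies.
rng = np.random.default_rng(0)
layers = {f"layer{i}": rng.standard_normal((64, 64)) for i in range(4)}
calib_acts = {}
for i, name in enumerate(layers):
    acts = rng.standard_normal((128, 64))
    if i == 2:
        acts[:, 0] *= 200.0      # inject outliers into one channel
    calib_acts[name] = acts

# Per-layer decision: INT8 where the predictor deems it safe, FP16 otherwise.
plan = {}
for name, w in layers.items():
    if predict_quantizable(activation_features(calib_acts[name])):
        plan[name] = ("INT8", *quantize_int8(w))
    else:
        plan[name] = ("FP16", w.astype(np.float16))

print({name: spec[0] for name, spec in plan.items()})
# e.g. {'layer0': 'INT8', 'layer1': 'INT8', 'layer2': 'FP16', 'layer3': 'INT8'}
```

The design point the sketch mirrors is that the INT8/FP16 choice is made per layer from calibration-time activation statistics, so outlier-heavy layers stay in FP16 while the remaining layers are quantized.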


Notes

  1. https://github.com/JieWangNUDT/ALMP.


Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2021ZD0112904).

Author information

Correspondence to Dawei Feng.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wang, J., Liu, H., Li, R., Feng, D., Ding, J., Ding, B. (2025). ALMP: Automatic Layer-By-Layer Mixed-Precision Quantization for Large Language Models. In: Huang, DS., Li, B., Chen, H., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2025. Lecture Notes in Computer Science, vol. 15864. Springer, Singapore. https://doi.org/10.1007/978-981-95-0014-7_13


  • DOI: https://doi.org/10.1007/978-981-95-0014-7_13

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-95-0013-0

  • Online ISBN: 978-981-95-0014-7

  • eBook Packages: Computer Science, Computer Science (R0)
