Abstract
Deploying neural networks on IoT devices unlocks the potential for many innovative applications, but the sheer size and computational cost of deep learning (DL) networks have prevented their widespread adoption. Quantization mitigates this issue by reducing model precision, enabling deployment on resource-constrained edge devices. However, at extremely low bit-widths, such as 2-bit and 4-bit, aggressive compression leads to significant accuracy degradation because of the reduced representational capacity of the network. A critical aspect of effective quantization is identifying the range of real (FP32) values that affects model accuracy. To address accuracy loss at sub-byte levels, we introduce Augmented Quantization (AuGQ), a novel granularity technique tailored to low bit-width quantization. AuGQ segments the range of real-valued (FP32) weight and activation distributions into small uniform intervals and applies affine quantization within each interval to enhance accuracy. We evaluated AuGQ with both post-training quantization (PTQ) and quantization-aware training (QAT) methods, achieving accuracy comparable to full-precision (32-bit) DL networks. Our findings demonstrate that AuGQ is agnostic to the training pipeline and to batch normalization folding, distinguishing it from conventional quantization techniques. Furthermore, when integrated into state-of-the-art PTQ algorithms, AuGQ requires only 64 training samples for fine-tuning, which is \(16\times \) fewer than traditional methods. This reduction makes high-accuracy quantization practical at sub-byte bit-widths, suiting it to real-world IoT deployments and enhancing computational efficiency on edge devices.
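The idea of splitting an FP32 range into uniform intervals and applying affine quantization within each interval can be illustrated with a short sketch. The following Python example is a minimal, hypothetical illustration of that general idea only; it is not the authors' implementation, and the function name, parameters, and interval handling are assumptions made for clarity.

```python
import numpy as np

def interval_affine_quantize(x, n_intervals=4, n_bits=4):
    """Sketch: split the FP32 range of `x` into uniform intervals and apply
    affine (scale / zero-point) quantization within each interval.
    All names and parameters are hypothetical, not the paper's API."""
    x = np.asarray(x, dtype=np.float32)
    qmin, qmax = 0, 2 ** n_bits - 1                 # unsigned integer grid per interval
    edges = np.linspace(x.min(), x.max(), n_intervals + 1)
    x_dq = np.empty_like(x)                         # simulated-quantized (dequantized) output
    for i in range(n_intervals):
        lo, hi = edges[i], edges[i + 1]
        # the last interval includes its right edge so every element is covered
        mask = (x >= lo) & (x <= hi) if i == n_intervals - 1 else (x >= lo) & (x < hi)
        if not mask.any():
            continue
        scale = max(hi - lo, 1e-8) / (qmax - qmin)  # affine scale for this interval
        zero_point = qmin - lo / scale              # affine zero-point for this interval
        q = np.clip(np.round(x[mask] / scale + zero_point), qmin, qmax)
        x_dq[mask] = (q - zero_point) * scale       # map back to FP32 for simulation
    return x_dq

# Usage: simulate 4-bit, 4-interval quantization of a weight tensor.
w = np.random.randn(1000).astype(np.float32)
w_q = interval_affine_quantize(w, n_intervals=4, n_bits=4)
print("mean squared quantization error:", np.mean((w - w_q) ** 2))
```

Compared with a single affine mapping over the whole range, per-interval scales and zero-points devote the limited sub-byte grid to narrower slices of the distribution, which is the intuition behind the granularity described in the abstract.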
Code availability
Researchers or other interested parties are welcome to contact the corresponding author for further explanation; the corresponding author may also provide the Python code upon request.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) under Grant RS-2024-00340882 and by the Gachon University research fund under Grant GCU-202404140001.
Author information
Authors and Affiliations
Contributions
Ahmed Mujtaba was responsible for the design and execution of the overall investigation. Wai Kong Lee was responsible for the investigation related to quantization. Byoung Chul Ko, Hyung Jin Chang and Seong Oun Hwang were responsible for data curation, supervision, and the writing and editing of the manuscript. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
Not applicable
Ethics approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Consent to participate
Not applicable
Consent for publication
Not applicable
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mujtaba, A., Lee, W.K., Ko, B.C. et al. AuGQ: Augmented quantization granularity to overcome accuracy degradation for sub-byte quantized deep neural networks. Appl Intell 55, 589 (2025). https://doi.org/10.1007/s10489-025-06495-1