Abstract
Deploying neural networks on IoT devices unlocks the potential for many innovative applications, but the sheer size and computational cost of deep learning (DL) networks have prevented their widespread adoption. Quantization mitigates this issue by reducing model precision, enabling deployment on resource-constrained edge devices. However, at extremely low bit-widths, such as 2-bit and 4-bit, aggressive compression leads to significant accuracy degradation because of the reduced representational capacity of the network. A critical aspect of effective quantization is identifying the range of real (FP32) values that affects model accuracy. To address accuracy loss at sub-byte levels, we introduce Augmented Quantization (AuGQ), a novel granularity technique tailored to low bit-width quantization. AuGQ segments the range of real-valued (FP32) weight and activation distributions into small uniform intervals and applies affine quantization within each interval to enhance accuracy. We evaluated AuGQ with both post-training quantization (PTQ) and quantization-aware training (QAT) methods, achieving accuracy comparable to full-precision (32-bit) DL networks. Our findings demonstrate that AuGQ is agnostic to the training pipeline and to batch normalization folding, distinguishing it from conventional quantization techniques. Furthermore, when integrated into state-of-the-art PTQ algorithms, AuGQ requires only 64 training samples for fine-tuning, which is \(16\times \) fewer than traditional methods. This reduction makes high-accuracy quantization practical at sub-byte bit-widths, suiting it to real-world IoT deployments and enhancing computational efficiency on edge devices.
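The idea of splitting an FP32 range into uniform intervals and applying affine quantization within each interval can be illustrated with a short sketch. The following Python example is a minimal, hypothetical illustration of that general idea only; it is not the authors' implementation, and the function name, parameters, and interval handling are assumptions made for clarity.

```python
import numpy as np

def interval_affine_quantize(x, n_intervals=4, n_bits=4):
    """Sketch: split the FP32 range of `x` into uniform intervals and apply
    affine (scale / zero-point) quantization within each interval.
    All names and parameters are hypothetical, not the paper's API."""
    x = np.asarray(x, dtype=np.float32)
    qmin, qmax = 0, 2 ** n_bits - 1                 # unsigned integer grid per interval
    edges = np.linspace(x.min(), x.max(), n_intervals + 1)
    x_dq = np.empty_like(x)                         # simulated-quantized (dequantized) output
    for i in range(n_intervals):
        lo, hi = edges[i], edges[i + 1]
        # the last interval includes its right edge so every element is covered
        mask = (x >= lo) & (x <= hi) if i == n_intervals - 1 else (x >= lo) & (x < hi)
        if not mask.any():
            continue
        scale = max(hi - lo, 1e-8) / (qmax - qmin)  # affine scale for this interval
        zero_point = qmin - lo / scale              # affine zero-point for this interval
        q = np.clip(np.round(x[mask] / scale + zero_point), qmin, qmax)
        x_dq[mask] = (q - zero_point) * scale       # map back to FP32 for simulation
    return x_dq

# Usage: simulate 4-bit, 4-interval quantization of a weight tensor.
w = np.random.randn(1000).astype(np.float32)
w_q = interval_affine_quantize(w, n_intervals=4, n_bits=4)
print("mean squared quantization error:", np.mean((w - w_q) ** 2))
```

Compared with a single affine mapping over the whole range, per-interval scales and zero-points devote the limited sub-byte grid to narrower slices of the distribution, which is the intuition behind the granularity described in the abstract.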
Code availability
Researchers or other interested parties are welcome to contact the corresponding author for further explanation; the corresponding author may also provide the Python code upon request.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) under Grant RS-2024-00340882 and by the Gachon University research fund under Grant GCU-202404140001.
Author information
Authors and Affiliations
Contributions
Ahmed Mujtaba was responsible for the design and execution of the overall investigation. Wai Kong Lee was responsible for the investigation related to quantization. Byoung Chul Ko, Hyung Jin Chang and Seong Oun Hwang were responsible for data curation, supervision, and the writing and editing of the manuscript. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
Not applicable
Ethics approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Consent to participate
Not applicable
Consent for publication
Not applicable
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mujtaba, A., Lee, W.K., Ko, B.C. et al. AuGQ: Augmented quantization granularity to overcome accuracy degradation for sub-byte quantized deep neural networks. Appl Intell 55, 589 (2025). https://doi.org/10.1007/s10489-025-06495-1