Abstract
In this paper, we propose a generalizable mixed-precision quantization (GMPQ) method for efficient inference. Conventional methods require the dataset used for bitwidth search to be consistent with the one used for model deployment in order to guarantee policy optimality, which leads to heavy search costs on challenging large-scale datasets in realistic applications. In contrast, our GMPQ searches for a mixed-precision quantization policy on only a small amount of data that generalizes to large-scale datasets, so the search cost is significantly reduced without performance degradation. Specifically, we observe that correctly locating network attribution is a general ability for accurate visual analysis across different data distributions. Therefore, besides pursuing higher accuracy and lower model complexity, we preserve attribution rank consistency between quantized models and their full-precision counterparts via capacity-aware attribution imitation for generalizable mixed-precision quantization policy search, where the capacity of the quantized networks is taken into account so that it is fully utilized without insufficiency. However, since slight attribution noise is amplified into significant rank errors by the discrete ranking operation, directly mimicking the attribution ranks of full-precision models can hinder quantized networks from locating attribution correctly. To address this, we further present a robust generalizable mixed-precision quantization (R-GMPQ) method that smooths the attribution to alleviate rank errors via hierarchical attribution partitioning, which efficiently partitions high-resolution attribution pixels into groups and assigns the same attribution value to all pixels within a group. Moreover, we propose dynamic capacity-aware attribution imitation to adjust the concentration degree of the attribution according to sample hardness, so that sufficient model capacity is fully utilized for each image. Extensive experiments on image classification and object detection show that our GMPQ and R-GMPQ achieve competitive accuracy-complexity trade-offs with significantly reduced search cost compared with state-of-the-art mixed-precision networks.
Data Availability
Datasets used in this work are all publicly available: 1. ImageNet (Deng et al., 2009): https://www.image-net.org. 2. Pascal VOC (Everingham et al., 2010): http://host.robots.ox.ac.uk/pascal/VOC. 3. COCO (Lin et al., 2014): https://cocodataset.org. 4. CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009): https://www.cs.toronto.edu/kriz/cifar.html. 5. Cars (Krause et al., 2013): https://www.kaggle.com/datasets/jessicali9530/stanford-cars-dataset. 6. Flowers (Nilsback and Zisserman, 2008): https://www.robots.ox.ac.uk/vgg/data/flowers. 7. Aircraft (Maji et al., 2013): https://www.robots.ox.ac.uk/vgg/data/fgvc-aircraft. 8. Pets (Parkhi et al., 2012): https://www.robots.ox.ac.uk/~vgg/data/pets. 9. Food (Bossard et al., 2014): https://www.kaggle.com/datasets/kmader/food41.
References
Bell, S., Zitnick, C. L., Bala, K. and Girshick, R. (2016) Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, pages 2874–2883.
Bethge, J., Bartz, C., Yang, H., Chen, Y. and Meinel, C. (2020), Meliusnet: Can binary neural networks achieve mobilenet-level accuracy? arXiv preprint arXiv:2001.05936.
Bossard, L., Guillaumin, M. and Van Gool, L. (2014), Food-101: Mining discriminative components with random forests. In ECCV, pages 446–461.
Cai, Z. and Vasconcelos, N. (2020), Rethinking differentiable search for mixed-precision neural networks. In CVPR, pages 2349–2358.
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R. and Bengio, Y. (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K. and Fei-Fei, L. (2009) Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255.
Deng, J., Guo, J., Xue, N. and Zafeiriou, S. (2019), Arcface: Additive angular margin loss for deep face recognition. In CVPR, pages 4690–4699.
Denton, E., Zaremba, W., Bruna, J., LeCun, Y., and Fergus, R. (2014), Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736.
Dong, Y., Pang, T., Su, H. and Zhu, J. (2019a) Evading defenses to transferable adversarial examples by translation-invariant attacks. In CVPR, pages 4312–4321.
Dong, Z., Yao, Z., Cai, Y., Arfeen, D., Gholami, A., Mahoney, M. W. and Keutzer, K. (2019b) Hawq-v2: Hessian aware trace-weighted quantization of neural networks. arXiv preprint arXiv:1911.03852.
Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. (2019c) Hawq: Hessian aware quantization of neural networks with mixed-precision. In ICCV, pages 293–302.
Du, B., Xiong, W., Wu, J., Zhang, L., Zhang, L. and Tao, D. (2016). Stacked convolutional denoising auto-encoders for feature representation. IEEE Transactions on Cybernetics, 47(4), 1017–1027.
Erhan, D., Bengio, Y., Courville, A. and Vincent, P. (2009). Visualizing higher-layer features of a deep network. University of Montreal, 1341(3), 1.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.
Gao, J., Ma, X. and Xu, X. (2023). Learning transferable conceptual prototypes for interpretable unsupervised domain adaptation. arXiv preprint arXiv:2310.08071.
Gong, R., Liu, X., Jiang, S., Li, T., Hu, P., Lin, J., Yu, F. and Yan, J. (2019) Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In ICCV, pages 4852–4861.
Habi, H. V., Jennings, R. H. and Netzer, A. (2020), Hmq: Hardware friendly mixed precision quantization block for cnns. arXiv preprint arXiv:2007.09952.
He, K., Zhang, X., Ren, S., and Sun, J. (2016) Deep residual learning for image recognition. In CVPR, pages 770–778.
He, K., Gkioxari, G., Dollár, P. and Girshick, R. (2017a). Mask R-CNN. In ICCV, pages 2961–2969.
He, Y., Zhang, X. and Sun, J. (2017b). Channel pruning for accelerating very deep neural networks. In ICCV, pages 1389–1397.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M. and Adam, H. (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Huang, G., Liu, Z., Van Der Maaten, L. and Weinberger, K. Q. (2017). Densely connected convolutional networks. In CVPR, pages 4700–4708.
Huang, W., Sun, F., et al. (2016). Building feature space of extreme learning machine with sparse denoising stacked-autoencoder. Neurocomputing, 174, 60–71.
Huang, X., Shen, Z., Li, S., Liu, Z., Hu, X., Wicaksana, J., Xing, E. and Cheng, K. T. (2022) Sdq: Stochastic differentiable quantization with mixed precision. In ICML, pages 9295–9309.
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2016), Binarized neural networks. In NIPS, pages 4114–4122.
Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J. and Keutzer, K. (2016) Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
Kingma, D. P. and Ba, J. (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krause, J., Stark, M., Deng, J. and Fei-Fei, L. (2013) 3d object representations for fine-grained categorization. In ICCVW, pages 554–561.
Krizhevsky, A. and Hinton, G. (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Li, R., Wang, Y., Liang, F., Qin, H., Yan, J., and Fan, R. (2019) Fully quantized network for object detection. In CVPR, pages 2810–2819.
Li, Y., Gu, S., Mayer, C., Van Gool, L., and Timofte, R. (2020a) Group sparsity: The hinge between filter pruning and decomposition for network compression. In CVPR, pages 8018–8027.
Li, Y., Dong, X. and Wang, W. (2020b). Additive powers-of-two quantization: A non-uniform discretization for neural networks. In ICLR.
Lin, J., Rao, Y., Lu, J., and Zhou, J. (2017) Runtime neural pruning. In NIPS, pages 2178–2188.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C. L. (2014) Microsoft coco: Common objects in context. In ECCV, pages 740–755.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. -Y. and Berg, A. C. (2016) Ssd: Single shot multibox detector. In ECCV, pages 21–37.
Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B. and Song, L. (2017) Sphereface: Deep hypersphere embedding for face recognition. In CVPR, pages 212–220.
Liu, Z., Wu, B., Luo, W., Yang, X., Liu, W., and Cheng, K. T. (2018) Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In ECCV, pages 722–737.
Louizos, C., Reisser, M., Blankevoort, T., Gavves, E., and Welling, M. (2018) Relaxed quantization for discretized neural networks. arXiv preprint arXiv:1810.01875.
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. (2013), Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.
Molchanov, P., Mallya, A., Tyree, S., Frosio, I. and Kautz, J. (2019) Importance estimation for neural network pruning. In CVPR, pages 11264–11272.
Nilsback, M. E., and Zisserman, A. (2008). Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729.
Park, G., Yang, J. Y., Hwang, S. J., and Yang, E. (2020) Attribution preservation in network compression for reliable network interpretation. arXiv preprint arXiv:2010.15054.
Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. V. (2012). Cats and dogs. In CVPR, pages 3498–3505.
Qin, H., Gong, R., Liu, X., Shen, M., Wei, Z., Yu, F., and Song, J. (2020) Forward and backward information retention for accurate binary neural networks. In CVPR, pages 2250–2259.
Qin, Z., Li, Z., Zhang, Z., Bao, Y., Yu, G., Peng, Y., and Sun, J. (2019), Thundernet: Towards real-time generic object detection on mobile devices. In ICCV, pages 6718–6727.
Qu, Z., Zhou, Z., Cheng, Y. and Thiele, L. (2020) Adaptive loss-aware quantization for multi-bit networks. In CVPR, pages 7988–7997.
Rastegari, M., Ordonez, V., Redmon, J. and Farhadi, A. (2016) Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542.
Ren, S., He, K., Girshick, R. and Sun, J. (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. and Chen, L.-C. (2018) Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520.
Selvaraju, R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. and Batra, D. (2017) Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626.
Simonyan, K. and Zisserman, A. (2014), Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Simonyan, K., Vedaldi, A., and Zisserman, A. (2013), Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
Springenberg, J. T., Dosovitskiy, A., Brox, T. and Riedmiller, M. (2014) Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.
Sundararajan, M., Taly, A. and Yan, Q. (2017) Axiomatic attribution for deep networks. In ICML, pages 3319–3328.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. and Jégou, H. (2021) Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357.
Uhlich, S., Mauch, L., Yoshiyama, K., Cardinaux, F., Garcia, J. A., Tiedemann, S., Kemp, T. and Nakamura, A. (2019) Differentiable quantization of deep neural networks. arXiv preprint arXiv:1905.11452.
van Baalen, M., Louizos, C., Nagel, M., Amjad, R. A., Wang, Y., Blankevoort, T., and Welling, M. (2020) Bayesian bits: Unifying quantization and pruning. arXiv preprint arXiv:2005.07093.
Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z. and Liu, W. (2018) Cosface: Large margin cosine loss for deep face recognition. In CVPR, pages 5265–5274.
Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. (2019a), Haq: Hardware-aware automated quantization with mixed precision. In CVPR, pages 8612–8620.
Wang, P., Chen, Q., He, X. and Cheng, J. (2020a), Towards accurate post-training network quantization via bit-split and stitching. In ICML, pages 9847–9856.
Wang, T., Wang, K., Cai, H., Lin, J., Liu, Z., Wang, H., Lin, Y. and Han, S. (2020b). Apq: Joint search for network architecture, pruning and quantization policy. In CVPR, pages 2078–2087.
Wang, Y., Lu, Y. and Blankevoort, T. (2020c), Differentiable joint pruning and quantization for hardware efficiency. In ECCV, pages 259–277.
Wang, Z., Guo, H., Zhang, Z., Liu, W., Qin, Z. and Ren, K. (2021), Feature importance-aware transferable adversarial attacks. In ICCV, pages 7639–7648.
Wang, Z., Lu, J., Tao, C., Zhou, J. and Tian, Q. (2019b), Learning channel-wise interactions for binary convolutional neural networks. In CVPR, pages 568–577.
Wu, W., Su, Y., Chen, X., Zhao, S., King, I., Lyu, M. R. and Tai, Y. W. (2020), Boosting the transferability of adversarial samples via attention. In CVPR, pages 1161–1170.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A. L. and He, K. (2019) Feature denoising for improving adversarial robustness. In CVPR, pages 501–509.
Yang, H., Gui, S., Zhu, Y. and Liu, J. (2020), Automatic neural network compression by sparsity-quantization joint learning: A constrained optimization-based approach. In CVPR, pages 2178–2188.
Yu, H., Han, Q., Li, J., Shi, J., Cheng, G. and Fan, B. (2020), Search what you want: Barrier penalty NAS for mixed precision quantization. In ECCV, pages 1–16.
Yu, X., Liu, T., Wang, X., and Tao, D. (2017), On compressing deep models by low rank and sparse decomposition. In CVPR, pages 7370–7379.
Zagoruyko, S., and Komodakis, N. (2016), Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928.
Zhang, J., Bargal, S. A., Lin, Z., Brandt, J., Shen, X., & Sclaroff, S. (2018). Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10), 1084–1102.
Zhao, R., Hu, Y., Dotzel, J., De Sa, C. and Zhang, Z. (2019), Improving neural network quantization without retraining using outlier channel splitting. In ICML, pages 7543–7552.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A. (2016), Learning deep features for discriminative localization. In CVPR, pages 2921–2929.
Zhu, C., Han, S., Mao, H. and Dally, W. J. (2016), Trained ternary quantization. arXiv preprint arXiv:1612.01064.
Zunino, A., Bargal, S. A., Volpi, R., Sameki, M., Zhang, J., Sclaroff, S., Murino, V. and Saenko, K. (2021), Explainable deep classification models for domain generalization. In CVPR, pages 3233–3242.
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China under Grant 2022ZD0114903, and in part by the National Natural Science Foundation of China under Grant 2376032.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals.
Additional information
Communicated by Bumsub Ham.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
A. Visualization of Optimal Quantization Policy
We searched the quantization policy on different small datasets with various architectures via the presented GMPQ. Figure 11 shows the optimal bitwidth allocation for the weights and activations of each layer, where ResNet18 was compressed and the policy was searched on various small datasets including CIFAR-10 (Krizhevsky et al., 2009), Cars (Krause et al., 2013), Flowers (Nilsback and Zisserman, 2008), Aircraft (Maji et al., 2013), Pets (Parkhi et al., 2012) and Food (Bossard et al., 2014). Figure 12 depicts the quantization strategies searched on CIFAR-10 with the MobileNet-V2 (Sandler et al., 2018), ResNet18 (He et al., 2016) and ResNet50 architectures. The BOPs limit was set to 7.4G, 15.3G and 30.7G for MobileNet-V2, ResNet18 and ResNet50, respectively.
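For reference, BOPs are commonly computed as the sum over layers of the multiply-accumulate count weighted by the product of weight and activation bitwidths. The sketch below illustrates this accounting; the helper name and per-layer MAC counts are hypothetical, not taken from the paper.

```python
import numpy as np

def estimate_bops(macs_per_layer, weight_bits, act_bits):
    """Estimate bit-operations (BOPs) as sum_l MACs_l * b_w^l * b_a^l."""
    macs = np.asarray(macs_per_layer, dtype=np.float64)
    bw = np.asarray(weight_bits, dtype=np.float64)
    ba = np.asarray(act_bits, dtype=np.float64)
    return float((macs * bw * ba).sum())

# Hypothetical three-layer network under a uniform 4-bit policy.
print(estimate_bops([2e8, 3e8, 1e8], [4, 4, 4], [4, 4, 4]) / 1e9, "GBOPs")
```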
For quantization policies searched on different small datasets, the optimal bitwidth allocation varies significantly although the complexities of the obtained models are close to each other. Activations are usually assigned higher bitwidths than weights in most quantization policies, indicating that classification performance and attribution rank consistency are more sensitive to activation quantization than to weight quantization. The bitwidth distributions of weights and activations obtained on Cars, Aircraft, Food and CIFAR-10 are similar, and these policies also achieve better generalization performance on large-scale datasets than those searched on Flowers and Pets. For the Flowers and Pets datasets, the optimal quantization policy is close to uniform quantization in fixed-precision networks, which leads to worse accuracy-complexity trade-offs due to the lack of generalization ability.
For quantization policies searched for different architectures, we observe that Layers 7, 12 and 17 in ResNet18, which contain residual connections, require larger bitwidths than their corresponding regular branches. Since MobileNet-V2 is very compact, it receives higher bitwidth allocations than the other network architectures. In contrast, ResNet50 is compressed with lower bitwidths due to its significant redundancy compared with MobileNet-V2.
B. Accuracy of Quantization Policy Searched on Different Small Datasets
In this section, we show the top-1 accuracy and BOPs on ImageNet of our GMPQ with the quantization policy searched on different small datasets including CIFAR-10, Cars, Flowers, Aircraft, Pets and Food. The applied network architectures include MobileNet-V2, ResNet-18 and ResNet-50, and more accuracy-complexity trade-offs for ResNet-18 are demonstrated in Fig. 10b. Table 9 reports the accuracy and complexity on ImageNet, together with those of the full-precision networks. The search cost is significantly reduced across various architectures compared with the conventional mixed-precision quantization methods shown in Table 5, while the accuracy degrades only slightly. The quantization policy searched on CIFAR-10 achieves the highest accuracy, because the object-category gap between CIFAR-10 and ImageNet is the smallest among the candidate datasets. Although the discrepancy in object class distribution between ImageNet and small datasets such as Aircraft is non-negligible, the accuracy of the mixed-precision networks is still comparable with the state-of-the-art approaches shown in Table 5 due to attribution rank preservation.
C. Explanation of the Generalization Risk (9)
As visualized in Fig. 3 of the manuscript, quantized networks with lower capacity tend to acquire more concentrated attribution although the attribution rank remains similar, since these networks focus on smaller regions to avoid capacity insufficiency in image representation. To further demonstrate the soundness of this observation, we report the attribution entropy of networks quantized to different bitwidths, which reveals the attribution concentration. The entropy E is defined as follows:
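A standard form, assuming the attribution map \(M\) is normalized into a spatial distribution \(\hat{M}\), is
\[
E = -\sum_{i,j} \hat{M}_{ij} \log \hat{M}_{ij}, \qquad \hat{M}_{ij} = \frac{M_{ij}}{\sum_{u,v} M_{uv}}.
\]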
Large entropy indicates more diverse attribution and vice versa. Figure 13 shows the average attribution entropy and BOPs across the ImageNet validation set for ResNet18 quantized by different optimal quantization policies (searched on ImageNet). The correlation coefficient between the attribution entropy and the network BOPs is 0.733, which verifies the observation that networks with smaller capacity acquire more concentrated attribution. Regarding the value of p in (9), an excessively large p for attribution imitation leads to over-concentrated attribution: the network focuses on small image regions with little information, and its capacity is not fully utilized for feature representation. Conversely, an extremely small p results in divergent attribution, and focusing on large image regions causes capacity insufficiency in the forward pass. Therefore, we require the attribution ranks of quantized and full-precision networks to be similar, while the attribution concentration is adjusted according to the network capacity. We also conducted ablation studies to show the effectiveness of the definition in (9), leveraging two other functions that acquire p from the average bitwidth of the network:
where the concavity differs across these functions. Table 10 shows the accuracy-complexity trade-off for ResNet18 on the ImageNet validation set, where the linear form in (8) of the manuscript achieves the best performance.
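For intuition, the sketch below contrasts a linear mapping from average bitwidth to the concentration parameter p with a convex and a concave alternative of the kind ablated above. The bitwidth range, p range, and exact functional forms are illustrative assumptions, not the manuscript's definitions.

```python
import numpy as np

# Illustrative mappings from average network bitwidth to the attribution
# concentration parameter p. Lower-capacity (lower-bitwidth) networks get
# larger p, following the observation that they concentrate attribution
# more. All constants here are assumptions for illustration only.
def p_linear(avg_bits, lo=2.0, hi=8.0, p_min=1.0, p_max=3.0):
    t = (avg_bits - lo) / (hi - lo)        # normalized capacity in [0, 1]
    return p_max - t * (p_max - p_min)     # linear (the form best in Table 10)

def p_convex(avg_bits, lo=2.0, hi=8.0, p_min=1.0, p_max=3.0):
    t = (avg_bits - lo) / (hi - lo)
    return p_max - t ** 2 * (p_max - p_min)

def p_concave(avg_bits, lo=2.0, hi=8.0, p_min=1.0, p_max=3.0):
    t = (avg_bits - lo) / (hi - lo)
    return p_max - np.sqrt(t) * (p_max - p_min)

for f in (p_linear, p_convex, p_concave):
    print(f.__name__, [round(float(f(b)), 3) for b in (2, 4, 6, 8)])
```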
D. Accuracy During the Compression Policy Search
We optimized the supernet containing all bitwidth selections with and without the generalization risk in (7), respectively, where we leveraged the CIFAR-10 training set for policy search and the ImageNet validation set for evaluation. Meanwhile, we also directly utilized the ImageNet training set to optimize the supernet, and report its accuracy curve as a baseline for reference. We leveraged ResNet18 as the backbone architecture, and the BOPs budget was set to 7.5G. Evaluating an acquired quantization policy is extremely costly because the quantized model must be finetuned until convergence. Therefore, we evaluated the searched quantization policies from the different experimental settings every 10 epochs during the search, where the quantization policy with the largest importance weight was selected for evaluation. We plot the accuracy curves in Fig. 14, where the objective with the generalization risk consistently outperforms the one without it, and the advantage becomes more significant as the search converges. Meanwhile, the gap between our method and the optimal compression policy acquired by searching on ImageNet is small. These results empirically verify the higher generalization ability of our method.
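As a minimal sketch of the selection step mentioned above — picking, per layer, the bitwidth candidate with the largest importance weight from the supernet — the snippet below uses random logits as stand-ins for the learned importance parameters; the candidate set and layer count are assumptions.

```python
import numpy as np

# Per-layer softmax importance over candidate bitwidths; the discrete
# policy takes the argmax candidate in every layer. Logits here are
# random stand-ins for the supernet's learned importance parameters.
candidates = np.array([2, 4, 6, 8])
rng = np.random.default_rng(0)
logits = rng.normal(size=(18, len(candidates)))   # e.g. 18 ResNet18 layers

weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
policy = candidates[weights.argmax(axis=1)]
print(policy.tolist())
```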
E. Influence of the Sample Size of Datasets for Policy Searching
To analyze the influence of dataset sample size on policy search, we searched mixed-precision quantization policies with different sample sizes on CIFAR-10. The data amount was set to 20%, 40%, 60%, 80% and 100% of the original training set, and we report the accuracy-complexity trade-offs on ImageNet. Moreover, we also demonstrate the performance of the optimal quantization policy obtained by searching on the full ImageNet training set. The networks quantized with the acquired policies were finetuned before evaluation. The results lead to the following conclusions:
- Utilizing an extremely small amount of data (e.g., \(\leqslant 40\%\)) of CIFAR-10 usually leads to over-fitting of the quantization policy, as the accuracy gap between the acquired policy and the optimal one is large.
- Enlarging the dataset for quantization policy search alleviates the over-fitting of the acquired bitwidth assignment between policy search and model deployment, since we observe that the model achieves similar accuracy for the optimal quantization policy and those searched on the full CIFAR-10 training set.
F. Formulation of Rank Errors Caused by Attribution Noise
The attribution acquired by Grad-CAM (Selvaraju et al., 2017; Sundararajan et al., 2017) contains noise that changes the attribution values slightly. However, these errors are significantly amplified by the ranking operation, which clearly deviates the attribution rank of the full-precision networks from the correct one during attribution imitation. The generalization risk in (8) can be expanded as:
where \(\delta _{ij}\) denotes the attribution noise, which follows a Gaussian distribution with zero mean and standard deviation \(\sigma _{ij}\). The cross term in the expansion is treated as zero and omitted, because the attribution value and the noise are statistically uncorrelated. The first term in (17) is the objective we aim to optimize, and the second term can be represented as the KL-divergence between the distributions of the two ranking variables. The KL-divergence can be written as:
where we omit the subscripts i and j for simplicity. All variables related to M and k are deterministic when optimizing (17), so we treat them as a constant \(C_0\). Minimizing the second term in (17) is equivalent to minimizing the KL-divergence in (18), which in turn amounts to minimizing the standard deviation \(\sigma \) of the Gaussian noise \(\delta \). As semantically similar pixels usually have similarly distributed feature importance, we smooth the attribution of these pixels by averaging their attribution values; since their noise terms are i.i.d., the standard deviation of the averaged noise is reduced. In conclusion, leveraging semantically similar pixels for attribution smoothing reduces the rank errors caused by attribution noise, which provides accurate guidance for quantized models to locate attribution correctly.
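A numerical sketch of this variance-reduction argument: averaging the i.i.d. zero-mean Gaussian noise of the n pixels in a group shrinks its standard deviation from \(\sigma \) to \(\sigma /\sqrt{n}\). The group size and noise level below are arbitrary choices for illustration.

```python
import numpy as np

# Assigning the group-mean attribution to all n pixels in a partition
# replaces n independent noise terms with their average, whose standard
# deviation is sigma / sqrt(n).
rng = np.random.default_rng(0)
sigma, n, trials = 0.1, 64, 10000

noise = rng.normal(0.0, sigma, size=(trials, n))
print("per-pixel noise std :", noise.std())                # ~ 0.1
print("group-mean noise std:", noise.mean(axis=1).std())   # ~ 0.1 / 8
```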
G. Ablation Study w.r.t. the Number of Pixels in Attribution Rank Preservation
Since many attribution pixels have extremely low values (less than \(10^{-2}\)), having the quantized networks imitate these pixels of the full-precision networks cannot provide sufficient supervision, because noise dominates the attribution value in these cases. Therefore, we only select the top pixels by attribution value when minimizing the attribution distance between quantized and full-precision networks, so that the quantized networks mimic the informative localization ability of the full-precision networks instead of their noise. To assign the optimal value of k for selecting the top-k pixels in attribution imitation learning, we conducted ablation studies with respect to k and report the accuracy-complexity trade-offs in Table 12. A small k fails to capture sufficient information from the full-precision attribution, while a large k introduces substantial noise into attribution imitation; both degrade the trade-off between accuracy and model complexity.
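A minimal sketch of this selection, assuming a squared distance over the k highest-valued full-precision attribution pixels (the distance function, helper name, and map sizes are illustrative, not the manuscript's exact loss):

```python
import numpy as np

def topk_attribution_loss(attr_fp, attr_q, k):
    """Distance over the k pixels with the largest full-precision
    attribution, so near-zero, noise-dominated pixels are ignored."""
    fp, q = attr_fp.ravel(), attr_q.ravel()
    idx = np.argsort(fp)[-k:]                # indices of top-k FP pixels
    return float(np.mean((fp[idx] - q[idx]) ** 2))

rng = np.random.default_rng(0)
attr_fp = rng.random((14, 14))               # e.g. a Grad-CAM map
attr_q = attr_fp + rng.normal(0.0, 0.05, attr_fp.shape)
print(topk_attribution_loss(attr_fp, attr_q, k=50))
```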
H. Details of Small Datasets for Quantization Policy Search
We introduce the datasets on which we conducted experiments. For quantization policy search, we employed small datasets including CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), Cars (Krause et al., 2013), Flowers (Nilsback and Zisserman, 2008), Aircraft (Maji et al., 2013), Pets (Parkhi et al., 2012) and Food (Bossard et al., 2014). CIFAR-10 contains 60,000 images evenly divided into 10 categories, and CIFAR-100 contains the same number of images evenly distributed over 100 classes. Flowers has 8,189 images spread over 102 flower categories. Cars includes 16,185 samples of 196 types at the level of maker, model and year, and Aircraft contains 10,200 images with 100 samples for each of its 102 aircraft model variants. Pets was created with 37 dog and cat categories and 200 images per class, and Food contains 32,135 high-resolution food photos of menu items from 6 restaurants.
I. Rank Errors for Different Settings for R-GMPQ
Because the full-precision attribution can be affected by noise during network training, the attribution rank may fail to reflect the true region importance, especially for attribution pixels with similar values. Therefore, we leverage smoothing techniques to eliminate the noise in the attribution. Since the rank of the true, noise-free attribution is intractable, we trained the full-precision networks with five random seeds and used their average attribution as an approximation of the true attribution. We report the attribution rank difference on ImageNet with ResNet18 in Table 13. Comparing Table 13 with Table 2 shows that a low attribution rank difference leads to better accuracy-complexity trade-offs.
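A sketch of this measurement, assuming the rank difference is scored as the mean absolute difference of pixel ranks against the seed-averaged reference (the metric, function name, and toy data are assumptions, not the paper's exact protocol):

```python
import numpy as np

def rank_difference(attr, attr_ref):
    """Mean absolute difference between the pixel rankings of two maps."""
    r = np.argsort(np.argsort(attr.ravel()))       # pixel ranks
    r_ref = np.argsort(np.argsort(attr_ref.ravel()))
    return float(np.abs(r - r_ref).mean())

rng = np.random.default_rng(0)
base = rng.random((14, 14))                             # latent attribution
seed_maps = [base + rng.normal(0.0, 0.02, base.shape) for _ in range(5)]
reference = np.mean(seed_maps, axis=0)                  # approx. true map
print(rank_difference(seed_maps[0], reference))
```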
J. Explanation of Attribution Similarity for Generalizable Quantization Policy Search
Let us assume that \(Q_D\) and \(Q_S\) are the optimal quantization policies searched on the data in deployment and on our tractable small datasets, respectively. The generalization ability of the acquired quantization policy can be measured by the difference between the expected losses of models quantized by \(Q_D\) and \(Q_S\):
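One form consistent with the surrounding description is
\[
J = \Big| \mathbb{E}_{X \sim X_{val}}\big[L(Q_S, X)\big] - \mathbb{E}_{X \sim X_{val}}\big[L(Q_D, X)\big] \Big|,
\]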
where \(X_{val}\) represents the distribution of validation data in deployment, and L(Q, X) denotes the loss function of the neural network with quantization policy Q on dataset X. A smaller J indicates higher generalization ability of our policy \(Q_S\), because its loss is closer to that of the model quantized by \(Q_D\). We expand J as follows:
The first term \(J_1\) is the intractable loss gap on the validation data in deployment caused by quantization, and it can be regarded as a constant \(C_0\). The second term \(J_2\) corresponds to the loss gap on the training data of our small datasets caused by quantization, which our method minimizes by optimizing the task risk. The third term \(J_3\) can be rewritten as follows:
where \(\partial L(R,X)/\partial X\) and \(\partial L(Q_S,X)/\partial X\) denote the attributions of the full-precision and quantized models, respectively. Since we require similar attribution between the quantized and full-precision models, \(J_3\) is also minimized, which enhances the generalization ability of the acquired quantization policy.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, Z., Xiao, H., Zhou, J. et al. Learning Generalizable Mixed-Precision Quantization via Attribution Imitation. Int J Comput Vis 132, 5101–5123 (2024). https://doi.org/10.1007/s11263-024-02130-7