
An Information Theory-Inspired Strategy for Automated Network Pruning

International Journal of Computer Vision

Abstract

Despite their superior performance on many computer vision tasks, deep neural networks demand high computing power and a large memory footprint. Most existing network pruning methods require laborious human effort and prohibitive computational resources, especially when the constraints change. This limits the practical application of model compression when a model needs to be deployed on a wide range of devices. Moreover, existing methods still lack theoretical guidance, in particular regarding their influence on the generalization error. In this paper, we propose an information theory-inspired strategy for automated network pruning. The principle behind our method is the information bottleneck theory. Concretely, we introduce a new theorem showing that hidden representations should compress the information they share with each other to achieve better generalization. Based on this, we further introduce the normalized Hilbert-Schmidt Independence Criterion (nHSIC) on network activations as a stable and generalized indicator of layer importance. When a certain resource constraint is given, we integrate the HSIC indicator with the constraint to transform the architecture search problem into a linear programming problem with quadratic constraints. Such a problem is easily solved by a convex optimization method within a few seconds. We also provide a rigorous proof that optimizing the normalized HSIC simultaneously minimizes the mutual information between different layers. Without any search process, our method achieves better compression trade-offs than state-of-the-art compression algorithms. For instance, on ResNet-50, we achieve a 45.3% FLOPs reduction with a top-1 accuracy of 75.75% on ImageNet. Code is available at https://github.com/MAC-AutoML/ITPruner.
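As a rough illustration of the layer-importance indicator described above, the following sketch computes a normalized HSIC between two sets of activations. It assumes a linear kernel, the biased empirical HSIC estimator of Gretton et al. (2005), and the normalization HSIC(X, Y) / sqrt(HSIC(X, X) HSIC(Y, Y)) familiar from Kornblith et al. (2019). The kernel choice and the function names (`gram_linear`, `center`, `hsic`, `nhsic`) are illustrative assumptions, not the released ITPruner code.

```python
import math

def gram_linear(X):
    """Linear-kernel Gram matrix K[i][j] = <x_i, x_j> over the rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def center(K):
    """Double-center K, i.e. compute H K H with H = I - (1/n) * ones."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]

def hsic(Kc, Lc):
    """Biased empirical HSIC from centered Gram matrices: tr(Kc Lc) / (n-1)^2."""
    n = len(Kc)
    trace = sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2

def nhsic(X, Y):
    """Normalized HSIC between activations X and Y (one row per sample)."""
    if len(X) != len(Y) or len(X) < 2:
        raise ValueError("X and Y need the same number (>= 2) of samples")
    Kc, Lc = center(gram_linear(X)), center(gram_linear(Y))
    denom = math.sqrt(hsic(Kc, Kc) * hsic(Lc, Lc))
    # Constant activations carry no dependence information (assumed convention).
    return 0.0 if denom == 0.0 else hsic(Kc, Lc) / denom

# Hypothetical activations: 4 samples, 2 features each.
acts_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [0.0, 1.0]]
acts_b = [[0.5, -1.0], [2.0, 0.0], [1.0, 3.0], [4.0, 1.0]]
print("nHSIC(a, a):", nhsic(acts_a, acts_a))  # close to 1 up to rounding
print("nHSIC(a, b):", nhsic(acts_a, acts_b))
```

With this normalization the score is invariant to rescaling either set of activations, which is one reason a normalized criterion is convenient when comparing layers of different widths.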


Data Availability

The data used in this study are sourced from three publicly available datasets: CIFAR-10 (LeCun et al., 1998), ImageNet (Russakovsky et al., 2015), and PASCAL VOC2012 (Everingham et al., 2010). These datasets are widely used in the research community for various machine learning and computer vision tasks, including image classification and object detection.

All datasets are available from their respective sources: CIFAR-10 can be accessed at https://www.cs.toronto.edu/~kriz/cifar.html. ImageNet can be accessed at http://www.image-net.org/. PASCAL VOC2012 can be accessed at http://host.robots.ox.ac.uk/pascal/VOC/voc2012/.

These datasets support the findings of this study and are available for public use. Researchers interested in using these datasets can access them through the provided URLs. There are no restrictions on the availability of these data, allowing the scientific community to freely build upon the findings of this study and advance the state-of-the-art in machine learning and computer vision.

Notes

  1. Previous work (Dai et al., 2018) also proposed compressing the network using the variational information bottleneck. However, their metric is only applied inside a specific layer, which makes it neither automated nor as effective.

  2. For example, most mobile devices have the constraint that the FLOPs should be less than 600M.

  3. FBNet (Wu et al., 2019) proposed using a latency table to predict the latency on specific hardware. In ITPruner, we can measure the inference latency layer by layer to obtain an estimator using \(\varvec{\alpha }\).

  4. This setting is consistent with previous compression works, which ensures a fair comparison.

  5. https://github.com/renmengye/revnet-public/blob/master/resnet/configs/cifar_configs.py#L28

References

  • Abdelfattah, M.S., Mehrotra, A., Dudziak, Ł., & Lane, N.D. (2021) Zero-cost proxies for lightweight NAS. arXiv preprint arXiv:2101.08134

  • Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., & Anadkat, S., et al. (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  • Ahn, S., Hu, S.X., Damianou, A., Lawrence, N.D., & Dai, Z. (2019) Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9163–9171

  • Alemi, A.A., Fischer, I., Dillon, J.V., & Murphy, K. (2016) Deep variational information bottleneck. arXiv preprint arXiv:1612.00410

  • Alwani, M., Wang, Y., & Madhavan, V. (2022) Decore: Deep compression with reinforcement learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12349–12359

  • Back, T. (1996). Evolutionary algorithms in theory and practice: Evolution strategies, evolutionary programming, genetic algorithms.

  • Blalock, D., Gonzalez Ortiz, J. J., Frankle, J., & Guttag, J. (2020). What is the state of neural network pruning? Proceedings of Machine Learning and Systems, 2, 129–146.

  • Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021) Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9650–9660

  • Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A.L. (2018) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. In IEEE transactions on pattern analysis and machine intelligence (TPAMI)

  • Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H. (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587

  • Chen, X., Xie, S., & He, K. (2021) An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9640–9649

  • Chen, P., Zhang, M., Shen, Y., Sheng, K., Gao, Y., Sun, X., Li, K., & Shen, C. (2022) Efficient decoder-free object detection with transformers. In European conference on computer vision, pp. 70–86. Springer

  • Chen, X., Ding, M., Wang, X., Xin, Y., Mo, S., Wang, Y., Han, S., Luo, P., Zeng, G., & Wang, J. (2024). Context autoencoder for self-supervised representation learning. International Journal of Computer Vision, 132(1), 208–223.

  • Cover, T.M. (1999) Elements of information theory

  • Dai, B., Zhu, C., Guo, B., & Wipf, D. (2018) Compressing neural networks using the variational information bottleneck. In International conference on machine learning, pp. 1135–1144. PMLR

  • Dong, X., & Yang, Y. (2019) Network pruning via transformable architecture search. In Advances in neural information processing systems

  • Elkerdawy, S., Elhoushi, M., Zhang, H., & Ray, N. (2022) Fire together wire together: A dynamic pruning approach with self-supervised mask prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12454–12463

  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Everitt, W. N. (1958). A note on positive definite matrices. Proceedings of the Glasgow Mathematical Association, 3(4), 173–175. https://doi.org/10.1017/S2040618500033670

  • Frankle, J., & Carbin, M. (2018) The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International conference on learning representations

  • Frantar, E., & Alistarh, D. (2023) Sparsegpt: Massive language models can be accurately pruned in one-shot. In International conference on machine learning, pp. 10323–10337. PMLR

  • Gao, Z., Wang, L., Han, B., & Guo, S. (2022) Adamixer: A fast-converging query-based object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5364–5373

  • Girshick, R. (2015) Fast r-cnn. In: International conference on computer vision (ICCV)

  • Goldfeld, Z., van den Berg, E., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., & Polyanskiy, Y. (2018) Estimating information flow in neural networks. arXiv preprint arXiv:1810.05728

  • Gomez, A.N., Ren, M., Urtasun, R., & Grosse, R.B. (2017) The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems 30

  • Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005) Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pp. 63–77. Springer

  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022) Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000–16009

  • He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., & Han, S. (2018) Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European conference on computer vision (ECCV), pp. 784–800

  • He, Y., Liu, P., Wang, Z., Hu, Z., & Yang, Y. (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4340–4349

  • He, Y., Liu, P., Zhu, L., & Yang, Y. (2022) Filter pruning by switching to neighboring cnns with good attributes. In IEEE transactions on neural networks and learning systems

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016) Deep residual learning for image recognition. In: Computer vision and pattern recognition (CVPR)

  • Hoeffding, W. (1994) Probability inequalities for sums of bounded random variables. In The collected works of Wassily Hoeffding, pp. 409–426

  • Hou, L., Huang, Z., Shang, L., Jiang, X., Chen, X., & Liu, Q. (2020). Dynabert: Dynamic bert with adaptive width and depth. Advances in Neural Information Processing Systems, 33, 9782–9793.

  • Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., & Vasudevan, V. (2019) Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1314–1324

  • Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861

  • Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K.Q. (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708

  • Huang, W., Peng, Z., Dong, L., Wei, F., Jiao, J., & Ye, Q. (2023) Generic-to-specific distillation of masked autoencoders. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15996–16005

  • Ioffe, S., & Szegedy, C. (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. PMLR

  • Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., & Kalenichenko, D. (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: Computer vision and pattern recognition (CVPR)

  • Jacobsen, J.-H., Smeulders, A., & Oyallon, E. (2018) i-revnet: Deep invertible networks. arXiv preprint arXiv:1802.07088

  • Jia, D., Yuan, Y., He, H., Wu, X., Yu, H., Lin, W., Sun, L., Zhang, C., & Hu, H. (2023) Detrs with hybrid matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 19702–19712

  • Jocher, G., Nishimura, K., Mineeva, T., & Vilarino, R. (2021) Yolov5. https://github.com/ultralytics/yolov5

  • Kim, B., Jo, Y., Kim, J., & Kim, S. (2023) Misalign, contrast then distill: Rethinking misalignments in language-image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2563–2572

  • Kornblith, S., Norouzi, M., Lee, H., & Hinton, G. (2019) Similarity of neural network representations revisited. In International conference on machine learning, pp. 3519–3529. PMLR

  • Kraft, D., et al. (1988) A software package for sequential quadratic programming

  • Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems

  • Kusupati, A., Ramanujan, V., Somani, R., Wortsman, M., Jain, P., Kakade, S., & Farhadi, A. (2020) Soft threshold weight reparameterization for learnable sparsity. In International conference on machine learning, pp. 5544–5555. PMLR

  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature.

  • LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P., et al. (1998) Gradient-based learning applied to document recognition. In Proceedings of the IEEE

  • Lee, N., Ajanthan, T., & Torr, P.H. (2018) Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340

  • Lee, J., Kim, J., Shon, H., Kim, B., Kim, S. H., Lee, H., & Kim, J. (2022). Uniclip: Unified framework for contrastive language-image pre-training. Advances in Neural Information Processing Systems, 35, 1008–1019.

  • Li, Y., Adamczewski, K., Li, W., Gu, S., Timofte, R., & Van Gool, L. (2022) Revisiting random channel pruning for neural network compression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 191–201

  • Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H.P. (2016) Pruning filters for efficient convnets. In International conference on learning representations (ICLR)

  • Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., & Yan, J. (2021) Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208

  • Li, C., Wang, G., Wang, B., Liang, X., Li, Z., & Chang, X. (2022) Ds-net++: Dynamic weight slicing for efficient inference in cnns and vision transformers. In IEEE transactions on pattern analysis and machine intelligence

  • Li, F., Zeng, A., Liu, S., Zhang, H., Li, H., Zhang, L., & Ni, L.M. (2023) Lite detr: An interleaved multi-scale encoder for efficient detr. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 18558–18567

  • Lin, H., Bai, H., Liu, Z., Hou, L., Sun, M., Song, L., Wei, Y., & Sun, Z. (2024) Mope-clip: Structured pruning for efficient vision-language models with module-wise pruning error metric. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 27370–27380

  • Lin, S., Ji, R., Li, Y., Wu, Y., Huang, F., & Zhang, B. (2018) Accelerating convolutional networks via global & dynamic filter pruning. In: IJCAI, pp. 2425–2432

  • Lin, M., Ji, R., Wang, Y., Zhang, Y., Zhang, B., Tian, Y., & Shao, L. (2020) Hrank: Filter pruning using high-rank feature map. In IEEE conference on computer vision and pattern recognition

  • Lin, S., Ji, R., Yan, C., Zhang, B., Cao, L., Ye, Q., Huang, F., & Doermann, D. (2019) Towards optimal structured cnn pruning via generative adversarial learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2790–2799

  • Lin, M., Ji, R., Zhang, Y., Zhang, B., Wu, Y., & Tian, Y. (yyy) Channel pruning via automatic structure search

  • Lin, J., Mao, X., Chen, Y., Xu, L., He, Y., & Xue, H. (2022) D\(^2\)ETR: Decoder-only detr with computationally efficient cross-scale attention. arXiv preprint arXiv:2203.00860

  • Liu, H., Li, C., Wu, Q., & Lee, Y.J. (2024) Visual instruction tuning. In Advances in neural information processing systems 36

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022

  • Liu, Z., Mu, H., Zhang, X., Guo, Z., Yang, X., Cheng, K.-T., & Sun, J. (2019) Metapruning: Meta learning for automatic neural network channel pruning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3296–3305

  • Liu, Z., Sun, M., Zhou, T., Huang, G., & Darrell, T. (2018) Rethinking the value of network pruning. In International conference on learning representations

  • Long, J., Shelhamer, E., & Darrell, T. (2015) Fully convolutional networks for semantic segmentation. In: Computer vision and pattern recognition (CVPR)

  • Lorenzo-Seva, U., & Ten Berge, J. M. (2006). Tucker’s congruence coefficient as a meaningful index of factor similarity. Methodology, 2(2), 57–64.

  • Louizos, C., Welling, M., & Kingma, D.P. (2018) Learning sparse neural networks through l_0 regularization. In International conference on learning representations

  • Luo, J.-H., Wu, J., & Lin, W. (2017) Thinet: A filter level pruning method for deep neural network compression. In International conference on computer vision (ICCV)

  • Luo, G., Zhou, Y., Ren, T., Chen, S., Sun, X., & Ji, R. (2024) Cheap and quick: Efficient vision-language instruction tuning for large language models. In Advances in neural information processing systems 36

  • Ma, W.-D.K., Lewis, J., & Kleijn, W.B. (2020) The hsic bottleneck: Deep learning without back-propagation. In Proceedings of the AAAI conference on artificial intelligence, vol. 34, pp. 5085–5092

  • Ma, X., Fang, G., & Wang, X. (2023). Llm-pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36, 21702–21720.

  • Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018) Foundations of machine learning

  • Molchanov, P., Mallya, A., Tyree, S., Frosio, I., & Kautz, J. (2019) Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11264–11272

  • Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2017) Pruning convolutional neural networks for resource efficient inference. In 5th international conference on learning representations, ICLR

  • Mu, N., Kirillov, A., Wagner, D., & Xie, S. (2022) Slip: Self-supervision meets language-image pre-training. In European conference on computer vision, pp. 529–544. Springer

  • Pan, B., Panda, R., Jiang, Y., Wang, Z., Feris, R., & Oliva, A. (2021). Ia-red \(^2\): Interpretability-aware redundancy reduction for vision transformers. Advances in Neural Information Processing Systems, 34, 24898–24911.

  • Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017) Automatic differentiation in pytorch

  • Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J. (2021) Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR

  • Ren, S., He, K., Girshick, R., & Sun, J. (2017) Faster r-cnn: Towards real-time object detection with region proposal networks. In IEEE transactions on pattern analysis and machine intelligence (TPAMI)

  • Ren, S., Wei, F., Zhang, Z., & Hu, H. (2023) Tinymim: An empirical study of distilling mim pre-trained models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3687–3697

  • Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.

  • Robert, P., & Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: The RV-coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics), 25(3), 257–265.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., & Bernstein, M., et al. (2015) Imagenet large scale visual recognition challenge. IJCV

  • Saikh, T., Ghosal, T., Mittal, A., Ekbal, A., & Bhattacharyya, P. (2022). Scienceqa: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries, 23(3), 289–301.

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018) Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520

  • Savarese, P., Silva, H., & Maire, M. (2020). Winning the lottery with continuous sparsification. Advances in Neural Information Processing Systems, 33, 11380–11390.

  • Saxe, A.M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B.D., & Cox, D.D. (2019) On the information bottleneck theory of deep learning. In Journal of statistical mechanics: Theory and experiment

  • Shwartz-Ziv, R., & Tishby, N. (2017) Opening the black box of deep neural networks via information. CoRR

  • Shwartz-Ziv, R., Painsky, A., & Tishby, N. (2018) Representation compression and generalization in deep neural networks

  • Simonyan, K., & Zisserman, A. (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  • Song, H., Sun, D., Chun, S., Jampani, V., Han, D., Heo, B., Kim, W., & Yang, M.-H. (2022) An extendable, efficient and effective transformer-based object detector. arXiv preprint arXiv:2204.07962

  • Sun, M., Liu, Z., Bair, A., & Kolter, J.Z. (2023) A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695

  • Tan, M., & Le, Q. (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. PMLR

  • Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q.V. (2019) Mnasnet: Platform-aware neural architecture search for mobile. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2820–2828

  • Tan, M., Pang, R., & Le, Q.V. (2020) Efficientdet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10781–10790

  • Tanaka, H., Kunin, D., Yamins, D. L., & Ganguli, S. (2020). Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33, 6377–6389.

  • Tishby, N., Pereira, F.C., & Bialek, W. (1999) The information bottleneck method. In: Proc. of the 37-th annual allerton conference on communication, control and computing

  • Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021) Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pp. 10347–10357. PMLR

  • Turner, J., Crowley, E.J., O’Boyle, M., Storkey, A., & Gray, G. (2019) Blockswap: Fisher-guided block substitution for network compression on a budget. arXiv preprint arXiv:1906.04113

  • Wang, J., Bai, H., Wu, J., Shi, X., Huang, J., King, I., Lyu, M., & Cheng, J. (2020) Revisiting parameter sharing for automatic neural channel number search. In Advances in neural information processing systems 33

  • Wang, C.-Y., Bochkovskiy, A., & Liao, H.-Y.M. (2023) Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7464–7475

  • Wang, J., Chen, Y., Chakraborty, R., & Yu, S.X. (2020) Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11505–11515

  • Wang, S., Gao, J., Li, Z., Zhang, X., & Hu, W. (2023) A closer look at self-supervised lightweight vision transformers. In International conference on machine learning, pp. 35624–35641. PMLR

  • Wang, Z., Huang, S.-L., Kuruoglu, E.E., Sun, J., Chen, X., & Zheng, Y. (2021) Pac-bayes information bottleneck. arXiv preprint arXiv:2109.14509

  • Wang, T., Yuan, L., Chen, Y., Feng, J., & Yan, S. (2021) Pnp-detr: Towards efficient visual analysis with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4661–4670

  • Wang, T., Zhou, W., Zeng, Y., & Zhang, X. (2022) Efficientvlm: Fast and accurate vision-language models via knowledge distillation and modal-adaptive pruning. arXiv preprint arXiv:2210.07795

  • Wen, W., Wu, C., Wang, Y., Chen, Y., & Li, H. (2016) Learning structured sparsity in deep neural networks. In Advances in neural information processing systems (NeurIPS)

  • Wortsman, M., Farhadi, A., & Rastegari, M. (2019) Discovering neural wirings. In Advances in neural information processing systems 32

  • Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., & Keutzer, K. (2019) Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

  • Wu, K., Peng, H., Zhou, Z., Xiao, B., Liu, M., Yuan, L., Xuan, H., Valenzuela, M., Chen, X.S., & Wang, X. (2023) Tinyclip: Clip distillation via affinity mimicking and weight inheritance. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 21970–21980

  • Yang, K., Deng, J., An, X., Li, J., Feng, Z., Guo, J., Yang, J., & Liu, T. (2023) Alip: Adaptive language-image pre-training with synthetic caption. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2922–2931

  • Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sandler, M., Sze, V., & Adam, H. (2018) Netadapt: Platform-aware neural network adaptation for mobile applications. In: Proceedings of the European conference on computer vision (ECCV), pp. 285–300

  • Yang, H., Yin, H., Molchanov, P., Li, H., & Kautz, J. (2021) Nvit: Vision transformer compression and parameter redistribution

  • Yang, Y., Zhuang, Y., & Pan, Y. (2021). Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 22(12), 1551–1558.

  • Yu, J., & Huang, T. (2019) Autoslim: Towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728

  • Yu, F., Huang, K., Wang, M., Cheng, Y., Chu, W., & Cui, L. (2022) Width & depth pruning for vision transformers. In AAAI conference on artificial intelligence (AAAI), vol. 2022

  • Yuan, X., Savarese, P.H.P., & Maire, M. (2020) Growing efficient deep networks by structured continuous sparsification. In International conference on learning representations

  • Zhang, G., Luo, Z., Tian, Z., Zhang, J., Zhang, X., & Lu, S. (2023) Towards efficient use of multi-scale features in transformer-based object detectors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6206–6216

  • Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., & Smola, A. (2023) Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923

  • Zhao, C., Ni, B., Zhang, J., Zhao, Q., Zhang, W., & Tian, Q. (2019) Variational convolutional neural network pruning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2780–2789

  • Zhou, A., Li, Y., Qin, Z., Liu, J., Pan, J., Zhang, R., Zhao, R., Gao, P., & Li, H. (2023) Sparsemae: Sparse training meets masked autoencoders. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 16176–16186

  • Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., & Zou, Y. (2016) Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. In Computer vision and pattern recognition (CVPR)

Author information

Corresponding authors

Correspondence to Xiawu Zheng or Rongrong Ji.

Additional information

Communicated by Jianfei Cai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Proof of Theorem 1

Theorem 6

Assume that the input random variable X follows a Markov random field structure and that the Markov random field is ergodic. For a network that has L hidden representations \(X_1, X_2,\ldots , X_L\), with probability \(1-\delta \), the generalization error \(\epsilon \) is bounded by

$$\begin{aligned} \epsilon \le \sum _{i=1}^L\sqrt{\frac{\log \frac{2}{\delta } + \text {I}(X;X_i)\log 2}{2n}}, \end{aligned}$$

where n is the number of training examples.
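To make the dependence of the bound on the mutual information concrete, the sketch below evaluates the right-hand side of the theorem for given per-layer values of \(\text {I}(X;X_i)\). The function name `generalization_bound` and the mutual information values are hypothetical and chosen only for illustration; mutual information is assumed to be measured in bits, matching the \(\log 2\) factor in the bound.

```python
import math

def generalization_bound(mi_bits, n, delta):
    """Evaluate sum_i sqrt((log(2/delta) + I_i * log 2) / (2n)).

    mi_bits: mutual information I(X; X_i) for each hidden representation,
             in bits (hypothetical inputs; estimating MI is non-trivial).
    n:       number of training examples.
    delta:   failure probability, so the bound holds with prob. 1 - delta.
    """
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0 and 0 < delta < 1")
    const = math.log(2.0 / delta)
    return sum(math.sqrt((const + mi * math.log(2.0)) / (2.0 * n))
               for mi in mi_bits)

# Hypothetical per-layer MI values (bits) for a 3-layer network.
loose = generalization_bound([40.0, 30.0, 20.0], n=50_000, delta=0.05)
tight = generalization_bound([20.0, 15.0, 10.0], n=50_000, delta=0.05)
print(f"bound with larger MI:  {loose:.4f}")
print(f"bound with smaller MI: {tight:.4f}")
```

Holding n and \(\delta \) fixed, every term is increasing in \(\text {I}(X;X_i)\), so representations that retain less information about the input yield a smaller bound.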

Proof

Let \(\mathcal {D}_n = \{(x_1, y_1),\ldots ,(x_n, y_n)\}\) be an i.i.d. random sample from \(P_{X,Y}\). We define the loss function as a mapping \(l:\mathcal {Y}\times \mathcal {Y}\rightarrow \mathbb {R}^+\). The performance of a predictor \(h:\mathcal {X}\rightarrow \mathcal {Y}\) is measured by its expected loss under \(P_{X,Y}\), which is also known as the generalization error and formally defined as \(R(h):=\mathbb {E}_{X,Y}\left[ l(Y, h(X))\right] .\) In practice, \(P_{X,Y}\) is usually unknown, so we estimate \(\hat{h}\) based on \(\mathcal {D}_n\), and the corresponding empirical risk is defined as \(\hat{R}_n(h):=\frac{1}{n}\sum _{i=1}^nl(y_i, h(x_i)).\) For simplicity, we further take the loss to be the 0-1 loss \(l = I\left( h(X)\ne Y\right) \), where \(I(\cdot )\) is the indicator function. In this case, applying the Hoeffding inequality (Hoeffding, 1994) to a fixed predictor h, we have

$$\begin{aligned} P\left( \left| \hat{R}_n(h)-R(h)\right| \ge \epsilon \right) \le 2\exp \left( -2n\epsilon ^2\right) . \end{aligned}$$
(A1)
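As a sanity check on Eq. (A1), the following simulation compares the observed frequency of large deviations with the Hoeffding bound for a single fixed predictor whose 0-1 losses are i.i.d. Bernoulli. The error rate `p`, the sample size, the tolerance, and the function name `hoeffding_check` are arbitrary choices made for this sketch.

```python
import math
import random

def hoeffding_check(p=0.3, n=100, eps=0.1, trials=20_000, seed=0):
    """Compare a simulated deviation frequency with 2 * exp(-2 * n * eps^2).

    Each trial draws n i.i.d. 0-1 losses with mean p (standing in for the
    losses of one fixed predictor h) and records whether the empirical
    risk deviates from the true risk p by at least eps.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        emp_risk = sum(rng.random() < p for _ in range(n)) / n
        if abs(emp_risk - p) >= eps:
            hits += 1
    return hits / trials, 2.0 * math.exp(-2.0 * n * eps ** 2)

freq, bound = hoeffding_check()
print(f"simulated frequency = {freq:.4f}, Hoeffding bound = {bound:.4f}")
```

The bound is distribution-free, so the simulated frequency is typically well below it; the inequality only guarantees that it is not exceeded.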

Combining this with the union bound over \(\mathcal {H}\) from PAC learning theory (Mohri et al., 2018), we obtain the following corollary:

Corollary 1

Let \(\mathcal {H}\) be a finite set of hypotheses. Then for all \(\epsilon >0\), we have

$$\begin{aligned} P\left( \exists h\in \mathcal {H} \Bigg | \left| \hat{R}_n(h)-R(h)\right| \ge \epsilon \right) \le 2\left| \mathcal {H} \right| \exp \left( -2n\epsilon ^2\right) . \end{aligned}$$
(A2)

We further require the failure probability in Corollary 1 to be at most \(\delta \). That is, we ask

$$\begin{aligned} 2\left| \mathcal {H} \right| \exp \left( -2n\epsilon ^2\right) \le \delta . \end{aligned}$$

Solving this inequality for \(\epsilon \), we have

$$\begin{aligned} 2\left| \mathcal {H} \right| \exp \left( -2n\epsilon ^2\right)&\le \delta \\ \exp \left( -2n\epsilon ^2\right)&\le \frac{\delta }{2\left| \mathcal {H} \right| }\\ -2n\epsilon ^2&\le \log \frac{\delta }{2\left| \mathcal {H} \right| }\\ \epsilon ^2&\ge \frac{\log \frac{2}{\delta }+ \log \left| \mathcal {H} \right| }{2n}. \end{aligned}$$
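The rearrangement above can be checked numerically. The sketch below computes the smallest tolerance \(\epsilon \) allowed by the last line and substitutes it back into the right-hand side of Eq. (A2), which should recover \(\delta \). The function names and the values of \(\left| \mathcal {H} \right| \), n, and \(\delta \) are hypothetical.

```python
import math

def min_tolerance(h_size, n, delta):
    """Smallest eps satisfying 2 * |H| * exp(-2 * n * eps^2) <= delta."""
    return math.sqrt((math.log(2.0 / delta) + math.log(h_size)) / (2.0 * n))

def failure_prob_bound(h_size, n, eps):
    """Right-hand side of Eq. (A2): 2 * |H| * exp(-2 * n * eps^2)."""
    return 2.0 * h_size * math.exp(-2.0 * n * eps ** 2)

# Hypothetical setting: 1000 hypotheses, 10,000 samples, 95% confidence.
eps = min_tolerance(h_size=1000, n=10_000, delta=0.05)
print(f"smallest admissible eps: {eps:.4f}")
print(f"failure bound at that eps: {failure_prob_bound(1000, 10_000, eps):.4f}")
```

Any larger tolerance makes the failure bound smaller than \(\delta \), which matches the direction of the final inequality.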

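Taking the smallest such \(\epsilon \), i.e., equality in the last line, Corollary 1 implies that the squared deviation between the empirical and the true risk, which with a slight abuse of notation we also denote by \(\epsilon ^2\), is bounded by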

$$\begin{aligned} \epsilon ^2\le \frac{\log \frac{2}{\delta }+ \log \left| \mathcal {H} \right| }{2n}, \end{aligned}$$
(A3)

with probability at least \(1-\delta \). To prove our theorem, we further introduce a lemma (Shwartz-Ziv et al., 2018) based on the Asymptotic Equipartition Property (AEP) (Cover, 1999) as follows:

Lemma 1

Assume that X is a random variable that follows a Markov random field structure and that the Markov random field is ergodic. Then we have

$$\begin{aligned} \left| \mathcal {H} \right| \le 2^{H(X)}, \end{aligned}$$
(A4)

Let \(X_i\) be a mapping of X. Then the size of the typical set \(|\mathcal {T}|\) is bounded by

$$\begin{aligned} |\mathcal {T}| \le \frac{2^{H(X)}}{2^{H(X|X_i)}} = 2^{\text {I}(X;X_i)} \end{aligned}$$
(A5)

Therefore, we conclude that for a mapping T, the squared error is bounded by

$$\begin{aligned} \epsilon ^2_T\le \frac{\log \frac{2}{\delta }+ \log |\mathcal {T}|}{2n} \le \frac{\log \frac{2}{\delta }+ \text {I}(X;X_i)\log 2}{2n}. \end{aligned}$$
(A6)

Considering that there are L mappings in the network, we take the square root of the per-mapping bound and sum over the mappings. We thus conclude that the overall error is bounded by

$$\begin{aligned} \epsilon \le \sum _{i=1}^L\sqrt{\frac{\log \frac{2}{\delta } + \text {I}(X;X_i)\log 2}{2n}}, \end{aligned}$$
(A7)

with a probability \(1-\delta \).
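
To make the role of the mutual information in (A7) concrete, the following minimal sketch evaluates the bound for two sets of layer-wise mutual information values. The values are hypothetical and chosen only for illustration. They are assumed to be measured in bits, which matches the factor \(\log 2\) in (A6), and the function names are not taken from the released code.

```python
import math


def layer_term(mi_bits, n, delta):
    """One summand of (A7): sqrt((log(2/delta) + I(X;X_i) * log 2) / (2n)).

    `mi_bits` is the mutual information I(X; X_i) in bits (assumption).
    """
    return math.sqrt(
        (math.log(2.0 / delta) + mi_bits * math.log(2.0)) / (2.0 * n)
    )


def network_bound(mi_per_layer, n, delta):
    """Right-hand side of (A7): sum of the per-layer terms."""
    return sum(layer_term(mi, n, delta) for mi in mi_per_layer)


# Hypothetical layer-wise mutual information values (illustration only).
mi_original = [12.0, 9.0, 6.0]
mi_compressed = [8.0, 6.0, 4.0]
n, delta = 50_000, 0.05

bound_original = network_bound(mi_original, n, delta)
bound_compressed = network_bound(mi_compressed, n, delta)
print(f"bound with larger MI  = {bound_original:.4f}")
print(f"bound with smaller MI = {bound_compressed:.4f}")
```

Because every summand increases monotonically with \(\text {I}(X;X_i)\), lowering the mutual information of any layer tightens the bound.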

Fig. 8

The relationship between mutual information (MI) and accuracy on the validation set. The x-axis shows the nHSIC of the sampled features, and the y-axis shows the accuracy of the randomly sampled architectures. The correlation between MI and accuracy is \(-0.955872\). In other words, nHSIC is negatively correlated with the finetuned accuracy and thus can be used as an accurate predictor in model compression

Appendix B The Debate on the Information Bottleneck

ITPruner is based on the theory of the information bottleneck in deep learning (Shwartz-Ziv & Tishby, 2017), which is challenged by Saxe et al. (2019). Saxe et al. (2019) argue that the representation compression phase empirically demonstrated in Shwartz-Ziv and Tishby (2017) only appears when double-sided saturating nonlinearities such as tanh and sigmoid are used. They therefore also question the claimed causality between compression and generalization, which is the foundation of ITPruner.

To address this debate, we first propose Theorem 1 to demonstrate mathematically that reducing the mutual information between the input and the hidden representations indeed leads to a smaller bound on the generalization error. Second, we provide empirical evidence in Fig. 8 that the performance on the validation set is highly correlated with the layer-wise mutual information. Finally, the critical argument raised in Saxe et al. (2019) may be caused by inaccurate estimation of mutual information. Recent work (Wang et al., 2021) proposes a new method to estimate the mutual information and empirically confirms the original experiments in Shwartz-Ziv and Tishby (2017). For example, Fig. 9 reports the change of mutual information (MI) between the input and different layers during training. There is a clear boundary between the initial fitting phase (increase of mutual information) and the compression phase (decrease of mutual information). Overall, we conclude that the generalized information bottleneck principle behind ITPruner is sound both theoretically and empirically.

Fig. 9

The change of mutual information (MI) between the input and different layers during training. In each layer, there is a clear boundary between the initial fitting phase (increase of mutual information) and the compression phase (decrease of mutual information)

Table 12 The effect of different regularization terms on ResNet-20 and Invertable-ResNet. “baseline” indicates that the model is trained without weight-decay

Appendix C The Debate on the Invertable-ResNet

Gomez et al. (2017) and Jacobsen et al. (2018) propose reversible networks, which are capable of constructing hidden representations that encode the input. Here we argue that Invertable-ResNet (Gomez et al., 2017) is not an experimental counterexample to our Theorem 6.

Specifically, Invertable-ResNet (Gomez et al., 2017) actually needs to optimize the layer-wise mutual information mentioned in Theorem 5 for better generalization. First, in their released source code (Footnote 5), weight decay is necessary for better generalization. We also conduct an ablation study to validate the effectiveness of weight decay. As shown in Tab. 12, both ResNet-20 and Invertable-ResNet show a significant performance improvement when the regularization term is adopted in training. Meanwhile, as theoretically demonstrated in Theorem 5, the adopted weight decay regularization is highly correlated with mutual information. Moreover, directly replacing weight decay with mutual information still yields a performance improvement, as reported in Tab. 12. We thus conclude that optimizing mutual information in Invertable-ResNet (Gomez et al., 2017) is necessary.

Two types of hidden representations exist simultaneously in Invertable-ResNet (Gomez et al., 2017). One is highly related to the input X and uncorrelated with the output Y. The other is the opposite: it is related to Y and uncorrelated with X. To validate our argument, we conduct a new experiment that extracts features from Invertable-ResNet and reports their mutual information with the input and the output. As illustrated in Fig. 10, most of the features have a certain degree of correlation with both X and Y. However, the two types of features described above do exist in Invertable-ResNet; they are marked as darker red dots in Fig. 10.

Fig. 10

The mutual information between the intermediate feature map and the input X, output Y, respectively. Each point represents a different channel in the feature map

Theorem 5

Assume that \(X \sim \mathcal {N} (\varvec{0}, \varvec{\Sigma }_{X})\) and \(Y \sim \mathcal {N} (\varvec{0}, \varvec{\Sigma }_{Y})\) with \(Y=W^TX\). Then \(\min I(X;Y) \propto \min ||W||_F^2\).

Proof

The definition of mutual information is \(I(X;Y)=H(X)+H(Y)-H(X,Y)\). Meanwhile, the entropy of a D-dimensional multivariate Gaussian distribution is \(H(X)=\frac{1}{2}\text {ln}({(2\pi e)}^D|\Sigma _X|)\), and the joint distribution is \((X,Y) \sim N\left( 0,\Sigma _{(X,Y)}\right) \), where

$$\begin{aligned} \Sigma _{(X,Y)}=\left( \begin{array}{cc} \Sigma _{X} & \Sigma _{XY}\\ \Sigma _{YX} & \Sigma _{Y}\\ \end{array} \right) . \end{aligned}$$
(C8)

Table 13 Top-1 accuracy of compressed VGG using different local metrics and architectures on CIFAR-10

Therefore, since the \((2\pi e)\) factors cancel, the mutual information between the Gaussian distributed random variables X and Y is

$$\begin{aligned} I(X;Y)=\frac{1}{2}\left( \text {ln}|\Sigma _X|+\text {ln}|\Sigma _Y|-\text {ln}|\Sigma _{(X,Y)}|\right) . \end{aligned}$$
(C9)
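
The Gaussian closed form can be evaluated numerically. The sketch below uses the standard expression, which carries a factor of 1/2 that does not affect the proportionality argument. One loud assumption: because \(Y=W^TX\) without noise makes the joint covariance singular, the example adds independent Gaussian noise of variance `noise_var` to Y so that all determinants are well defined. The matrices and scales are random illustrative values, not quantities from the paper.

```python
import numpy as np


def logdet(matrix):
    """Log-determinant of a symmetric positive definite matrix."""
    return np.linalg.slogdet(matrix)[1]


def gaussian_mi(sigma_x, w, noise_var=1.0):
    """I(X;Y) in nats for X ~ N(0, sigma_x) and Y = W^T X + noise.

    The additive noise N(0, noise_var * I) is an assumption made here
    to keep the joint covariance non-singular.
    """
    d_out = w.shape[1]
    sigma_xy = sigma_x @ w                                # Cov(X, Y)
    sigma_y = w.T @ sigma_x @ w + noise_var * np.eye(d_out)
    joint = np.block([[sigma_x, sigma_xy], [sigma_xy.T, sigma_y]])
    return 0.5 * (logdet(sigma_x) + logdet(sigma_y) - logdet(joint))


rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))
sigma_x = a @ a.T + np.eye(4)        # symmetric positive definite
w = rng.standard_normal((4, 3))

scales = (1.0, 0.5, 0.1)
mi_values = [gaussian_mi(sigma_x, s * w) for s in scales]
for s, mi in zip(scales, mi_values):
    fro_sq = np.linalg.norm(s * w, ord="fro") ** 2
    print(f"||W||_F^2 = {fro_sq:8.4f}  I(X;Y) = {mi:.4f} nats")
```

Under this noisy model, shrinking W lowers both \(||W||_F^2\) and the mutual information, which is the qualitative behaviour that Theorem 5 relies on.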

According to the Everitt inequality, \(|\Sigma _{(X,Y)}| \le |\Sigma _X||\Sigma _Y|\), where equality, and hence \(I(X;Y)=0\), holds if and only if \(\Sigma _{YX}=\Sigma _{XY}^T=X^TY\) is a zero matrix. In other words,

$$\begin{aligned} \begin{aligned}&\min I(X;Y) \\ \propto&\min ~||\varvec{X}^T\varvec{Y}||_F^2 \\ \propto&\min ~||\varvec{X}^T\varvec{W}^T\varvec{X}||_F^2. \end{aligned} \end{aligned}$$
(C10)

For simplicity, we take an orthogonalized \(\varvec{X}\). According to the unitary invariance of the Frobenius norm, we have

$$\begin{aligned} \begin{aligned}&\min ~||\varvec{X}^T\varvec{W}^T\varvec{X}||_F^2 \\ =&\min ~||\varvec{W}||_F^2. \end{aligned} \end{aligned}$$
(C11)

\(\square \)
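
The last step of the proof, (C11), can be verified numerically. The following minimal sketch assumes a square orthogonal \(\varvec{X}\) and a square \(\varvec{W}\) so that the product \(\varvec{X}^T\varvec{W}^T\varvec{X}\) is well defined. The dimension and the random matrices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # illustrative dimension (assumption)

# Build an orthogonal matrix from the QR decomposition of a random matrix.
x_orth, _ = np.linalg.qr(rng.standard_normal((d, d)))
w = rng.standard_normal((d, d))

# Left-hand side and right-hand side of (C11), before squaring.
lhs = np.linalg.norm(x_orth.T @ w.T @ x_orth, ord="fro")
rhs = np.linalg.norm(w, ord="fro")

print(f"||X^T W^T X||_F = {lhs:.6f}")
print(f"||W||_F         = {rhs:.6f}")
```

The two norms agree up to floating-point error because multiplying by an orthogonal matrix on either side, or transposing, leaves the Frobenius norm unchanged.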

Appendix D Ablation Study

We conduct an ablation study to evaluate our claim that selecting which weights in a layer need to be pruned does not help as much as finding a better \(\alpha \). As shown in Tab. 13, ITPruner always shows a significant improvement over the original methods. Meanwhile, integrating different pruning methods yields very similar performance. We thus conclude that selecting which weights in a layer need to be pruned does not help as much as finding a better \(\alpha \), which is also well verified in the previous survey (Blalock et al., 2020).

Cite this article

Zheng, X., Ma, Y., Xi, T. et al. An Information Theory-Inspired Strategy for Automated Network Pruning. Int J Comput Vis 133, 5455–5482 (2025). https://doi.org/10.1007/s11263-025-02437-z

Keywords