Abstract
Despite their superior performance on many computer vision tasks, deep neural networks demand high computing power and a large memory footprint. Most existing network pruning methods require laborious human effort and prohibitive computational resources, especially when the constraints change. This limits the practical application of model compression when a model needs to be deployed on a wide range of devices. Besides, existing methods still lack theoretical guidance, in particular regarding their influence on the generalization error. In this paper, we propose an information theory-inspired strategy for automated network pruning. The principle behind our method is the information bottleneck theory. Concretely, we introduce a new theorem to show that the hidden representations should compress the information they share with each other to achieve better generalization. Building on this, we introduce the normalized Hilbert-Schmidt Independence Criterion (nHSIC) on network activations as a stable and general indicator of layer importance. When a certain resource constraint is given, we integrate the HSIC indicator with the constraint to transform the architecture search problem into a linear programming problem with quadratic constraints. Such a problem is easily solved by a convex optimization method within a few seconds. We also provide a rigorous proof that optimizing the normalized HSIC simultaneously minimizes the mutual information between different layers. Without any search process, our method achieves better compression trade-offs than state-of-the-art compression algorithms. For instance, on ResNet-50, we achieve a 45.3% FLOPs reduction with 75.75% top-1 accuracy on ImageNet. Code is available at https://github.com/MAC-AutoML/ITPruner.
Data Availability
The data used in this study are sourced from three publicly available datasets: CIFAR-10 (LeCun et al., 1998), ImageNet (Russakovsky et al., 2015), and PASCAL VOC2012 (Everingham et al., 2010). These datasets are widely used in the research community for various machine learning and computer vision tasks, including image classification and object detection.
All datasets are available from their respective sources: CIFAR-10 can be accessed at https://www.cs.toronto.edu/~kriz/cifar.html. ImageNet can be accessed at http://www.image-net.org/. PASCAL VOC2012 can be accessed at http://host.robots.ox.ac.uk/pascal/VOC/voc2012/.
These datasets support the findings of this study and are available for public use. Researchers interested in using these datasets can access them through the provided URLs. There are no restrictions on the availability of these data, allowing the scientific community to freely build upon the findings of this study and advance the state-of-the-art in machine learning and computer vision.
Notes
Previous work (Dai et al., 2018) also proposed to compress the network using the variational information bottleneck. However, their metric is applied only inside a specific layer, so the approach is not automated and is less effective.
For example, most mobile devices have the constraint that the FLOPs should be less than 600M.
FBNet (Wu et al., 2019) proposed using a latency table to predict the latency on specific hardware. In ITPruner, we can measure the inference latency layer by layer to obtain an estimator using \(\varvec{\alpha }\).
This setting is consistent with previous compression works, for a fair comparison.
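The FLOPs budget and the layer-by-layer estimate mentioned in these notes can be illustrated with a small sketch. It assumes a plain chain of convolutional layers in which \(\alpha_i\) is the fraction of output channels kept in layer i, so that the cost of each layer scales with the product of the kept fractions of its input and output channels. This reading of \(\varvec{\alpha }\), the function names, and the per-layer numbers are assumptions for illustration, not values from the paper.

```python
from typing import Sequence


def estimate_flops(alpha: Sequence[float], dense_flops: Sequence[float]) -> float:
    """Estimate the FLOPs of a pruned chain of convolutional layers.

    alpha[i] is the assumed fraction of output channels kept in layer i.
    dense_flops[i] is the assumed FLOPs of layer i before pruning.
    The network input (the image) is never pruned.
    """
    if len(alpha) != len(dense_flops):
        raise ValueError("alpha and dense_flops must have the same length")
    total = 0.0
    keep_in = 1.0  # fraction of input channels kept for the first layer
    for keep_out, flops in zip(alpha, dense_flops):
        # A convolution's cost is proportional to in_channels * out_channels,
        # so each term is a product of two entries of alpha.
        total += keep_in * keep_out * flops
        keep_in = keep_out
    return total


def within_budget(alpha: Sequence[float], dense_flops: Sequence[float],
                  budget: float = 600e6) -> bool:
    """Check the estimate against a FLOPs budget (600M by default)."""
    return estimate_flops(alpha, dense_flops) < budget


# Hypothetical three-layer network with 800M FLOPs before pruning.
layer_flops = [100e6, 400e6, 300e6]
for candidate in ([1.0, 1.0, 1.0], [0.5, 0.5, 1.0]):
    flops = estimate_flops(candidate, layer_flops)
    print(candidate, f"{flops / 1e6:.0f}M", within_budget(candidate, layer_flops))
```

Because every term multiplies two entries of \(\varvec{\alpha }\), an estimate of this shape is quadratic in \(\varvec{\alpha }\), which is one way a resource budget can turn into a quadratic constraint.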
References
Abdelfattah, M.S., Mehrotra, A., Dudziak, Ł., & Lane, N.D. (2021) Zero-cost proxies for lightweight NAS. arXiv preprint arXiv:2101.08134
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., & Anadkat, S., et al. (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774
Ahn, S., Hu, S.X., Damianou, A., Lawrence, N.D., & Dai, Z. (2019) Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9163–9171
Alemi, A.A., Fischer, I., Dillon, J.V., & Murphy, K. (2016) Deep variational information bottleneck. arXiv preprint arXiv:1612.00410
Alwani, M., Wang, Y., & Madhavan, V. (2022) Decore: Deep compression with reinforcement learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12349–12359
Back, T. (1996). Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms.
Blalock, D., Gonzalez Ortiz, J. J., Frankle, J., & Guttag, J. (2020). What is the state of neural network pruning? Proceedings of Machine Learning and Systems, 2, 129–146.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021) Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9650–9660
Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A.L. (2018) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFS. In IEEE transactions on pattern analysis and machine intelligence (TPAMI)
Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H. (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen, X., Xie, S., & He, K. (2021) An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9640–9649
Chen, P., Zhang, M., Shen, Y., Sheng, K., Gao, Y., Sun, X., Li, K., & Shen, C. (2022) Efficient decoder-free object detection with transformers. In European conference on computer vision, pp. 70–86. Springer
Chen, X., Ding, M., Wang, X., Xin, Y., Mo, S., Wang, Y., Han, S., Luo, P., Zeng, G., & Wang, J. (2024). Context autoencoder for self-supervised representation learning. International Journal of Computer Vision, 132(1), 208–223.
Cover, T.M. (1999) Elements of information theory
Dai, B., Zhu, C., Guo, B., & Wipf, D. (2018) Compressing neural networks using the variational information bottleneck. In International conference on machine learning, pp. 1135–1144. PMLR
Dong, X., & Yang, Y. (2019) Network pruning via transformable architecture search. In Advances in neural information processing systems
Elkerdawy, S., Elhoushi, M., Zhang, H., & Ray, N. (2022) Fire together wire together: A dynamic pruning approach with self-supervised mask prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12454–12463
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.
Everitt, W. N. (1958). A note on positive definite matrices. Proceedings of the Glasgow Mathematical Association, 3(4), 173–175. https://doi.org/10.1017/S2040618500033670
Frankle, J., & Carbin, M. (2018) The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International conference on learning representations
Frantar, E., & Alistarh, D. (2023) Sparsegpt: Massive language models can be accurately pruned in one-shot. In International conference on machine learning, pp. 10323–10337. PMLR
Gao, Z., Wang, L., Han, B., & Guo, S. (2022) Adamixer: A fast-converging query-based object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5364–5373
Girshick, R. (2015) Fast r-cnn. In: International conference on computer vision (ICCV)
Goldfeld, Z., van den Berg, E., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., & Polyanskiy, Y. (2018) Estimating information flow in neural networks. arXiv preprint arXiv:1810.05728
Gomez, A.N., Ren, M., Urtasun, R., & Grosse, R.B. (2017) The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems 30
Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005) Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pp. 63–77. Springer
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022) Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000–16009
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., & Han, S. (2018) Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European conference on computer vision (ECCV), pp. 784–800
He, Y., Liu, P., Wang, Z., Hu, Z., & Yang, Y. (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4340–4349
He, Y., Liu, P., Zhu, L., & Yang, Y. (2022) Filter pruning by switching to neighboring cnns with good attributes. In IEEE transactions on neural networks and learning systems
He, K., Zhang, X., Ren, S., & Sun, J. (2016) Deep residual learning for image recognition. In: Computer vision and pattern recognition (CVPR)
Hoeffding, W. (1994) Probability inequalities for sums of bounded random variables. In The collected works of Wassily Hoeffding, pp. 409–426
Hou, L., Huang, Z., Shang, L., Jiang, X., Chen, X., & Liu, Q. (2020). Dynabert: Dynamic bert with adaptive width and depth. Advances in Neural Information Processing Systems, 33, 9782–9793.
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., & Vasudevan, V. (2019) Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1314–1324
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K.Q. (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708
Huang, W., Peng, Z., Dong, L., Wei, F., Jiao, J., & Ye, Q. (2023) Generic-to-specific distillation of masked autoencoders. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15996–16005
Ioffe, S., & Szegedy, C. (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. PMLR
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., & Kalenichenko, D. (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: Computer vision and pattern recognition (CVPR)
Jacobsen, J.-H., Smeulders, A., & Oyallon, E. (2018) i-revnet: Deep invertible networks. arXiv preprint arXiv:1802.07088
Jia, D., Yuan, Y., He, H., Wu, X., Yu, H., Lin, W., Sun, L., Zhang, C., & Hu, H. (2023) Detrs with hybrid matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 19702–19712
Jocher, G., Nishimura, K., Mineeva, T., & Vilarino, R. (2021) Yolov5. https://github.com/ultralytics/yolov5
Kim, B., Jo, Y., Kim, J., & Kim, S. (2023) Misalign, contrast then distill: Rethinking misalignments in language-image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2563–2572
Kornblith, S., Norouzi, M., Lee, H., & Hinton, G. (2019) Similarity of neural network representations revisited. In International conference on machine learning, pp. 3519–3529. PMLR
Kraft, D., et al. (1988) A software package for sequential quadratic programming
Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems
Kusupati, A., Ramanujan, V., Somani, R., Wortsman, M., Jain, P., Kakade, S., & Farhadi, A. (2020) Soft threshold weight reparameterization for learnable sparsity. In International conference on machine learning, pp. 5544–5555. PMLR
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P., et al. (1998) Gradient-based learning applied to document recognition. In Proceedings of the IEEE
Lee, N., Ajanthan, T., & Torr, P.H. (2018) Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340
Lee, J., Kim, J., Shon, H., Kim, B., Kim, S. H., Lee, H., & Kim, J. (2022). Uniclip: Unified framework for contrastive language-image pre-training. Advances in Neural Information Processing Systems, 35, 1008–1019.
Li, Y., Adamczewski, K., Li, W., Gu, S., Timofte, R., & Van Gool, L. (2022) Revisiting random channel pruning for neural network compression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 191–201
Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H.P. (2016) Pruning filters for efficient convnets. In International conference on learning representations (ICLR)
Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., & Yan, J. (2021) Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208
Li, C., Wang, G., Wang, B., Liang, X., Li, Z., & Chang, X. (2022) Ds-net++: Dynamic weight slicing for efficient inference in cnns and vision transformers. In IEEE transactions on pattern analysis and machine intelligence
Li, F., Zeng, A., Liu, S., Zhang, H., Li, H., Zhang, L., & Ni, L.M. (2023) Lite detr: An interleaved multi-scale encoder for efficient detr. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 18558–18567
Lin, H., Bai, H., Liu, Z., Hou, L., Sun, M., Song, L., Wei, Y., & Sun, Z. (2024) Mope-clip: Structured pruning for efficient vision-language models with module-wise pruning error metric. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 27370–27380
Lin, S., Ji, R., Li, Y., Wu, Y., Huang, F., & Zhang, B. (2018) Accelerating convolutional networks via global & dynamic filter pruning. In: IJCAI, pp. 2425–2432
Lin, M., Ji, R., Wang, Y., Zhang, Y., Zhang, B., Tian, Y., & Shao, L. (2020) Hrank: Filter pruning using high-rank feature map. In IEEE conference on computer vision and pattern recognition
Lin, S., Ji, R., Yan, C., Zhang, B., Cao, L., Ye, Q., Huang, F., & Doermann, D. (2019) Towards optimal structured cnn pruning via generative adversarial learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2790–2799
Lin, M., Ji, R., Zhang, Y., Zhang, B., Wu, Y., & Tian, Y. (yyy) Channel pruning via automatic structure search
Lin, J., Mao, X., Chen, Y., Xu, L., He, Y., & Xue, H. (2022) D\(^2\)etr: Decoder-only detr with computationally efficient cross-scale attention. arXiv preprint arXiv:2203.00860
Liu, H., Li, C., Wu, Q., & Lee, Y.J. (2024) Visual instruction tuning. In Advances in neural information processing systems 36
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022
Liu, Z., Mu, H., Zhang, X., Guo, Z., Yang, X., Cheng, K.-T., & Sun, J. (2019) Metapruning: Meta learning for automatic neural network channel pruning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3296–3305
Liu, Z., Sun, M., Zhou, T., Huang, G., & Darrell, T. (2018) Rethinking the value of network pruning. In International conference on learning representations
Long, J., Shelhamer, E., & Darrell, T. (2015) Fully convolutional networks for semantic segmentation. In: Computer vision and pattern recognition (CVPR)
Lorenzo-Seva, U., & Ten Berge, J. M. (2006). Tucker’s congruence coefficient as a meaningful index of factor similarity. Methodology, 2(2), 57–64.
Louizos, C., Welling, M., & Kingma, D.P. (2018) Learning sparse neural networks through l_0 regularization. In International conference on learning representations
Luo, J.-H., Wu, J., & Lin, W. (2017) Thinet: A filter level pruning method for deep neural network compression. In International conference on computer vision (ICCV)
Luo, G., Zhou, Y., Ren, T., Chen, S., Sun, X., & Ji, R. (2024) Cheap and quick: Efficient vision-language instruction tuning for large language models. In Advances in neural information processing systems 36
Ma, W.-D.K., Lewis, J., & Kleijn, W.B. (2020) The hsic bottleneck: Deep learning without back-propagation. In Proceedings of the AAAI conference on artificial intelligence, vol. 34, pp. 5085–5092
Ma, X., Fang, G., & Wang, X. (2023). Llm-pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36, 21702–21720.
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018) Foundations of machine learning
Molchanov, P., Mallya, A., Tyree, S., Frosio, I., & Kautz, J. (2019) Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11264–11272
Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2017) Pruning convolutional neural networks for resource efficient inference. In 5th international conference on learning representations, ICLR
Mu, N., Kirillov, A., Wagner, D., & Xie, S. (2022) Slip: Self-supervision meets language-image pre-training. In European conference on computer vision, pp. 529–544. Springer
Pan, B., Panda, R., Jiang, Y., Wang, Z., Feris, R., & Oliva, A. (2021). Ia-red \(^2\): Interpretability-aware redundancy reduction for vision transformers. Advances in Neural Information Processing Systems, 34, 24898–24911.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017) Automatic differentiation in pytorch
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J. (2021) Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR
Ren, S., He, K., Girshick, R., & Sun, J. (2017) Faster r-cnn: Towards real-time object detection with region proposal networks. In IEEE transactions on pattern analysis and machine intelligence (TPAMI)
Ren, S., Wei, F., Zhang, Z., & Hu, H. (2023) Tinymim: An empirical study of distilling mim pre-trained models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3687–3697
Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.
Robert, P., & Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: The RV-coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics), 25(3), 257–265.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., & Bernstein, M., et al. (2015) Imagenet large scale visual recognition challenge. IJCV
Saikh, T., Ghosal, T., Mittal, A., Ekbal, A., & Bhattacharyya, P. (2022). Scienceqa: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries, 23(3), 289–301.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018) Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520
Savarese, P., Silva, H., & Maire, M. (2020). Winning the lottery with continuous sparsification. Advances in Neural Information Processing Systems, 33, 11380–11390.
Saxe, A.M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B.D., & Cox, D.D. (2019) On the information bottleneck theory of deep learning. In Journal of statistical mechanics: Theory and experiment
Shwartz-Ziv, R., & Tishby, N. (2017) Opening the black box of deep neural networks via information. CoRR
Shwartz-Ziv, R., Painsky, A., & Tishby, N. (2018) Representation compression and generalization in deep neural networks
Simonyan, K., & Zisserman, A. (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Song, H., Sun, D., Chun, S., Jampani, V., Han, D., Heo, B., Kim, W., & Yang, M.-H. (2022) An extendable, efficient and effective transformer-based object detector. arXiv preprint arXiv:2204.07962
Sun, M., Liu, Z., Bair, A., & Kolter, J.Z. (2023) A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695
Tan, M., & Le, Q. (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. PMLR
Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q.V. (2019) Mnasnet: Platform-aware neural architecture search for mobile. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2820–2828
Tan, M., Pang, R., & Le, Q.V. (2020) Efficientdet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10781–10790
Tanaka, H., Kunin, D., Yamins, D. L., & Ganguli, S. (2020). Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33, 6377–6389.
Tishby, N., Pereira, F.C., & Bialek, W. (1999) The information bottleneck method. In: Proc. of the 37-th annual allerton conference on communication, control and computing
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021) Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pp. 10347–10357. PMLR
Turner, J., Crowley, E.J., O’Boyle, M., Storkey, A., & Gray, G. (2019) Blockswap: Fisher-guided block substitution for network compression on a budget. arXiv preprint arXiv:1906.04113
Wang, J., Bai, H., Wu, J., Shi, X., Huang, J., King, I., Lyu, M., & Cheng, J. (2020) Revisiting parameter sharing for automatic neural channel number search. In Advances in neural information processing systems 33
Wang, C.-Y., Bochkovskiy, A., & Liao, H.-Y.M. (2023) Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7464–7475
Wang, J., Chen, Y., Chakraborty, R., & Yu, S.X. (2020) Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11505–11515
Wang, S., Gao, J., Li, Z., Zhang, X., & Hu, W. (2023) A closer look at self-supervised lightweight vision transformers. In International conference on machine learning, pp. 35624–35641. PMLR
Wang, Z., Huang, S.-L., Kuruoglu, E.E., Sun, J., Chen, X., & Zheng, Y. (2021) Pac-bayes information bottleneck. arXiv preprint arXiv:2109.14509
Wang, T., Yuan, L., Chen, Y., Feng, J., & Yan, S. (2021) Pnp-detr: Towards efficient visual analysis with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4661–4670
Wang, T., Zhou, W., Zeng, Y., & Zhang, X. (2022) Efficientvlm: Fast and accurate vision-language models via knowledge distillation and modal-adaptive pruning. arXiv preprint arXiv:2210.07795
Wen, W., Wu, C., Wang, Y., Chen, Y., & Li, H. (2016) Learning structured sparsity in deep neural networks. In Advances in neural information processing systems (NeurIPS)
Wortsman, M., Farhadi, A., & Rastegari, M. (2019) Discovering neural wirings. In Advances in neural information processing systems 32
Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., & Keutzer, K. (2019) Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wu, K., Peng, H., Zhou, Z., Xiao, B., Liu, M., Yuan, L., Xuan, H., Valenzuela, M., Chen, X.S., & Wang, X. (2023) Tinyclip: Clip distillation via affinity mimicking and weight inheritance. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 21970–21980
Yang, K., Deng, J., An, X., Li, J., Feng, Z., Guo, J., Yang, J., & Liu, T. (2023) Alip: Adaptive language-image pre-training with synthetic caption. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2922–2931
Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sandler, M., Sze, V., & Adam, H. (2018) Netadapt: Platform-aware neural network adaptation for mobile applications. In: Proceedings of the European conference on computer vision (ECCV), pp. 285–300
Yang, H., Yin, H., Molchanov, P., Li, H., & Kautz, J. (2021) Nvit: Vision transformer compression and parameter redistribution
Yang, Y., Zhuang, Y., & Pan, Y. (2021). Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 22(12), 1551–1558.
Yu, J., & Huang, T. (2019) Autoslim: Towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728
Yu, F., Huang, K., Wang, M., Cheng, Y., Chu, W., & Cui, L. (2022) Width & depth pruning for vision transformers. In AAAI conference on artificial intelligence (AAAI), vol. 2022
Yuan, X., Savarese, P.H.P., & Maire, M. (2020) Growing efficient deep networks by structured continuous sparsification. In International conference on learning representations
Zhang, G., Luo, Z., Tian, Z., Zhang, J., Zhang, X., & Lu, S. (2023) Towards efficient use of multi-scale features in transformer-based object detectors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6206–6216
Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., & Smola, A. (2023) Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923
Zhao, C., Ni, B., Zhang, J., Zhao, Q., Zhang, W., & Tian, Q. (2019) Variational convolutional neural network pruning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2780–2789
Zhou, A., Li, Y., Qin, Z., Liu, J., Pan, J., Zhang, R., Zhao, R., Gao, P., & Li, H. (2023) Sparsemae: Sparse training meets masked autoencoders. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 16176–16186
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., & Zou, Y. (2016) Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. In Computer vision and pattern recognition (CVPR)
Additional information
Communicated by Jianfei Cai.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Proof of Theorem 1
Theorem 6
Assume that the input random variable X follows a Markov random field structure and that the Markov random field is ergodic. For a network that has L hidden representations \(X_1, X_2,\ldots , X_L\), with probability \(1-\delta \), the generalization error \(\epsilon \) is bounded by
where n is the number of training examples.
Proof
Let \(\mathcal {D}_n = \{(x_1, y_1),\ldots ,(x_n, y_n)\}\) be an i.i.d. random sample from \(P_{X,Y}\). We define the loss function as a mapping \(l:\mathcal {Y}\times \mathcal {Y}\rightarrow \mathbb {R}^+\). The performance of a predictor \(h:\mathcal {X}\rightarrow \mathcal {Y}\) is measured by its expected loss under \(P_{X,Y}\), which is known as the generalization error and formally defined as \(R(h):=\mathbb {E}_{X,Y}\left[ l(Y, h(X))\right] .\) In practice, \(P_{X,Y}\) is usually unknown; we thus estimate \(\hat{h}\) based on \(\mathcal {D}_n\), and the corresponding empirical risk is defined as \(\hat{R}_n(h):=\frac{1}{n}\sum _{i=1}^nl(y_i, h(x_i)).\) For simplicity, we further take the loss to be \(l = I\left( h(X)\ne Y\right) \), where \(I(\cdot )\) is the indicator function. In this case, applying the Hoeffding inequality (Hoeffding, 1994), we have
Together with PAC learning theory (Mohri et al., 2018), we obtain the following corollary:
Corollary 1
Let \(\mathcal {H}\) be a finite set of hypotheses. Then for all \(\epsilon >0\), we have
We further control the probability in Corollary 1 with a confidence \(\delta \). That is, we ask
In other words, with a probability \(\delta \), we have
Meanwhile, we can conclude that \(\epsilon ^2\) is also bounded
with probability \(1-\delta \). To prove our theorem, we further introduce a lemma (Shwartz-Ziv et al., 2018) on the Asymptotic Equipartition Property (AEP) (Cover, 1999) as follows:
Lemma 1
Assume that X is a random variable that follows a Markov random field structure and that the Markov random field is ergodic. Then, we have that
Let \(X_i\) be a mapping of X. The size of the typical set \(|\mathcal {T}|\) is bounded by
Therefore, we conclude that for a mapping T, the squared error is bounded by
Considering that there are L mappings in the network, we conclude that the overall squared error is bounded by
with probability \(1-\delta \). \(\square \)
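The Hoeffding and union-bound steps of this proof follow a standard pattern, which the LaTeX sketch below spells out for the 0-1 loss defined in the proof. It reconstructs the textbook argument (Mohri et al., 2018) and is not necessarily the authors' exact statement; in particular, the constants and the use of the natural logarithm are assumptions.

```latex
% Hoeffding's inequality for a fixed predictor h and the 0-1 loss
\[
P\left( \left| R(h) - \hat{R}_n(h) \right| \ge \epsilon \right)
  \le 2\exp\left( -2n\epsilon^2 \right)
\]
% Union bound over a finite hypothesis set \mathcal{H}
\[
P\left( \exists\, h \in \mathcal{H} :
  \left| R(h) - \hat{R}_n(h) \right| \ge \epsilon \right)
  \le 2|\mathcal{H}|\exp\left( -2n\epsilon^2 \right)
\]
% Setting the right-hand side to \delta and solving for \epsilon^2
\[
2|\mathcal{H}|\exp\left( -2n\epsilon^2 \right) = \delta
\;\Longrightarrow\;
\epsilon^2 = \frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2n}
\]
```

Under these assumptions, the deviation between \(R(h)\) and \(\hat{R}_n(h)\) is at most \(\epsilon \) for every \(h \in \mathcal {H}\) with probability at least \(1-\delta \). As the text of the proof indicates, the typical-set argument of Lemma 1 then replaces the hypothesis-set term with a quantity governed by the mutual information between X and each \(X_i\); that step is not sketched here.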
The relationship between mutual information (MI) and accuracy on the validation set. The x-axis measures the nHSIC of the sampled feature, and the y-axis measures the accuracy of the randomly sampled architectures. The correlation between MI and accuracy is \(-0.955872\). In other words, nHSIC is negatively correlated with the fine-tuned accuracy and can thus be used as an accurate predictor in model compression.
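The nHSIC on the x-axis can be computed from two activation matrices. The following is a minimal sketch assuming a linear kernel, in which case the normalized HSIC coincides with linear CKA (Kornblith et al., 2019). The kernel choice, the function names, and the random stand-in activations are assumptions for illustration, not the authors' released code.

```python
import numpy as np


def centered_gram(features: np.ndarray) -> np.ndarray:
    """Double-centered linear-kernel Gram matrix of (n_samples, n_features) data."""
    n = features.shape[0]
    gram = features @ features.T
    centering = np.eye(n) - np.full((n, n), 1.0 / n)
    return centering @ gram @ centering


def nhsic(x: np.ndarray, y: np.ndarray) -> float:
    """Normalized HSIC between two sets of activations for the same n samples."""
    kx, ky = centered_gram(x), centered_gram(y)
    # trace(kx @ ky) equals the elementwise sum because both matrices are symmetric.
    hsic_xy = np.sum(kx * ky)
    hsic_xx = np.sum(kx * kx)
    hsic_yy = np.sum(ky * ky)
    # The common 1 / (n - 1)^2 factor of the HSIC estimator cancels in the ratio.
    return float(hsic_xy / np.sqrt(hsic_xx * hsic_yy))


rng = np.random.default_rng(0)
layer_a = rng.normal(size=(64, 16))            # stand-in activations of one layer
layer_b = layer_a @ rng.normal(size=(16, 8))   # a linear function of layer_a
layer_c = rng.normal(size=(64, 8))             # drawn independently of layer_a

print(f"nHSIC(a, a) = {nhsic(layer_a, layer_a):.3f}")
print(f"nHSIC(a, b) = {nhsic(layer_a, layer_b):.3f}")
print(f"nHSIC(a, c) = {nhsic(layer_a, layer_c):.3f}")
```

With this kernel the score lies in [0, 1] and equals 1 when a representation is compared with itself; lower values indicate weaker dependence between the two sets of activations.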
Appendix B The Debate on the Information Bottleneck
ITPruner is based on the information bottleneck theory of deep learning (Shwartz-Ziv & Tishby, 2017), which is challenged by Saxe et al. (2019). Saxe et al. (2019) argue that the representation compression phase empirically demonstrated in Shwartz-Ziv and Tishby (2017) appears only when double-sided saturating nonlinearities such as tanh and sigmoid are deployed. As a result, the claimed causality between compression and generalization, which is the foundation of ITPruner, was also questioned.
To settle this debate, we first propose Theorem 1 to demonstrate mathematically that reducing the mutual information between the input and the hidden representations indeed leads to a smaller generalization error. Second, we provide empirical evidence in Fig. 8 showing that the performance on the validation set is highly correlated with the layer-wise mutual information. Finally, the critical argument raised in Saxe et al. (2019) may be caused by an inaccurate estimation of mutual information. Recent work (Wang et al., 2021) proposes a new method to estimate the mutual information and empirically confirms the original experiments in Shwartz-Ziv and Tishby (2017). For example, Fig. 9 reports the change in mutual information (MI) between the input and different layers during training. As we can see, there is a clear boundary between the initial fitting phase (increase of mutual information) and the compression phase (decrease of mutual information). Overall, we conclude that the generalized information bottleneck principle behind ITPruner is sound both theoretically and empirically.
Appendix C The Debate on the Invertable-ResNet
Gomez et al. (2017) and Jacobsen et al. (2018) propose reversible networks, which are capable of constructing hidden representations that encode the input. Here we argue that Invertable-ResNet (Gomez et al., 2017) is not an experimental counterexample to our Theorem 6.
Specifically, Invertable-ResNet (Gomez et al., 2017) actually needs to optimize the layer-wise mutual information mentioned in Theorem 5 for better generalization. First, in their released source code (Footnote 5), weight decay is necessary for better generalization. We also conduct an ablation study to validate the effectiveness of weight decay. As shown in Tab. 12, both ResNet-20 and Invertable-ResNet show a significant performance improvement when the regularization term is adopted in training. Meanwhile, as theoretically demonstrated in Theorem 5, the adopted weight decay regularization is highly correlated with mutual information. Moreover, directly replacing weight decay with mutual information still yields a performance improvement, as reported in Tab. 12. We thus conclude that optimizing mutual information in Invertable-ResNet (Gomez et al., 2017) is necessary.
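The link between weight decay and mutual information claimed above can be probed numerically in a toy linear-Gaussian model. The sketch below is an illustration under added assumptions, not the authors' experiment: X is standard Gaussian, Y is \(W^TX\) plus independent Gaussian noise (added here so that \(I(X;Y)\) is finite), and `gaussian_mi` is a hypothetical helper name. In this setting \(I(X;Y)=\frac{1}{2}\ln \det (\varvec{I} + W^TW/\sigma ^2)\), so shrinking W lowers the mutual information.

```python
import numpy as np


def gaussian_mi(weight: np.ndarray, noise_std: float = 1.0) -> float:
    """Mutual information I(X;Y) in nats for X ~ N(0, I), Y = W^T X + noise.

    The noise is N(0, noise_std^2 I) and independent of X, an assumption
    made here so that the joint covariance is nonsingular.
    """
    d_out = weight.shape[1]
    # Cov(Y) divided by the noise variance: I + W^T W / noise_std^2
    ratio = np.eye(d_out) + weight.T @ weight / noise_std**2
    _, logdet = np.linalg.slogdet(ratio)
    return 0.5 * float(logdet)


rng = np.random.default_rng(0)
base_weight = rng.normal(size=(6, 4))  # hypothetical 6 -> 4 linear layer

# Shrinking the weights, which weight decay encourages, lowers I(X;Y).
scales = (1.0, 0.5, 0.1, 0.0)
results = []
for scale in scales:
    weight = scale * base_weight
    frob_sq = float(np.sum(weight**2))
    mi = gaussian_mi(weight)
    results.append((frob_sq, mi))
    print(f"||W||_F^2 = {frob_sq:8.3f}   I(X;Y) = {mi:.4f} nats")
```

The monotone decrease holds for any scaling of a fixed W in this toy model; it says nothing by itself about trained networks, where the activations are neither Gaussian nor linear.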
Two types of hidden representations exist simultaneously in Invertable-ResNet (Gomez et al., 2017). One is highly related to the input X and uncorrelated with the output Y. The other is the opposite: it is related to Y and uncorrelated with X. To validate our argument, we conduct a new experiment that extracts features from Invertable-ResNet and reports their mutual information with the input and the output. As illustrated in Fig. 10, most of the features have a certain degree of correlation with both X and Y. However, the two types of features described above do exist in Invertable-ResNet; they are marked as darker red dots in Fig. 10.
Theorem 5
Assume that \(X \sim \mathcal {N} (\varvec{0}, \varvec{\Sigma }_{X})\) and \(Y \sim \mathcal {N} (\varvec{0}, \varvec{\Sigma }_{Y})\) with \(Y=W^TX\). Then \(\min I(X;Y) \propto \min ||W||_F^2\).
Proof
The definition of mutual information is \(I(X;Y)=H(X)+H(Y)-H(X,Y)\). Meanwhile, the entropy of a D-dimensional multivariate Gaussian distribution is \(H(X)=\frac{1}{2}\ln ({(2\pi e)}^D|\Sigma _X|)\), and the joint distribution is \((X,Y) \sim \mathcal {N}\left( 0,\Sigma _{(X,Y)}\right) \), where
Therefore, the mutual information between Gaussian distributed random variables X and Y is represented as,
According to the Everitt inequality (Everitt, 1958), \(|\Sigma _{(X,Y)}| \le |\Sigma _X||\Sigma _Y|\), and equality holds if and only if \(\Sigma _{YX}=\Sigma _{XY}^T=X^TY\) is a zero matrix. In other words,
For simplicity, we take an orthogonalized \(\varvec{X}\). According to the unitary invariance of the Frobenius norm, we have
\(\square \)
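The steps of this proof rest on standard identities for jointly Gaussian variables, sketched below in LaTeX. The block form of \(\Sigma _{(X,Y)}\) is the textbook one (Cover, 1999). The sketch assumes a nonsingular joint covariance (for instance, Y observed with a small independent noise term), which is an assumption added here; \(D'\) for the dimension of Y is likewise notation introduced for the sketch.

```latex
% Joint covariance in block form (assumed nonsingular)
\[
\Sigma_{(X,Y)} =
\begin{pmatrix}
  \Sigma_X & \Sigma_{XY} \\
  \Sigma_{YX} & \Sigma_Y
\end{pmatrix}
\]
% The (2 \pi e) factors cancel, since D + D' - (D + D') = 0
\[
I(X;Y) = H(X) + H(Y) - H(X,Y)
       = \frac{1}{2}\ln\frac{|\Sigma_X|\,|\Sigma_Y|}{|\Sigma_{(X,Y)}|}
\]
% The determinant inequality gives non-negativity and the equality case
\[
|\Sigma_{(X,Y)}| \le |\Sigma_X|\,|\Sigma_Y|
\;\Longrightarrow\;
I(X;Y) \ge 0,
\quad \text{with equality iff } \Sigma_{XY} = 0
\]
% Cross-covariance for Y = W^T X (plus independent noise), whitened input
\[
\Sigma_{XY} = \mathbb{E}\left[ X Y^T \right] = \Sigma_X W,
\qquad
\Sigma_X = \mathbf{I}
\;\Longrightarrow\;
\|\Sigma_{XY}\|_F^2 = \|W\|_F^2
\]
```

Read this way, driving \(||W||_F^2\) toward zero drives the cross-covariance, and with it the mutual information, toward zero, which is the sense in which the theorem relates the two minimizations.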
Appendix D Ablation Study
We conduct an ablation study to evaluate our claim that selecting which weights in a layer to prune does not help as much as finding a better \(\alpha \). As shown in Tab. 13, ITPruner always shows a significant improvement over the original methods. Meanwhile, integrating different pruning methods yields very similar performance. We thus conclude that selecting which weights in a layer to prune does not help as much as finding a better \(\alpha \), which is also well verified in a previous survey (Blalock et al., 2020).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zheng, X., Ma, Y., Xi, T. et al. An Information Theory-Inspired Strategy for Automated Network Pruning. Int J Comput Vis 133, 5455–5482 (2025). https://doi.org/10.1007/s11263-025-02437-z
DOI: https://doi.org/10.1007/s11263-025-02437-z