Abstract
End-to-end (E2E) training has become the de facto standard for training modern deep networks, e.g., ConvNets and vision Transformers (ViTs). Typically, a global error signal is generated at the end of a model and back-propagated layer-by-layer to update the parameters. This paper shows that the reliance on back-propagating global errors may not be necessary for deep learning. More precisely, deep networks with competitive or even better performance can be obtained by purely leveraging locally supervised learning, i.e., splitting a network into gradient-isolated modules and training them with local supervision signals. However, such an extension is non-trivial. Our experimental and theoretical analysis demonstrates that simply training local modules with an E2E objective tends to be short-sighted, collapsing task-relevant information at early layers and hurting the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible while progressively discarding task-irrelevant information. As the InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. We evaluate InfoPro extensively with ConvNets and ViTs on twelve computer vision benchmarks spanning five tasks (i.e., image/video recognition, semantic/instance segmentation, and object detection). InfoPro exhibits superior efficiency over E2E training in terms of GPU memory footprint, convergence speed, and training data scale. Moreover, InfoPro enables the effective training of more parameter- and computation-efficient models (e.g., much deeper networks), which suffer from inferior performance when trained end-to-end. Code: https://github.com/blackfeather-wang/InfoPro-Pytorch.
Data Availability
As introduced in Sect. 5, the data that support the findings of this study are openly available.
References
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR, pp. 770–778.
Huang, G., Liu, Z., Pleiss, G., Van Der Maaten, L., & Weinberger, K. (2019). Convolutional networks with dense connectivity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 8704–8716.
Han, K., Wang, Y., Xu, C., Guo, J., Xu, C., Wu, E., & Tian, Q. (2022). Ghostnets on heterogeneous devices via cheap operations. International Journal of Computer Vision, 130(4), 1050–1069.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In ICML, pp. 10347–10357.
Zhang, Q., Xu, Y., Zhang, J., & Tao, D. (2023). Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond. International Journal of Computer Vision, 131(5), 1141–1162.
Lin, M., Chen, M., Zhang, Y., Shen, C., Ji, R., & Cao, L. (2023). Super vision transformer. International Journal of Computer Vision, 131(12), 3136–3151.
Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., & Kavukcuoglu, K. (2017). Decoupled neural interfaces using synthetic gradients. In ICML, pp. 1627–1635.
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., & Jégou, H. (2021). Going deeper with image transformers. In ICCV, pp. 32–42.
Ni, Z., Wang, Y., Yu, J., Jiang, H., Cao, Y., & Huang, G. (2023). Deep incubation: Training large models by divide-and-conquering. In ICCV, pp. 17335–17345.
Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2022). Scaling vision transformers. In CVPR, pp. 12104–12113.
Crick, F. (1989). The recent excitement about neural networks. Nature, 337(6203), 129–132.
Marblestone, A. H., Wayne, G., & Kording, K. P. (2016). Toward an integration of deep learning and neuroscience. Frontiers in Computational Neuroscience, 10, 215943.
Löwe, S., O’Connor, P., & Veeling, B. (2019). Putting an end to end-to-end: Gradient-isolated learning of representations. In NeurIPS, pp. 3039–3051.
Dan, Y., & Poo, M. (2004). Spike timing-dependent plasticity of neural circuits. Neuron, 44(1), 23–30.
Caporale, N., & Dan, Y. (2008). Spike timing-dependent plasticity: A hebbian learning rule. Annual Review of Neuroscience, 31, 25–46.
Bengio, Y., Lee, D.H., Bornschein, J., Mesnard, T., & Lin, Z. (2015) Towards biologically plausible deep learning. arXiv:1502.04156.
Mosca, A., & Magoulas, G.D. (2017) Deep incremental boosting. arXiv:1708.03704.
Mostafa, H., Ramesh, V., & Cauwenberghs, G. (2018). Deep supervised learning using local errors. Frontiers in Neuroscience, 12, 608.
Huang, F., Ash, J., Langford, J., & Schapire, R. (2018). Learning deep resnet blocks sequentially using boosting theory. In ICML, pp. 2058–2067.
Belilovsky, E., Eickenberg, M., & Oyallon, E. (2019). Greedy layerwise learning can scale to imagenet. In ICML, pp. 583–593.
Belilovsky, E., Eickenberg, M., & Oyallon, E. (2020). Decoupled greedy learning of cnns. In ICML, pp. 736–745.
Nøkland, A., & Eidnes, L.H. (2019). Training neural networks with local error signals. In ICML, pp. 4839–4850.
Wang, Y., Ni, Z., Song, S., Yang, L., & Huang, G. (2021). Revisiting locally supervised learning: an alternative to end-to-end training. In ICLR.
Carreira, J. & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pp. 6299–6308.
Lin, J., Gan, C., & Han, S. (2019). Tsm: Temporal shift module for efficient video understanding. In ICCV, pp. 7083–7093.
Tan, M., & Le, Q. V. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pp. 6105–6114.
Ye, P., Tang, S., Li, B., Chen, T., & Ouyang, W. (2022). Stimulative training of residual networks: A social psychology perspective of loafing. In NeurIPS, 35, 3596–3608.
Ye, P., He, T., Tang, S., Li, B., Chen, T., Bai, L., & Ouyang, W. (2023) Stimulative training++: Go beyond the performance limits of residual networks. arXiv:2305.02507.
Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NeurIPS, pp. 153–160.
Ioffe, S. & Szegedy, C. (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
Kulkarni, M. & Karande, S. (2017) Layer-wise training of deep networks using kernel similarity. arXiv:1703.07115.
Malach, E. & Shalev-Shwartz, S. (2018) A provably correct algorithm for deep learning that actually works. arXiv:1803.09522 .
Marquez, E. S., Hare, J. S., & Niranjan, M. (2018). Deep cascade learning. IEEE Transactions on Neural Networks and Learning Systems, 29(11), 5475–5485.
Fahlman, S.E. & Lebiere, C. (1990). The cascade-correlation learning architecture. In NeurIPS, pp. 524–532.
Xiong, Y., Ren, M., & Urtasun, R. (2020). Loco: Local contrastive representation learning. In NeurIPS, pp. 11142–11153.
Laskin, M., Metz, L., Nabarro, S., Saroufim, M., Noune, B., Luschi, C., Sohl-Dickstein, J., & Abbeel, P. (2020) Parallel training of deep networks with local updates. arXiv:2012.03837.
Gomez, A. N., Key, O., Perlin, K., Gou, S., Frosst, N., Dean, J., & Gal, Y. (2022). Interlocking backpropagation: Improving depthwise model-parallelism. Journal of Machine Learning Research, 23(1), 7714–7741.
Chen, T., Xu, B., Zhang, C., & Guestrin, C. (2016) Training deep nets with sublinear memory cost. arXiv:1604.06174.
Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., & Graves, A. (2016). Memory-efficient backpropagation through time. NeurIPS, 29, 4125–4133.
Gomez, A.N., Ren, M., Urtasun, R., & Grosse, R.B. (2017). The reversible residual network: Backpropagation without storing activations. In NeurIPS, pp. 2214–2224.
Salimans, T. & Bulatov, Y. (2017) Gradient checkpointing.
Jacobsen, J.-H., Smeulders, A., & Oyallon, E. (2018). i-Revnet: Deep invertible networks. arXiv:1802.07088.
Lee, D.-H., Zhang, S., Fischer, A., & Bengio, Y. (2015). Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 498–515.
Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G.E., & Lillicrap, T. (2018). Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In NeurIPS, pp. 9368–9378.
Lillicrap, T. P., Cownden, D., Tweed, D. B., & Akerman, C. J. (2014). Random feedback weights support learning in deep neural networks. arXiv:1411.0247.
Nøkland, A. (2016). Direct feedback alignment provides learning in deep neural networks. In NeurIPS, pp. 1037–1045.
Taylor, G., Burmeister, R., Zheng, X., Singh, B., Patel, A., & Goldstein, T. (2016). Training neural networks without gradients: A scalable admm approach. In ICML, pp. 2722–2731.
Choromanska, A., Luss, R., Rish, I., Kingsbury, B., Tejwani, R., & Bouneffouf, D. (2018). Beyond backprop: Alternating minimization with co-activation memory. Statistics, 1050, 24.
Huo, Z., Gu, B., Yang, Q., & Huang, H. (2018). Decoupled parallel backpropagation with convergence guarantee. In ICML, pp. 2098–2106.
Huo, Z., Gu, B., & Huang, H. (2018). Training neural networks using features replay. In NeurIPS, pp. 6659–6668.
Shwartz-Ziv, R., & Tishby, N. (2017) Opening the black box of deep neural networks via information. arXiv:1703.00810.
Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., & Cox, D. D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12), 124020.
Tishby, N., Pereira, F.C., & Bialek, W. (2000) The information bottleneck method. arXiv:physics/0004057.
Achille, A., & Soatto, S. (2018). Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(1), 1947–1980.
Alemi, A.A., Fischer, I., Dillon, J.V., & Murphy, K. (2016) Deep variational information bottleneck. arXiv:1612.00410.
van den Oord, A., Li, Y, & Vinyals, O. (2018) Representation learning with contrastive predictive coding. arXiv:1807.03748.
Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., & Isola, P. (2020). What makes for good views for contrastive learning. arXiv:2005.10243.
Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., & Bengio, Y. (2019) Learning deep representations by mutual information estimation and maximization. In ICLR.
Li, Y., Yang, M., Peng, D., Li, T., Huang, J., & Peng, X. (2022). Twin contrastive learning for online clustering. International Journal of Computer Vision, 130(9), 2205–2221.
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9729–9738.
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Coates, A., Ng, A., & Lee, H. (2011). An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pp. 215–223.
Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., & Tu, Z. (2015). Deeply-supervised nets. In AISTATS, pp. 562–570.
Tschannen, M., Djolonga, J., Rubenstein, P.K., Gelly, S., Lucic, M. (2020) On mutual information maximization for representation learning. In ICLR.
Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., & Hjelm, R. D. (2018). Mutual information neural estimation. In ICML, pp. 531–540.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103.
Rifai, S., Bengio, Y., Courville, A., Vincent, P., & Mirza, M. (2012). Disentangling factors of variation for facial expression recognition. In ECCV, pp. 808–822.
Kingma, D.P., & Welling, M. (2013) Auto-encoding variational bayes. arXiv:1312.6114.
Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2015) Adversarial autoencoders. arXiv:1511.05644.
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., & Krishnan, D. (2020). Supervised contrastive learning. In NeurIPS , pp. 18661–18673.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR, pp. 248–255.
Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223.
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In CVPR, pp. 633–641.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). Hmdb: a large video database for human motion recognition. In ICCV, pp. 2556–2563.
Soomro, K., Zamir, A.R., Shah, M. (2012) Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402.
Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al. (2017). The “something something” video database for learning and evaluating visual common sense. In ICCV, pp. 5842–5850.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., & Dollár, P. (2014). Microsoft coco: Common objects in context. In ECCV, pp. 740–755.
Wang, Y., Pan, X., Song, S., Zhang, H., Huang, G., & Wu, C. (2019). Implicit semantic data augmentation for deep networks. In NeurIPS, pp. 12635–12644.
Tarvainen, A., & Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, pp. 1195–1204.
Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., & Malik, J. (2011). Semantic contours from inverse detectors. In ICCV, pp. 991–998.
Chen, L.C., Papandreou, G., Schroff, F., & Adam, H. (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587.
MMSegmentation Contributors. (2020) Mmsegmentation, an open source semantic segmentation toolbox. https://github.com/open-mmlab/mmsegmentation.
Wang, Y., Chen, Z., Jiang, H., Song, S., Han, Y., & Huang, G. (2021). Adaptive focus for efficient video recognition. In ICCV, pp. 16249–16258.
Wang, Y., Yue, Y., Lin, Y., Jiang, H., Lai, Z., Kulikov, V., Orlov, N., Shi, H., & Huang, G. (2022a). Adafocus v2: End-to-end training of spatial dynamic networks for video recognition. In CVPR, pp. 20030–20040.
Wang, Y., Yue, Y., Xu, X., Hassani, A., Kulikov, V., Orlov, N., Song, S., Shi, H., & Huang, G. (2022b). Adafocusv3: On unified spatial-temporal dynamic video recognition. In ECCV, pp. 226–243.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., & Lin, D. (2019). MMDetection: Open mmlab detection toolbox and benchmark. arXiv:1906.07155.
Jacobsen, J.-H., Smeulders, A. W. M., & Oyallon, E. (2018). i-Revnet: Deep invertible networks. In ICLR.
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2018). Autoaugment: Learning augmentation policies from data. arXiv:1805.09501.
Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2020). Randaugment: Practical automated data augmentation with a reduced search space. In CVPRW, pp. 702–703.
Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang, Y. (2020). Random erasing data augmentation. In AAAI, pp. 13001–13008.
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv:1710.09412.
Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, pp. 6023–6032.
Huang, G., Sun, Y., Liu, Z., Sedra, D., & Weinberger, K. Q. (2016). Deep networks with stochastic depth. In ECCV, pp. 646–661.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99.
Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., & Qiao, Y. (2023). Vision transformer adapter for dense predictions. In ICLR.
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In CVPR, pp. 2881–2890.
Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., & Lu, H. (2019). Dual attention network for scene segmentation. In CVPR, pp. 3146–3154.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In ICCV, pp. 2961–2969.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In CVPR, pp. 2117–2125.
Veit, A., Wilber, M. J., & Belongie, S. (2016). Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, pp. 550–558.
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv:2303.08774.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520.
Elfwing, S., Uchibe, E., & Doya, K. (2018). Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107, 3–11.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
Acknowledgements
This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0140407, and in part by the National Natural Science Foundation of China under Grant 42327901.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Jifeng Dai.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix: Architecture of Auxiliary Networks
In this section, we introduce the network architectures of \(\varvec{w}\), \(\varvec{\psi }\) and \(\varvec{\phi }\) used in our experiments. Note that \(\varvec{w}\) is a decoder that aims to reconstruct the input images from deep features, while \(\varvec{\psi }\) and \(\varvec{\phi }\) share the same architecture except for the last layer. The architectures used on CIFAR, SVHN and STL-10 are shown in Tables 18 and 19. The architectures on ImageNet for ResNet/ResNeXt and EfficientNet are shown in Tables 20, 21 and Tables 22, 23, respectively. We apply the same mobile inverted bottleneck convolutional layers (MobileConv) (Sandler et al., 2018; Tan and Le, 2019) and Swish activations (Elfwing et al., 2018; Tan and Le, 2019) as EfficientNet to its corresponding auxiliary networks. For DeiT-S, we simply use an MLP as \(\varvec{w}\) to reconstruct the model inputs, while \(\varvec{\psi }\) consists of a \(3\times 3\) convolutional layer with stride=2, followed by 3 standard Transformer layers. The architectures of \(\varvec{w}\) and \(\varvec{\psi }\) for semantic segmentation are shown in Tables 24 and 25, where the same networks are used with varying input sizes. Herein, “s”, “p” and “oc” refer to stride, padding and output channels, respectively. “conv.” refers to a single convolutional layer.
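For concreteness, the following is a minimal PyTorch sketch of the two DeiT-S auxiliary networks described above: an MLP decoder \(\varvec{w}\) that reconstructs input patches from token features, and \(\varvec{\psi }\), a stride-2 \(3\times 3\) convolution followed by 3 standard Transformer layers. The hidden sizes, head count, patch size and token-to-grid reshaping below are illustrative assumptions rather than the exact configuration in our tables.

import torch.nn as nn


class AuxDecoderW(nn.Module):
    # MLP decoder w: maps each patch token back to a flattened image patch.
    def __init__(self, embed_dim=384, patch_dim=16 * 16 * 3, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, patch_dim),
        )

    def forward(self, tokens):                     # tokens: (B, N, C) patch tokens
        return self.mlp(tokens)                    # (B, N, patch_dim)


class AuxHeadPsi(nn.Module):
    # psi: a stride-2 3x3 convolution followed by 3 standard Transformer layers.
    def __init__(self, embed_dim=384, num_classes=1000, grid_size=14, num_heads=6):
        super().__init__()
        self.grid_size = grid_size
        self.down = nn.Conv2d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):                     # tokens: (B, N, C), N = grid_size^2
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, self.grid_size, self.grid_size)
        x = self.down(x)                           # (B, C, grid_size/2, grid_size/2)
        x = x.flatten(2).transpose(1, 2)           # back to a token sequence
        x = self.blocks(x)
        return self.head(x.mean(dim=1))            # classify from mean-pooled tokens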
In particular, in all of our experiments, DGL (Belilovsky et al., 2020) uses exactly the same auxiliary network architectures as ours.
Appendix: Details of Mutual Information Estimation
In this section, we describe how we obtain the estimates of \(I(\varvec{h}, \varvec{x})\) and \(I(\varvec{h}, y)\) presented in Figs. 2, 3 and 5.
As we have discussed in Sect. 4.3, the expected reconstruction error \(\mathcal {R}(\varvec{x}|\varvec{h})\) satisfies \(I(\varvec{h}, \varvec{x})\!=\!H(\varvec{x})\!-\!H(\varvec{x}|\varvec{h})\!\ge \!H(\varvec{x})\!-\!\mathcal {R}(\varvec{x}|\varvec{h})\) (Vincent et al., 2008; Rifai et al., 2012; Kingma and Welling, 2013; Makhzani et al., 2015; Hjelm et al., 2019). Therefore, similar to Sect. 4.3, we estimate \(I(\varvec{h}, \varvec{x})\) by training a decoder parameterized by \(\varvec{w}\) to minimize the reconstruction loss, namely \(I(\varvec{h}, \varvec{x}) \approx \mathop {\max }_{\varvec{w}}[H(\varvec{x}) - \mathcal {R}_{\varvec{w}}(\varvec{x}|\varvec{h})]\). Note that, ideally, this bound can be arbitrarily tight provided that \(\varvec{w}\) has sufficient capacity. Specifically, we use the same network architecture as in Table 18, and train it for 10 epochs to minimize the average per-pixel binary cross-entropy reconstruction loss. An Adam (Kingma and Ba, 2014) optimizer with default hyper-parameters (lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0) is adopted. Naive as this procedure might seem, we find that it is sufficient to reconstruct the input images well given enough information, and that it distinguishes different values of \(I(\varvec{h}, \varvec{x})\) via the quality of the reconstructed images. Moreover, we are primarily concerned with comparing \(I(\varvec{h}, \varvec{x})\) between end-to-end training and various cases of greedy supervised learning rather than obtaining its exact values. The same training process is applied to all the intermediate features \(\varvec{h}\), and hence the comparisons are fair. Finally, since \(H(\varvec{x})\) is a constant, for ease of understanding, we simply report \(1-\text {AverageBinaryCrossEntropyLoss}(\varvec{x}|\varvec{h})\) as the estimate of \(I(\varvec{h}, \varvec{x})\), which is equivalent to adding the constant \(1 - H(\varvec{x})\) to the real estimate.
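For reference, a minimal PyTorch sketch of this estimation procedure is given below. The decoder passed in is assumed to follow Table 18 and to end with a sigmoid, and the inputs \(\varvec{x}\) are assumed to be raw pixel intensities in [0, 1]; the function and argument names are illustrative.

import torch
import torch.nn.functional as F


def estimate_mutual_info_hx(encoder, decoder, loader, device="cuda", epochs=10):
    # Train a decoder on frozen features h = encoder(x) for 10 epochs with a
    # per-pixel binary cross-entropy loss and an Adam optimizer with default
    # hyper-parameters, then report 1 - average BCE, i.e. the I(h, x) estimate
    # up to the constant 1 - H(x).
    encoder.eval()
    optimizer = torch.optim.Adam(decoder.parameters())   # lr=1e-3, betas=(0.9, 0.999)
    for _ in range(epochs):
        for x, _ in loader:
            x = x.to(device)                              # x in [0, 1]
            with torch.no_grad():
                h = encoder(x)                            # features are not updated
            loss = F.binary_cross_entropy(decoder(h), x)  # averaged over all pixels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Final average reconstruction loss over the dataset.
    total, count = 0.0, 0
    with torch.no_grad():
        for x, _ in loader:
            x = x.to(device)
            total += F.binary_cross_entropy(decoder(encoder(x)), x).item() * x.size(0)
            count += x.size(0)
    return 1.0 - total / count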
For \(I(\varvec{h}, y)\), as also discussed in Sect. 4.3, since \(I(\varvec{h}, y) = H(y) - H(y|\varvec{h}) = H(y) - \mathbb {E}_{(\varvec{h}, y)}[-\text {log}\ p(y|\varvec{h})]\), we train an auxiliary classifier \(q_{\varvec{\psi }}(y|\varvec{h})\) with parameters \(\varvec{\psi }\) to approximate \(p(y|\varvec{h})\), such that \(I(\varvec{h}, y)\!\approx \!\mathop {\max }_{\varvec{\psi }} \{H(y)-\frac{1}{N}[\sum _{i=1}^N\!-\text {log}\ q_{\varvec{\psi }}(y_i|\varvec{h}_i)]\}\). Here we simply adopt the test accuracy of \(q_{\varvec{\psi }}(y|\varvec{h})\) as the estimate of \(I(\varvec{h}, y)\), which is highly correlated with \(-\frac{1}{N}[\sum _{i=1}^N\!-\text {log}\ q_{\varvec{\psi }}(y_i|\varvec{h}_i)]\) (i.e., the negative cross-entropy loss). This can be viewed as the highest generalization performance that a classifier based on \(\varvec{h}\) is able to reach. Notably, we use a ResNet-32 as \(q_{\varvec{\psi }}\). For the inputs of \(q_{\varvec{\psi }}\), we up-sample \(\varvec{h}\) to \(32\times 32\) and map \(\varvec{h}\) to 16 channels at the first layer. All the training hyper-parameters of the ResNet-32 are the same as in Sect. 5.2.
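A minimal PyTorch sketch of this classifier-based estimate is shown below. The backbone is injected as an argument (a ResNet-32 in our experiments), and the wrapper and function names are illustrative; \(q_{\varvec{\psi }}\) is first trained on \((\varvec{h}, y)\) pairs with the hyper-parameters of Sect. 5.2, after which its test accuracy serves as the estimate.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureClassifier(nn.Module):
    # q_psi: up-sample h to 32x32, map it to 16 channels at the first layer,
    # then apply a CIFAR-style backbone (a ResNet-32 in our experiments).
    def __init__(self, in_channels, backbone):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, 16, kernel_size=1)
        self.backbone = backbone                   # expects 16-channel 32x32 inputs

    def forward(self, h):
        h = F.interpolate(h, size=(32, 32), mode="bilinear", align_corners=False)
        return self.backbone(self.proj(h))


@torch.no_grad()
def estimate_mutual_info_hy(encoder, classifier, test_loader, device="cuda"):
    # Test accuracy of q_psi(y | h), used as a proxy for max_psi [H(y) - CE].
    encoder.eval()
    classifier.eval()
    correct, total = 0, 0
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        logits = classifier(encoder(x))
        correct += (logits.argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total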
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, Y., Ni, Z., Pu, Y. et al. InfoPro: Locally Supervised Deep Learning by Maximizing Information Propagation. Int J Comput Vis 133, 2752–2782 (2025). https://doi.org/10.1007/s11263-024-02296-0