Abstract
End-to-end (E2E) training has become the de facto standard for training modern deep networks, e.g., ConvNets and vision Transformers (ViTs). Typically, a global error signal is generated at the end of a model and back-propagated layer-by-layer to update the parameters. This paper shows that the reliance on back-propagating global errors may not be necessary for deep learning. More precisely, deep networks with competitive or even better performance can be obtained by purely leveraging locally supervised learning, i.e., splitting a network into gradient-isolated modules and training them with local supervision signals. However, such an extension is non-trivial. Our experimental and theoretical analysis demonstrates that simply training local modules with an E2E objective tends to be short-sighted, collapsing task-relevant information at early layers and hurting the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible while progressively discarding task-irrelevant information. As the InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. We evaluate InfoPro extensively with ConvNets and ViTs on twelve computer vision benchmarks spanning five tasks (i.e., image/video recognition, semantic/instance segmentation, and object detection). InfoPro exhibits superior efficiency over E2E training in terms of GPU memory footprint, convergence speed, and training data scale. Moreover, InfoPro enables the effective training of more parameter- and computation-efficient models (e.g., much deeper networks), which suffer from inferior performance when trained end-to-end. Code: https://github.com/blackfeather-wang/InfoPro-Pytorch.
Data Availability
As introduced in Sect. 5, the data that support the findings of this study are openly available.
References
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR, pp. 770–778.
Huang, G., Liu, Z., Pleiss, G., Van Der Maaten, L., & Weinberger, K. (2019). Convolutional networks with dense connectivity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 8704–8716.
Han, K., Wang, Y., Xu, C., Guo, J., Xu, C., Wu, E., & Tian, Q. (2022). Ghostnets on heterogeneous devices via cheap operations. International Journal of Computer Vision, 130(4), 1050–1069.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In ICML, pp. 10347–10357.
Zhang, Q., Xu, Y., Zhang, J., & Tao, D. (2023). Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond. International Journal of Computer Vision, 131(5), 1141–1162.
Lin, M., Chen, M., Zhang, Y., Shen, C., Ji, R., & Cao, L. (2023). Super vision transformer. International Journal of Computer Vision, 131(12), 3136–3151.
Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., & Kavukcuoglu, K. (2017). Decoupled neural interfaces using synthetic gradients. In ICML, pp. 1627–1635.
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., & Jégou, H. (2021). Going deeper with image transformers. In ICCV, pp. 32–42.
Ni, Z., Wang, Y., Yu, J., Jiang, H., Cao, Y., & Huang, G. (2023). Deep incubation: Training large models by divide-and-conquering. In ICCV, pp. 17335–17345.
Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2022). Scaling vision transformers. In CVPR, pp. 12104–12113.
Crick, F. (1989). The recent excitement about neural networks. Nature, 337(6203), 129–132.
Marblestone, A. H., Wayne, G., & Kording, K. P. (2016). Toward an integration of deep learning and neuroscience. Frontiers in Computational Neuroscience, 10, 215943.
Löwe, S., O’Connor, P., & Veeling, B. (2019). Putting an end to end-to-end: Gradient-isolated learning of representations. In NeurIPS, pp. 3039–3051.
Dan, Y., & Poo, M. (2004). Spike timing-dependent plasticity of neural circuits. Neuron, 44(1), 23–30.
Caporale, N., & Dan, Y. (2008). Spike timing-dependent plasticity: A hebbian learning rule. Annual Review of Neuroscience, 31, 25–46.
Bengio, Y., Lee, D.H., Bornschein, J., Mesnard, T., & Lin, Z. (2015) Towards biologically plausible deep learning. arXiv:1502.04156.
Mosca, A., & Magoulas, G.D. (2017) Deep incremental boosting. arXiv:1708.03704.
Mostafa, H., Ramesh, V., & Cauwenberghs, G. (2018). Deep supervised learning using local errors. Frontiers in Neuroscience, 12, 608.
Huang, F., Ash, J., Langford, J., & Schapire, R. (2018). Learning deep resnet blocks sequentially using boosting theory. In ICML, pp. 2058–2067.
Belilovsky, E., Eickenberg, M., & Oyallon, E. (2019). Greedy layerwise learning can scale to imagenet. In ICML, pp. 583–593.
Belilovsky, E., Eickenberg, M., & Oyallon, E. (2020). Decoupled greedy learning of cnns. In ICML, pp. 736–745.
Nøkland, A., & Eidnes, L.H. (2019). Training neural networks with local error signals. In ICML, pp. 4839–4850.
Wang, Y., Ni, Z., Song, S., Yang, L., & Huang, G. (2021). Revisiting locally supervised learning: an alternative to end-to-end training. In ICLR.
Carreira, J. & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pp. 6299–6308.
Lin, J., Gan, C., & Han, S. (2019). Tsm: Temporal shift module for efficient video understanding. In ICCV, pp. 7083–7093.
Tan, M., & Le, Q. V. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pp. 6105–6114.
Ye, P., Tang, S., Li, B., Chen, T., & Ouyang, W. (2022). Stimulative training of residual networks: A social psychology perspective of loafing. In NeurIPS, 35, 3596–3608.
Ye, P., He, T., Tang, S., Li, B., Chen, T., Bai, L., & Ouyang, W. (2023) Stimulative training++: Go beyond the performance limits of residual networks. arXiv:2305.02507.
Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NeurIPS, pp. 153–160.
Ioffe, S. & Szegedy, C. (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
Kulkarni, M. & Karande, S. (2017) Layer-wise training of deep networks using kernel similarity. arXiv:1703.07115.
Malach, E. & Shalev-Shwartz, S. (2018) A provably correct algorithm for deep learning that actually works. arXiv:1803.09522 .
Marquez, E. S., Hare, J. S., & Niranjan, M. (2018). Deep cascade learning. IEEE Transactions on Neural Networks and Learning Systems, 29(11), 5475–5485.
Fahlman, S.E. & Lebiere, C. (1990). The cascade-correlation learning architecture. In NeurIPS, pp. 524–532.
Xiong, Y., Ren, M., & Urtasun, R. (2020). Loco: Local contrastive representation learning. In NeurIPS, pp. 11142–11153.
Laskin, M., Metz, L., Nabarro, S., Saroufim, M., Noune, B., Luschi, C., Sohl-Dickstein, J., & Abbeel, P. (2020) Parallel training of deep networks with local updates. arXiv:2012.03837.
Gomez, A. N., Key, O., Perlin, K., Gou, S., Frosst, N., Dean, J., & Gal, Y. (2022). Interlocking backpropagation: Improving depthwise model-parallelism. Journal of Machine Learning Research, 23(1), 7714–7741.
Chen, T., Xu, B., Zhang, C., & Guestrin, C. (2016) Training deep nets with sublinear memory cost. arXiv:1604.06174.
Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., & Graves, A. (2016). Memory-efficient backpropagation through time. NeurIPS, 29, 4125–4133.
Gomez, A.N., Ren, M., Urtasun, R., & Grosse, R.B. (2017). The reversible residual network: Backpropagation without storing activations. In NeurIPS, pp. 2214–2224.
Salimans, T. & Bulatov, Y. (2017) Gradient checkpointing.
Jacobsen, J.-H., Smeulders, A., & Oyallon, E. (2018). i-Revnet: Deep invertible networks. arXiv:1802.07088.
Lee, D.-H., Zhang, S., Fischer, A., & Bengio, Y. (2015). Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 498–515.
Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G.E., & Lillicrap, T. (2018). Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In NeurIPS, pp. 9368–9378.
Lillicrap, T. P., Cownden, D., Tweed, D. B., & Akerman, C. J. (2014). Random feedback weights support learning in deep neural networks. arXiv:1411.0247.
Nøkland, A. (2016). Direct feedback alignment provides learning in deep neural networks. In NeurIPS, pp. 1037–1045.
Taylor, G., Burmeister, R., Zheng, X., Singh, B., Patel, A., & Goldstein, T. (2016). Training neural networks without gradients: A scalable admm approach. In ICML, pp. 2722–2731.
Choromanska, A., Luss, R., Rish, I., Kingsbury, B., Tejwani, R., & Bouneffouf, D. (2018). Beyond backprop: Alternating minimization with co-activation memory. Statistics, 1050, 24.
Huo, Z., Gu, B., Yang, Q., & Huang, H. (2018). Decoupled parallel backpropagation with convergence guarantee. In ICML, pp. 2098–2106.
Huo, Z., Gu, B., & Huang, H. (2018). Training neural networks using features replay. In NeurIPS, pp. 6659–6668.
Shwartz-Ziv, R., & Tishby, N. (2017) Opening the black box of deep neural networks via information. arXiv:1703.00810.
Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., & Cox, D. D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12), 124020.
Tishby, N., Pereira, F.C., & Bialek, W. (2000) The information bottleneck method. arXiv:physics/0004057.
Achille, A., & Soatto, S. (2018). Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(1), 1947–1980.
Alemi, A.A., Fischer, I., Dillon, J.V., & Murphy, K. (2016) Deep variational information bottleneck. arXiv:1612.00410.
van den Oord, A., Li, Y, & Vinyals, O. (2018) Representation learning with contrastive predictive coding. arXiv:1807.03748.
Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., & Isola, P. (2020). What makes for good views for contrastive learning. arXiv:2005.10243.
Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., & Bengio, Y. (2019) Learning deep representations by mutual information estimation and maximization. In ICLR.
Li, Y., Yang, M., Peng, D., Li, T., Huang, J., & Peng, X. (2022). Twin contrastive learning for online clustering. International Journal of Computer Vision, 130(9), 2205–2221.
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9729–9738.
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Coates, A., Ng, A., & Lee, H. (2011). An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pp. 215–223.
Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., & Tu, Z. (2015). Deeply-supervised nets. In AISTATS, pp. 562–570.
Tschannen, M., Djolonga, J., Rubenstein, P.K., Gelly, S., Lucic, M. (2020) On mutual information maximization for representation learning. In ICLR.
Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., & Hjelm, R. D. (2018). Mutual information neural estimation. In ICML, pp. 531–540.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103.
Rifai, S., Bengio, Y., Courville, A., Vincent, P., & Mirza, M. (2012). Disentangling factors of variation for facial expression recognition. In ECCV, pp. 808–822.
Kingma, D.P., & Welling, M. (2013) Auto-encoding variational bayes. arXiv:1312.6114.
Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2015) Adversarial autoencoders. arXiv:1511.05644.
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., & Krishnan, D. (2020). Supervised contrastive learning. In NeurIPS , pp. 18661–18673.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR, pp. 248–255.
Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223.
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In CVPR, pp. 633–641.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). Hmdb: a large video database for human motion recognition. In ICCV, pp. 2556–2563.
Soomro, K., Zamir, A.R., Shah, M. (2012) Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402.
Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al. (2017). The “something something” video database for learning and evaluating visual common sense. In ICCV, pp. 5842–5850.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., & Dollár, P. (2014). Microsoft coco: Common objects in context. In ECCV, pp. 740–755.
Wang, Y., Pan, X., Song, S., Zhang, H., Huang, G., & Wu, C. (2019). Implicit semantic data augmentation for deep networks. In NeurIPS, pp. 12635–12644.
Tarvainen, A., & Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, pp. 1195–1204.
Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., & Malik, J. (2011). Semantic contours from inverse detectors. In ICCV, pp. 991–998.
Chen, L.C., Papandreou, G., Schroff, F., & Adam, H. (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587.
MMSegmentation Contributors. (2020) Mmsegmentation, an open source semantic segmentation toolbox. https://github.com/open-mmlab/mmsegmentation.
Wang, Y., Chen, Z., Jiang, H., Song, S., Han, Y., & Huang, G. (2021). Adaptive focus for efficient video recognition. In ICCV, pp. 16249–16258.
Wang, Y., Yue, Y., Lin, Y., Jiang, H., Lai, Z., Kulikov, V., Orlov, N., Shi, H., & Huang, G. (2022a). Adafocus v2: End-to-end training of spatial dynamic networks for video recognition. In CVPR, pp. 20030–20040.
Wang, Y., Yue, Y., Xu, X., Hassani, A., Kulikov, V., Orlov, N., Song, S., Shi, H., & Huang, G. (2022b). Adafocusv3: On unified spatial-temporal dynamic video recognition. In ECCV, pp. 226–243.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., & Lin, D. (2019). MMDetection: Open mmlab detection toolbox and benchmark. arXiv:1906.07155.
Jacobsen, J.-H., Smeulders, A. W. M., & Oyallon, E. (2018). i-Revnet: Deep invertible networks. In ICLR.
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2018). Autoaugment: Learning augmentation policies from data. arXiv:1805.09501.
Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2020). Randaugment: Practical automated data augmentation with a reduced search space. In CVPRW, pp. 702–703.
Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang, Y. (2020). Random erasing data augmentation. In AAAI, pp. 13001–13008.
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv:1710.09412.
Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, pp. 6023–6032.
Huang, G., Sun, Y., Liu, Z., Sedra, D., & Weinberger, K. Q. (2016). Deep networks with stochastic depth. In ECCV, pp. 646–661.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99.
Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., & Qiao, Y. (2023). Vision transformer adapter for dense predictions. In ICLR.
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In CVPR, pp. 2881–2890.
Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., & Lu, H. (2019). Dual attention network for scene segmentation. In CVPR, pp. 3146–3154.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In ICCV, pp. 2961–2969.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In CVPR, pp. 2117–2125.
Veit, A., Wilber, M. J., & Belongie, S. (2016). Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, pp. 550–558.
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv:2303.08774.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520.
Elfwing, S., Uchibe, E., & Doya, K. (2018). Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107, 3–11.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
Acknowledgements
This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0140407, and in part by the National Natural Science Foundation of China under Grant 42327901.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Jifeng Dai.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix: Architecture of Auxiliary Networks
In this section, we introduce the network architectures of \(\varvec{w}\), \(\varvec{\psi }\) and \(\varvec{\phi }\) used in our experiments. Note that \(\varvec{w}\) is a decoder that aims to reconstruct the input images from deep features, while \(\varvec{\psi }\) and \(\varvec{\phi }\) share the same architecture except for the last layer. The architectures used on CIFAR, SVHN and STL-10 are shown in Tables 18 and 19. The architectures on ImageNet for ResNet/ResNeXt and EfficientNet are shown in Tables 20, 21 and Tables 22, 23, respectively. We apply the same mobile inverted bottleneck convolutional layers (MobileConv) (Sandler et al., 2018; Tan and Le, 2019) and Swish activations (Elfwing et al., 2018; Tan and Le, 2019) as EfficientNet to its corresponding auxiliary networks. For DeiT-S, we simply use an MLP as \(\varvec{w}\) to reconstruct the model inputs, while \(\varvec{\psi }\) consists of a \(3\times 3\) convolutional layer with stride=2, followed by 3 standard Transformer layers. The architectures of \(\varvec{w}\) and \(\varvec{\psi }\) for semantic segmentation are shown in Tables 24 and 25, where the same networks are used with varying input sizes. Herein, “s”, “p” and “oc” refer to stride, padding and output channels, respectively. “conv.” refers to a single convolutional layer.
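For concreteness, the following is a minimal PyTorch sketch of the two DeiT-S auxiliary networks described above: an MLP decoder \(\varvec{w}\) that reconstructs input patches from token features, and \(\varvec{\psi }\), a stride-2 \(3\times 3\) convolution followed by 3 standard Transformer layers. The hidden sizes, head count, patch size and token-to-grid reshaping below are illustrative assumptions rather than the exact configuration in our tables.

import torch.nn as nn


class AuxDecoderW(nn.Module):
    # MLP decoder w: maps each patch token back to a flattened image patch.
    def __init__(self, embed_dim=384, patch_dim=16 * 16 * 3, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, patch_dim),
        )

    def forward(self, tokens):                     # tokens: (B, N, C) patch tokens
        return self.mlp(tokens)                    # (B, N, patch_dim)


class AuxHeadPsi(nn.Module):
    # psi: a stride-2 3x3 convolution followed by 3 standard Transformer layers.
    def __init__(self, embed_dim=384, num_classes=1000, grid_size=14, num_heads=6):
        super().__init__()
        self.grid_size = grid_size
        self.down = nn.Conv2d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):                     # tokens: (B, N, C), N = grid_size^2
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, self.grid_size, self.grid_size)
        x = self.down(x)                           # (B, C, grid_size/2, grid_size/2)
        x = x.flatten(2).transpose(1, 2)           # back to a token sequence
        x = self.blocks(x)
        return self.head(x.mean(dim=1))            # classify from mean-pooled tokens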
In particular, in all of our experiments, DGL (Belilovsky et al., 2020) uses exactly the same auxiliary network architectures as ours.
Appendix: Details of Mutual Information Estimation
In this section, we describe how we obtain the estimates of \(I(\varvec{h}, \varvec{x})\) and \(I(\varvec{h}, y)\) presented in Figs. 2, 3 and 5.
As we have discussed in Sect. 4.3, the expected reconstruction error \(\mathcal {R}(\varvec{x}|\varvec{h})\) satisfies \(I(\varvec{h}, \varvec{x})\!=\!H(\varvec{x})\!-\!H(\varvec{x}|\varvec{h})\!\ge \!H(\varvec{x})\!-\!\mathcal {R}(\varvec{x}|\varvec{h})\) (Vincent et al., 2008; Rifai et al., 2012; Kingma and Welling, 2013; Makhzani et al., 2015; Hjelm et al., 2019). Therefore, similar to Sect. 4.3, we estimate \(I(\varvec{h}, \varvec{x})\) by training a decoder parameterized by \(\varvec{w}\) to minimize the reconstruction loss, namely \(I(\varvec{h}, \varvec{x}) \approx \mathop {\max }_{\varvec{w}}[H(\varvec{x}) - \mathcal {R}_{\varvec{w}}(\varvec{x}|\varvec{h})]\). Note that, ideally, this bound can be arbitrarily tight provided that \(\varvec{w}\) has sufficient capacity. Specifically, we use the same network architecture as in Table 18, and train it for 10 epochs to minimize the average per-pixel binary cross-entropy reconstruction loss. An Adam (Kingma and Ba, 2014) optimizer with default hyper-parameters (lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0) is adopted. Naive as this procedure might seem, we find that it is sufficient to reconstruct the input images well given enough information, and that it distinguishes different values of \(I(\varvec{h}, \varvec{x})\) via the quality of the reconstructed images. Moreover, we are primarily concerned with comparing \(I(\varvec{h}, \varvec{x})\) between end-to-end training and various cases of greedy supervised learning rather than obtaining its exact values. The same training process is applied to all the intermediate features \(\varvec{h}\), and hence the comparisons are fair. Finally, since \(H(\varvec{x})\) is a constant, for ease of understanding, we simply report \(1-\text {AverageBinaryCrossEntropyLoss}(\varvec{x}|\varvec{h})\) as the estimate of \(I(\varvec{h}, \varvec{x})\), which is equivalent to adding the constant \(1 - H(\varvec{x})\) to the real estimate.
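For reference, a minimal PyTorch sketch of this estimation procedure is given below. The decoder passed in is assumed to follow Table 18 and to end with a sigmoid, and the inputs \(\varvec{x}\) are assumed to be raw pixel intensities in [0, 1]; the function and argument names are illustrative.

import torch
import torch.nn.functional as F


def estimate_mutual_info_hx(encoder, decoder, loader, device="cuda", epochs=10):
    # Train a decoder on frozen features h = encoder(x) for 10 epochs with a
    # per-pixel binary cross-entropy loss and an Adam optimizer with default
    # hyper-parameters, then report 1 - average BCE, i.e. the I(h, x) estimate
    # up to the constant 1 - H(x).
    encoder.eval()
    optimizer = torch.optim.Adam(decoder.parameters())   # lr=1e-3, betas=(0.9, 0.999)
    for _ in range(epochs):
        for x, _ in loader:
            x = x.to(device)                              # x in [0, 1]
            with torch.no_grad():
                h = encoder(x)                            # features are not updated
            loss = F.binary_cross_entropy(decoder(h), x)  # averaged over all pixels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Final average reconstruction loss over the dataset.
    total, count = 0.0, 0
    with torch.no_grad():
        for x, _ in loader:
            x = x.to(device)
            total += F.binary_cross_entropy(decoder(encoder(x)), x).item() * x.size(0)
            count += x.size(0)
    return 1.0 - total / count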
For \(I(\varvec{h}, y)\), as also discussed in Sect. 4.3, since \(I(\varvec{h}, y) = H(y) - H(y|\varvec{h}) = H(y) - \mathbb {E}_{(\varvec{h}, y)}[-\text {log}\ p(y|\varvec{h})]\), we train an auxiliary classifier \(q_{\varvec{\psi }}(y|\varvec{h})\) with parameters \(\varvec{\psi }\) to approximate \(p(y|\varvec{h})\), such that \(I(\varvec{h}, y)\!\approx \!\mathop {\max }_{\varvec{\psi }} \{H(y)-\frac{1}{N}[\sum _{i=1}^N\!-\text {log}\ q_{\varvec{\psi }}(y_i|\varvec{h}_i)]\}\). Here we simply adopt the test accuracy of \(q_{\varvec{\psi }}(y|\varvec{h})\) as the estimate of \(I(\varvec{h}, y)\), which is highly correlated with \(-\frac{1}{N}[\sum _{i=1}^N\!-\text {log}\ q_{\varvec{\psi }}(y_i|\varvec{h}_i)]\) (i.e., the negative cross-entropy loss). This can be viewed as the highest generalization performance that a classifier based on \(\varvec{h}\) is able to reach. Notably, we use a ResNet-32 as \(q_{\varvec{\psi }}\). For the inputs of \(q_{\varvec{\psi }}\), we up-sample \(\varvec{h}\) to \(32\times 32\) and map \(\varvec{h}\) to 16 channels at the first layer. All the training hyper-parameters of the ResNet-32 are the same as in Sect. 5.2.
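A minimal PyTorch sketch of this classifier-based estimate is shown below. The backbone is injected as an argument (a ResNet-32 in our experiments), and the wrapper and function names are illustrative; \(q_{\varvec{\psi }}\) is first trained on \((\varvec{h}, y)\) pairs with the hyper-parameters of Sect. 5.2, after which its test accuracy serves as the estimate.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureClassifier(nn.Module):
    # q_psi: up-sample h to 32x32, map it to 16 channels at the first layer,
    # then apply a CIFAR-style backbone (a ResNet-32 in our experiments).
    def __init__(self, in_channels, backbone):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, 16, kernel_size=1)
        self.backbone = backbone                   # expects 16-channel 32x32 inputs

    def forward(self, h):
        h = F.interpolate(h, size=(32, 32), mode="bilinear", align_corners=False)
        return self.backbone(self.proj(h))


@torch.no_grad()
def estimate_mutual_info_hy(encoder, classifier, test_loader, device="cuda"):
    # Test accuracy of q_psi(y | h), used as a proxy for max_psi [H(y) - CE].
    encoder.eval()
    classifier.eval()
    correct, total = 0, 0
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        logits = classifier(encoder(x))
        correct += (logits.argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total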
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, Y., Ni, Z., Pu, Y. et al. InfoPro: Locally Supervised Deep Learning by Maximizing Information Propagation. Int J Comput Vis 133, 2752–2782 (2025). https://doi.org/10.1007/s11263-024-02296-0