DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs

Kim, Donghyun; Heo, Byeongho; Han, Dongyoon

doi:10.1007/978-3-031-72646-0_23

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15061))

Included in the following conference series:

European Conference on Computer Vision

345 Accesses

Abstract

This paper revives Densely Connected Convolutional Networks (DenseNets) and reveals the underrated effectiveness over predominant ResNet-style architectures. We believe DenseNets’ potential was overlooked due to untouched training methods and traditional design elements not fully revealing their capabilities. Our pilot study shows dense connections through concatenation are strong, demonstrating that DenseNets can be revitalized to compete with modern architectures. We methodically refine suboptimal components - architectural adjustments, block redesign, and improved training recipes towards widening DenseNets and boosting memory efficiency while keeping concatenation shortcuts. Our models, employing simple architectural elements, ultimately surpass Swin Transformer, ConvNeXt, and DeiT-III - key architectures in the residual learning lineage. Furthermore, our models exhibit near state-of-the-art performance on ImageNet-1K, competing with the very recent models and downstream tasks, ADE20k semantic segmentation, and COCO object detection/instance segmentation. Finally, we provide empirical analyses that uncover the merits of the concatenation over additive shortcuts, steering a renewed preference towards DenseNet-style designs. Our code is available at https://github.com/naver-ai/rdnet.

D. Kim and D. Han—Equal contribution.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 64.99; Price excludes VAT (USA)

Softcover Book: USD 79.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

DeiT III: Revenge of the ViT

A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation

MutualNet: Adaptive ConvNet via Mutual Learning from Network Width and Resolution

Notes

1.
Increase in GR aims to address the overall low GR in the baseline at an architecture level, whereas the abovementioned GR decrease was to boost ER on a block level.
2.
https://github.com/mlfoundations/open_clip.

References

Github repository: Swin transformer for object detection. https://github.com/SwinTransformer/Swin-Transformer-Object-Detection
Ba, J., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
Bello, I., et al.: Revisiting ResNets: improved training and scaling strategies. Conf. Neural Inf. Process. Syst. (NeurIPS) 34, 22614–22627 (2021)
Google Scholar
Brock, A., De, S., Smith, S.L., Simonyan, K.: High-performance large-scale image recognition without normalization. In: International Conference on Machine Learning (ICML), pp. 1059–1071 (2021)
Google Scholar
Cai, Y., et al.: Reversible column networks. In: International Conference on Learning Representations (ICLR) (2023)
Google Scholar
Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12M: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 3558–3568 (2021)
Google Scholar
Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: Conference on Neural Information Processing Systems (NIPS), vol. 30 (2017)
Google Scholar
Cherti, M., et al.: Reproducible scaling laws for contrastive language-image learning. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2829 (2023)
Google Scholar
Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: RandAugment: practical automated data augmentation with a reduced search space. In: IEEE Transactions on Computer Vision and Pattern Recognition Workshop (CVPRW), pp. 702–703 (2020)
Google Scholar
Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: marrying convolution and attention for all data sizes. In: Conference on Neural Information Processing Systems (NeurIPS), pp. 3965–3977 (2021)
Google Scholar
Desai, K., Kaul, G., Aysola, Z., Johnson, J.: RedCaps: Web-curated image-text data created by the people, for the people. arXiv preprint arXiv:2111.11431 (2021)
DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
Ding, X., Zhang, X., Zhou, Y., Han, J., Ding, G., Sun, J.: Scaling up your kernels to 31x31: revisiting large kernel design in CNNs. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR) (2022)
Google Scholar
Dong, X., et al.: CSWin transformer: a general vision transformer backbone with cross-shaped windows. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR) (2022)
Google Scholar
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2020)
Google Scholar
Goyal, P., et al.: Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
Guo, M.H., Lu, C.Z., Liu, Z.N., Cheng, M.M., Hu, S.M.: Visual attention network. Computational Visual Media (CVMJ) (2023)
Google Scholar
Han, D., Kim, J., Kim, J.: Deep pyramidal residual networks. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 5927–5935 (2017)
Google Scholar
Han, D., Yoo, Y., Kim, B., Heo, B.: Learning features with parameter-free layers. arXiv preprint arXiv:2202.02777 (2022)
Han, D., Yun, S., Heo, B., Yoo, Y.: Rethinking channel dimensions for efficient model design. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 732–741 (2021)
Google Scholar
Hassani, A., Walton, S., Li, J., Li, S., Shi, H.: Neighborhood attention transformer. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 6185–6194 (2023)
Google Scholar
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: International Conference on Computer Vision (ICCV) (2017)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
Chapter Google Scholar
Howard, A.G., et al.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
Huang, G., Liu, Z., Pleiss, G., Maaten, L.v.d., Weinberger, K.Q.: Convolutional networks with dense connectivity. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 44(12), 8704–8716 (2022)
Google Scholar
Huang, G., Sun, Yu., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_39
Chapter Google Scholar
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (ICML), pp. 448–456. PMLR (2015)
Google Scholar
Jégou, S., Drozdzal, M., Vazquez, D., Romero, A., Bengio, Y.: The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation. In: IEEE Transactions on Computer Vision and Pattern Recognition Workshop (CVPRW), pp. 11–19 (2017)
Google Scholar
Kornblith, S., Norouzi, M., Lee, H., Hinton, G.: Similarity of neural network representations revisited. In: International Conference on Machine Learning (ICML), pp. 3519–3529 (2019)
Google Scholar
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2012)
Article Google Scholar
Le, Y., Yang, X.S.: Tiny ImageNet visual recognition challenge (2015). https://api.semanticscholar.org/CorpusID:16664790
Lee, Y., Hwang, J.w., Lee, S., Bae, Y., Park, J.: An energy and GPU-computation efficient backbone network for real-time object detection. In: IEEE Transactions on Computer Vision and Pattern Recognition Workshop (CVPRW), pp. 752–760 (2019)
Google Scholar
Lee, Y., Park, J.: CenterMask: real-time anchor-free instance segmentation. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR) (2020)
Google Scholar
Li, S., et al.: MogaNet: multi-order gated aggregation network. In: International Conference on Learning Representations (ICLR) (2024). https://openreview.net/forum?id=XhYWgjqCrV
Li, Y., et al.: MicroNet: improving image recognition with extremely low flops. In: International Conference on Computer Vision (ICCV), pp. 468–477 (2021)
Google Scholar
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Chapter Google Scholar
Lin, W., Wu, Z., Chen, J., Huang, J., Jin, L.: Scale-aware modulation meet transformer. In: International Conference on Computer Vision (ICCV), pp. 5992–6003 (10 2023)
Google Scholar
Liu, S., et al.: More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. In: International Conference on Learning Representations (ICLR) (2023)
Google Scholar
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: International Conference on Computer Vision (ICCV) (2021)
Google Scholar
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 11976–11986 (2022)
Google Scholar
Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (ICLR) (2019)
Google Scholar
Parihar, A.S., Java, A.: Densely connected convolutional transformer for single image dehazing. J. Visual Commun. Image Represent. 90, 103722 (2023)
Google Scholar
Pleiss, G., Chen, D., Huang, G., Li, T., van der Maaten, L., Weinberger, K.Q.: Memory-efficient implementation of DenseNets. arXiv preprint arXiv:1707.06990 (2017)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML), pp. 8748–8763. PMLR (2021)
Google Scholar
Radosavovic, I., Kosaraju, R.P., Girshick, R.B., He, K., Dollár, P.: Designing network design spaces. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 10425–10433 (2020)
Google Scholar
Rao, Y., Zhao, W., Tang, Y., Zhou, J., Lim, S.N., Lu, J.: HorNet: efficient high-order spatial interactions with recursive gated convolutions. In: Conference on Neural Information Processing Systems (NeurIPS) (2022)
Google Scholar
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015)
Article MathSciNet Google Scholar
Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520 (2018)
Google Scholar
Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565 (2018)
Google Scholar
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Szegedy, C., et al.: Going deeper with convolutions. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 1–9 (2015)
Google Scholar
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826 (2016)
Google Scholar
Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning (ICML), pp. 6105–6114. PMLR (2019)
Google Scholar
Tan, M., Le, Q.V.: EfficientNetV2: smaller models and faster training. In: International Conference on Machine Learning (ICML) (2021)
Google Scholar
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning (ICML), pp. 10347–10357 (2021)
Google Scholar
Touvron, H., et al.: Augmenting convolutional networks with attention-based aggregation. arXiv preprint arXiv:2112.13692 (2021)
Touvron, H., Cord, M., J’egou, H.: DeiT III: revenge of the ViT. In: European Conference on Computer Vision (ECCV) (2022)
Google Scholar
Trockman, A., Kolter, J.Z.: Patches are all you need? arXiv preprint arXiv:2201.09792 (2022)
Tu, Z., et al.: MaxViT: multi-axis vision transformer. In: European Conference on Computer Vision (ECCV) (2022)
Google Scholar
Vaswani, A., et al.: Attention is all you need. In: Conference on Neural Information Processing Systems (NIPS) (2017)
Google Scholar
Wang, C.Y., Liao, H.y., Wu, Y.H., Chen, P.Y., Hsieh, J.W., Yeh, I.H.: CSPNet: a new backbone that can enhance learning capability of CNN. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 1571–1580 (2020)
Google Scholar
Wang, L., Cao, M., Yuan, X.: EfficientSCI: densely connected network with space-time factorization for large-scale video snapshot compressive imaging. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 18477–18486 (2023)
Google Scholar
Wang, R.J., Li, X., Ling, C.X.: Pelee: a real-time object detection system on mobile devices. In: Conference on Neural Information Processing Systems (NeurIPS), pp. 1967–1976 (2018)
Google Scholar
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: International Conference on Computer Vision (ICCV), pp. 548–558 (2021)
Google Scholar
Wang, W., et al.: PVT v2: improved baselines with pyramid vision transformer. Comput. Vis. Media (CVMJ) 8(3), 1–10 (2022)
Google Scholar
Wang, Z., Xie, K., Zhang, X.Y., Chen, H.Q., Wen, C., He, J.: Small-object detection based on yolo and dense block via image super-resolution. IEEE Access 9, 56416–56429 (2021)
Article Google Scholar
Wightman, R., Touvron, H., Jégou, H.: ResNet strikes back: An improved training procedure in timm. https://github.com/huggingface/pytorch-image-models (2021)
Wu, K., et al.: TinyViT: fast pretraining distillation for small vision transformers. In: European Conference on Computer Vision (ECCV) (2022)
Google Scholar
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 432–448. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_26
Chapter Google Scholar
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 1492–1500 (2017)
Google Scholar
Yang, J., Li, C., Dai, X., Gao, J.: Focal modulation networks. In: Conference on Neural Information Processing Systems (NeurIPS) (2022)
Google Scholar
Yang, J., et al.: Focal self-attention for local-global interactions in vision transformers. In: Conference on Neural Information Processing Systems (NeurIPS) (2021)
Google Scholar
Yu, W., et al.: MetaFormer is actually what you need for vision. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR), pp. 10819–10829 (2022)
Google Scholar
Yu, W., Zhou, P., Yan, S., Wang, X.: InceptionNext: When inception meets convnext. arXiv preprint arXiv:2303.16900 (2023)
Yun, S., et al.: CutMix: regularization strategy to train strong classifiers with localizable features. In: International Conference on Computer Vision (ICCV), pp. 6023–6032 (2019)
Google Scholar
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (ICLR) (2018)
Google Scholar
Zhang, J., Jin, Y., Xu, J., Xu, X., Zhang, Y.: MDU-Net: multi-scale densely connected u-net for biomedical image segmentation. Health Inf. Sci. Syst. (HISS) 11(1), 13 (2023)
Google Scholar
Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: AAAI Conference on Artificial Intelligence (AAAI), pp. 13001–13008 (2020)
Google Scholar
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vis. (IJCV) 127, 302–321 (2018)
Article Google Scholar
Zhu, L., Wang, X., Ke, Z., Zhang, W., Lau, R.: BiFormer: vision transformer with bi-level routing attention. In: IEEE Transactions on Computer Vision and Pattern Recognition (CVPR) (2023)
Google Scholar

Download references

Author information

Authors and Affiliations

NAVER Cloud AI, Seongnam-si, South Korea
Donghyun Kim
NAVER AI Lab, Seongnam-si, South Korea
Byeongho Heo & Dongyoon Han

Authors

Donghyun Kim
View author publications
You can also search for this author in PubMed Google Scholar
Byeongho Heo
View author publications
You can also search for this author in PubMed Google Scholar
Dongyoon Han
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3225 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Kim, D., Heo, B., Han, D. (2025). DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15061. Springer, Cham. https://doi.org/10.1007/978-3-031-72646-0_23

Download citation

DOI: https://doi.org/10.1007/978-3-031-72646-0_23
Published: 28 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72645-3
Online ISBN: 978-3-031-72646-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs