1 Introduction

Self-supervised learning (SSL) has gained popularity as a learning technique for acquiring meaningful representations in an unsupervised manner. By training on large unlabeled datasets using self-supervised pretext and auxiliary tasks, SSL produces features that can efficiently be applied to downstream tasks with fewer labels, as demonstrated in Ahmed et al. (2021), Caron et al. (2021), Assran et al. (2022), He et al. (2021), Xie et al. (2021), Atito et al. (2021). Furthermore, SSL has enabled Vision Transformers (ViTs) (Dosovitskiy et al., 2020) to outperform Convolutional Neural Networks (CNNs) in various image-related tasks, including classification, detection, and segmentation (Ahmed et al., 2021; Xie et al., 2021; He et al., 2021; Zhou et al., 2021).

Fig. 1 Accuracy of different methods on the low-shot classification task for 1, 2, and 5 images per label, and for 1% of the images, of ImageNet-1K

Self-supervised methods for modelling global discriminative features often employ either contrastive pretext tasks (Chen et al., 2020a, b, c, 2021; He et al., 2019) or clustering pretext tasks (Caron et al., 2021, 2020; Assran et al., 2022). On the other hand, an alternative avenue of SSL, i.e. Masked Image Modeling (MIM) methods, has emerged with a distinct focus on capturing contextual information by reconstruction either at the pixel level (Atito et al., 2022; Xie et al., 2021; He et al., 2021) or the token level (Wei et al., 2021; Bao et al., 2021), thereby lacking the discriminative details that are crucial for generating globally informative features. Jiang et al. (2023) explored a combination of contrastive pretext tasks and MIM in the pixel space, which has been shown to yield improved representations.

iBoT (Zhou et al., 2021) follows a similar approach, but replaces the contrastive task with a clustering task at both the global and patch levels and utilises MIM for token-level masked region prediction. Further, MSN (Assran et al., 2022) proposes the ME-MAX loss instead of the centring trick proposed in DINO (Caron et al., 2021) for better low-shot linear evaluation.

All previous methods have demonstrated strong performance in the large dataset size regime. However, their performance in low-shot scenarios has been largely overlooked, with the notable exception of the MSN approach. Even so, MSN does not analyse the effect of different SSL components, such as the choice of the pretext task and the collapse avoidance mechanism, on low-shot learning. Moreover, MSN confines its evaluation to low-shot linear classification on ImageNet-1K, which presents several limitations. For one, pre-training with a contrastive or clustering loss function typically results in high linear evaluation performance, especially when the model is assessed on the same dataset on which it was pre-trained. This outcome can falsely imply effectiveness in low-shot scenarios. Additionally, this narrow method of assessment does not accurately predict the model's transferability to different tasks and datasets. Hence, there is a notable gap in the literature in comprehensive, system-level analyses of the impact of SSL and its components on low-shot applications.

In this paper, we perform a detailed study of the impact of pretext tasks, i.e. clustering, contrastive learning, and MIM, and of the choice of a collapse avoidance method, i.e. ME-MAX, Sinkhorn, and centring, on the performance of low-shot downstream tasks. In addition, we also study the effect of extending the instance discrimination pretext task to the patch level. We provide an overview of the different pretext tasks and collapse avoidance mechanisms used by previous frameworks in Table 1.

Based on the above analysis, we investigate a simple model with a combination of two different pretext tasks, namely clustering and MIM, for low-shot learning. Clustering is done at both the class token level, to capture global semantics, and the patch level, to capture local semantics. We perform MIM at the pixel level, in addition to clustering, to capture fine-grained details. When evaluated on several low-shot downstream tasks, namely multi-label classification, multi-class classification and semantic segmentation, the proposed simple model works better due to its ability to capture details at various levels. We also present the performance of state-of-the-art self-supervised models on these downstream tasks. Figure 1 shows the performance comparison of various SSL methods in low-shot classification on ImageNet-1K. To analyse the scaling behaviour on full datasets, we finetune the model under standard finetuning evaluation settings following previous SSL approaches (Caron et al., 2021; Atito et al., 2021; Zhou et al., 2021). We find that our model performs favourably in these settings as well.

Table 1 A review of different self supervised methods and their pretext tasks and collapse avoidance mechanisms

2 Related Works

The early SSL methods in computer vision relied on simple pretext tasks, such as solving jigsaw puzzles (Noroozi & Favaro, 2016), predicting colour from grayscale images (Xie et al., 2016), or classifying relative positions (Doersch et al., 2015). However, recent advances introduced more sophisticated pretext tasks with complex training objectives. Generative methods, which randomly mask parts of the input and predict those regions at the pixel or token level, gained popularity in SSL (Ahmed et al., 2021; Vincent et al., 2010; Chen et al., 2020d; Atito et al., 2022; Xie et al., 2021; He et al., 2021). GMML in SiT (Ahmed et al., 2021) was the first ViT method to demonstrate that a masked autoencoder, i.e., randomly masking large proportions of image patches and reconstructing them, is a strong self-supervised pretext task capable of outperforming supervised pretraining. BeiT (Bao et al., 2021) extended the masked autoencoder idea using a discrete variational autoencoder (dVAE) for token generation and prediction. SimMIM (Xie et al., 2021) and MAE (He et al., 2021) scaled up the idea of heavy masking and recovery of information using an autoencoder-style approach for pixel-wise reconstruction. GMML, MAE, and SimMIM reconstruct at the pixel level, whereas BeiT reconstructs at the token level. These methods do not enforce global-level representation consistency across different views of the same image.

Contrastive methods, on the other hand, learn invariance by emphasising similarity between positive views and reducing similarity between negative views using the InfoNCE loss (Oord & Vinyals, 2018). SimCLR (Chen et al., 2020a) highlighted the importance of data augmentation, while MoCo (He et al., 2019) introduced a memory bank to address the issue of large batch size. SiT (Ahmed et al., 2021) combined MIM with contrastive learning, leading to performance improvements. Clustering-based methods achieve invariance by learning similar cluster assignments for different augmented views. SwAV (Caron et al., 2020) used cluster assignments as a supervisory signal, and DINO (Caron et al., 2021) emphasised the role of the momentum encoder and multiple crops for SSL. Collapse, where embeddings do not have enough variance in the representation space, is a major issue for self-supervised methods, as it produces trivial solutions. Recent methods have used asymmetry in design (Chen & He, 2020), Sinkhorn normalisation of the teacher cluster assignments (Caron et al., 2020), momentum encoders to generate target embeddings (He et al., 2019), and centring to make the teacher distribution more uniform, together with sharpening (Caron et al., 2021), to avoid collapse. MSN (Assran et al., 2022) is a variant of DINO, where collapse is avoided with the ME-MAX loss instead of centring or Sinkhorn, showing superior performance in low-shot linear evaluation. By default, MSN applies ME-MAX together with Sinkhorn to avoid having to set the scaling factor for the ME-MAX loss. iBoT (Zhou et al., 2021) extends DINO with masking and clustering applied to both patch and class tokens, yet iBoT lacks fine-grained context due to the absence of pixel reconstruction.

Recent works have established benchmarks for evaluating SSL methods in the natural image, speech, and medical domains. For instance, a systematic analysis of SSL pretraining methods was conducted, evaluating their scalability across multiple datasets and tasks, including object detection and segmentation (Goyal et al., 2019). However, this study primarily focuses on general downstream performance without addressing low-shot learning scenarios. Similarly, DABS, a domain-agnostic benchmark for SSL, was proposed, with evaluations spanning diverse domains such as natural images and speech (Tamkin et al., 2021). This benchmark, however, does not explore performance in extreme low-data regimes. Another investigation into SSL methods for pathology images highlighted their efficacy in medical imaging tasks (Kang et al., 2022), yet the focus remains on standard downstream evaluations rather than low-shot settings. These benchmarks illustrate the progress made in SSL evaluation, but none explicitly address the critical challenge of low-shot learning, underscoring the motivation for focusing on this underexplored area.

Few-shot learning has been extensively studied over the years, particularly in classification and segmentation tasks. Prototypical Networks learn a metric space where classification is achieved by computing distances to prototype representations for each class (Snell et al., 2017). Relation Networks leverage neural networks to learn a similarity metric specifically designed for few-shot classification (Sung et al., 2018). Model-Agnostic Meta-Learning (MAML) enables rapid adaptation of models to new tasks with limited data (Finn et al., 2017). An optimization-based meta-learning framework was proposed, laying the foundation for several subsequent few-shot methods (Ravi & Larochelle, 2017). In segmentation, OSLSM enables adaptation to novel classes with a minimal number of examples (Shaban et al., 2017). CANet improves segmentation performance through iterative refinement of predictions (Zhang et al., 2019). The FSS-1000 dataset provides pixel-wise annotations and serves as a rigorous benchmark for evaluating few-shot segmentation models (Li et al., 2021). PANet incorporates prototype alignment to enhance segmentation performance in few-shot settings (Wang et al., 2019). Feature-Proxy Transformer (FPTrans) introduces a novel architecture for few-shot segmentation by leveraging proxy representations to better capture class-specific features with minimal examples, further advancing the field (Zhang et al., 2022). Despite these advancements, none of these works explicitly study the effects of self-supervised learning (SSL) on few-shot classification and segmentation tasks. This gap motivates our work, which investigates the impact of SSL pretraining on low-shot finetuning performance across diverse datasets and tasks.

Semi-supervised methods (Lucas et al., 2022; Sohn et al., 2020) have considered extreme low-data scenarios, which have been mostly overlooked by the SSL community with the exception of MSN (Assran et al., 2022). Moreover, the existing literature lacks a comprehensive examination of the impact of various SSL components on low-shot learning performance. While MSN demonstrates superior low-shot linear evaluation through the use of the ME-MAX loss, it falls short of exploring the influence of diverse pretext tasks and collapse avoidance mechanisms on various low-shot downstream tasks. Additionally, its exclusive focus on low-shot linear evaluation using the pretraining dataset (ImageNet-1K) may not generalize effectively to different datasets and downstream tasks. Hence, our emphasis is on low-shot finetuning across diverse tasks and datasets. We perform a thorough analysis of different components, including the choice of pretext task and the choice of collapse avoidance mechanism. Motivated by our findings, we propose a method with multiple pretext tasks: clustering and masked image modelling. The introduced model applies clustering to both class and patch tokens and performs reconstruction with a pixel-level loss. When compared to other SSL methods, we find that our model performs the best across several low-shot downstream tasks. The performance also scales well when finetuning on full datasets.

Fig. 2 An overview of the MaskCluster architecture for low-shot. Multiple global views are generated, which are then masked to generate masked global crops. Unmasked global views are processed by a teacher encoder, which consists of multiple transformer layers, to generate output embeddings \({\textbf{Z}}^{e, G}_{c}\), \({\textbf{Z}}^{e, G}_{p}\) corresponding to the class token and the patch tokens respectively. The teacher cluster layers \(CL_t\) and \(PL_t\) generate class and patch cluster assignments \({\textbf{P}}^{G}_{c}, {\textbf{P}}^{G}_{p}\) from the embeddings \({\textbf{Z}}^{e, G}_{c}\), \({\textbf{Z}}^{e, G}_{p}\) respectively. A similar procedure is followed by the student to generate output embeddings \(\bar{\textbf{Z}}^{e, G}_{c}, \bar{\textbf{Z}}^{e, G}_{p}\) for the masked global crops, which are used to generate cluster assignments \(\bar{\textbf{P}}^{G}_{c}, \bar{\textbf{P}}^{G}_{p}\) by the student clustering layers \(CL_s\) and \(PL_s\) for the class and patch tokens respectively. Further, the masked global output patch embeddings \(\bar{\textbf{Z}}^{e, G}_{p}\) are reconstructed back to the pixel space with the help of the reconstruction head attached to the student. Finally, the cross entropy loss is applied between the teacher and student cluster assignments, and the \(\ell \)1-loss is used for reconstruction between the original view and the reconstructed image

3 Details of SSL Components and Analysis

We aim to provide a detailed analysis of the effect of the choice of pretext tasks and the choice of a collapse avoidance mechanism on low-shot downstream tasks. In addition, we propose an architecture which is based on the findings of the study. We believe that the choice of pretext task has a huge impact when finetuning on low-shot tasks. Generally, clustering/contrastive learning focuses only on instance-level discrimination and can be classified as an instance discrimination task. Focusing on a single global instance might not be beneficial in low data regimes. We believe that fine-grained contextual information is necessary for the model to perform better in low data regimes. Collapse avoidance also plays a crucial role when evaluating the self-supervised model on low-shot data (Assran et al., 2022). We therefore present a study on the effects of collapse avoidance mechanisms such as ME-MAX (Assran et al., 2022), Sinkhorn (Caron et al., 2020) and centring (Caron et al., 2021).

In addition, we answer the question of whether instance discrimination should be applied only at the class token level or at both the class and patch levels. iBoT (Zhou et al., 2021) shows that applying instance discrimination at both the patch and class levels helps when finetuning on full-scale data. We avoid studying different architectures and stick to the vision transformer, particularly ViT-S (Dosovitskiy et al., 2020), for fair comparison. Table 1 presents an overview of the different components used by previous SSL frameworks. Based on this detailed study, we introduce a low-shot capable self-supervised model which also scales to large-scale datasets.

3.1 Introduction of Different SSL Pretext Tasks

We provide a brief introduction to different pretext tasks and their formulation. In this study we focus on contrastive learning, clustering, and masked image modelling with pixel reconstruction as the main pretext tasks. Instance discrimination pretext tasks like clustering and contrastive learning are better at learning semantics. MIM-based pretext tasks, which reconstruct the masked image at the pixel level, learn local context. We define these tasks in the context of the vision transformer (Dosovitskiy et al., 2020).

Contrastive Learning: Contrastive learning, introduced for self-supervision in SimCLR (Chen et al., 2020a), generally has 2N augmented data points generated from N original images, where each image generates two random augmented views. Let \({\textbf{z}}_i\) be the output embedding generated from the \(i_{th}\) data point after passing through a vision transformer with a projection head attached (Chen et al., 2020a). If \({\textbf{z}}_i\), \({\textbf{z}}_j\) are the embeddings of a positive pair, then the contrastive loss is given by Eq. 1.

$$\begin{aligned} {\mathcal {L}}_{con}(i, j) = -\log \frac{\exp (\text {sim}({\textbf{z}}_{i}, {\textbf{z}}_{j})/\tau )}{\sum ^{2N}_{k=1}\mathbbm {1}_{[k\ne i]}\exp (\text {sim}({\textbf{z}}_{i}, {\textbf{z}}_{k})/\tau )} \end{aligned}$$
(1)

Here the loss \({\mathcal {L}}_{con}(i, j)\) is for a single positive pair (i, j), and it is applied to all positive pairs, including (j, i). \(\mathbbm {1}_{[k\ne i]} \in \{0, 1\}\) is an indicator function that equals 1 if \(k\ne i\).
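To make Eq. 1 concrete, the following is a minimal PyTorch sketch of the contrastive loss; it is illustrative only, and the batch layout (views \(i\) and \(i+N\) form a positive pair) and the temperature value are our own assumptions rather than the exact SimCLR implementation.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Contrastive loss of Eq. 1 for a batch of 2N embeddings.

    z: (2N, d) embeddings, where rows i and i + N are the two augmented
    views of the same image (an assumed layout for this sketch).
    """
    z = F.normalize(z, dim=1)                       # cosine similarity via dot product
    sim = z @ z.t() / tau                           # (2N, 2N) scaled similarity matrix
    n = z.shape[0] // 2

    # remove self-similarity so it never enters the denominator (the indicator in Eq. 1)
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))

    # the positive of sample i is its other view: i <-> i + N
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)

    # cross-entropy over the remaining 2N - 1 candidates realises Eq. 1 for all pairs
    return F.cross_entropy(sim, targets)
```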

Clustering: Clustering is a negative-sample-free pretext task, which makes it less sensitive to the choice and number of negative samples within the batch. It also does not require the large batch sizes or memory bank used in contrastive learning (Chen et al., 2020a, 2021). We generate two random global augmented views \({\textbf{x}}_{g1}\), \({\textbf{x}}_{g2}\) from an input image \({\textbf{x}}\). Caron et al. (2021) and Zhou et al. (2021) also use several local crops in addition to the global crops, but we skip them in the loss formulation for simplicity. Generally, clustering requires a teacher and a student, where the teacher is an exponential moving average (EMA) of the student. Let \({\textbf{c}}_{g1}\), \({\textbf{c}}_{g2}\) be the predicted cluster assignments corresponding to the class token of the student after passing through the projection head, similar to DINO (Caron et al., 2021). Let \(\bar{\textbf{c}}_{g1}\), \(\bar{\textbf{c}}_{g2}\) be the target cluster assignments corresponding to the class token of the teacher after passing through its projection head. Then the class-level clustering loss is defined as the cross entropy between \({\textbf{c}}_{g1}\), \(\bar{\textbf{c}}_{g2}\) and between \({\textbf{c}}_{g2}\), \(\bar{\textbf{c}}_{g1}\), given by Eq. 2.

$$\begin{aligned} {\mathcal {L}}_{cc} = \frac{1}{2}(\text {H}({\textbf{c}}_{g1}, \bar{\textbf{c}}_{g2}) + \text {H}({\textbf{c}}_{g2}, \bar{\textbf{c}}_{g1})) \end{aligned}$$
(2)

iBoT (Zhou et al., 2021) considers extending the clustering loss to the patches as well, for which we provide the formulation in Eq. 3. Let \({\textbf{p}}_{g1}^{i}\), \({\textbf{p}}_{g2}^{i}\) be the predicted cluster assignments corresponding to patch token \(i\in \{1 \dots N\}\) of the student encoder after the projection head, where N is the number of tokens. Similarly, the teacher encoder generates the target patch cluster assignments \(\bar{\textbf{p}}_{g1}^{i}\), \(\bar{\textbf{p}}_{g2}^{i}\).

$$\begin{aligned} {\mathcal {L}}_{pt} = \frac{1}{2N}\sum _{i=1}^{N}(\text {H}({\textbf{p}}_{g1}^i, \bar{\textbf{p}}_{g1}^i) + \text {H}({\textbf{p}}_{g2}^i, \bar{\textbf{p}}_{g2}^i)) \end{aligned}$$
(3)
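As an illustration of Eqs. 2 and 3, the sketch below computes the class- and patch-level clustering losses in PyTorch, assuming the student produces logits and the teacher assignments have already been sharpened into probability distributions; the variable names and the student temperature are our own assumptions.

```python
import torch.nn.functional as F


def cluster_ce(student_logits, teacher_probs, tau_s: float = 0.1):
    """Cross entropy H(., .) between teacher targets and student predictions."""
    log_p_s = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(teacher_probs * log_p_s).sum(dim=-1).mean()


def class_clustering_loss(c_g1, c_g2, c_bar_g1, c_bar_g2):
    """Eq. 2: swapped prediction between the two global views (class token); (B, K) tensors."""
    return 0.5 * (cluster_ce(c_g1, c_bar_g2) + cluster_ce(c_g2, c_bar_g1))


def patch_clustering_loss(p_g1, p_g2, p_bar_g1, p_bar_g2):
    """Eq. 3: prediction of the teacher patch assignments of the same view,
    averaged over the N patch tokens; (B, N, K) tensors."""
    return 0.5 * (cluster_ce(p_g1, p_bar_g1) + cluster_ce(p_g2, p_bar_g2))
```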

Masked Image Modelling: Masked image modelling was explored in SiT (Ahmed et al., 2021) for pixel-level reconstruction and has been utilised by several following works (He et al., 2021; Xie et al., 2021). Pixel-level reconstruction captures the fine-grained information required in the low data regime (Atito et al., 2022). If \({\textbf{x}}\) is an input image, we generate a masked image \({\textbf{x}}_m\) with mask \({\textbf{M}}\), where \({\textbf{M}}(i, j)=1\) if the pixel \({\textbf{x}}(i, j)\) is masked. The masked image \({\textbf{x}}_m\) is passed through a vision transformer encoder and a lightweight reconstruction head (Xie et al., 2021) that produces a reconstructed image \(\hat{\textbf{x}}\). The reconstruction loss is the \(\ell \)1-loss given in Eq. (4).

$$\begin{aligned} & {\mathcal {L}}_{mim} = \sum _{i=1}^{H} \sum _{j=1}^{W} {\textbf{M}}(i, j) \times |{\textbf{x}}(i, j) - \hat{\textbf{x}}(i, j)| \end{aligned}$$
(4)
$$\begin{aligned} & {\textbf{M}}(i, j) = {\left\{ \begin{array}{ll} 1 & \text {if }{\textbf{x}}(i, j) \, \text {is masked}\\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)
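A minimal sketch of the masked \(\ell \)1 reconstruction loss of Eqs. 4 and 5 follows; the tensor shapes and the normalisation by the number of masked pixels are our own assumptions (Eq. 4 is written as a plain sum over masked locations).

```python
import torch


def masked_l1_loss(x: torch.Tensor, x_hat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Eq. 4: l1 reconstruction error accumulated only over masked pixels.

    x, x_hat: (B, C, H, W) original and reconstructed images.
    mask:     (B, 1, H, W) binary mask of Eq. 5 (1 where the pixel was masked).
    """
    per_pixel = (x - x_hat).abs()                 # |x(i, j) - x_hat(i, j)|
    loss = (mask * per_pixel).sum()
    # normalisation by the number of masked pixels is an assumption of this sketch
    return loss / mask.sum().clamp(min=1.0)
```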

3.2 Collapse Avoidance

Collapse avoidance has mainly been applied to clustering methods (Caron et al., 2021; Assran et al., 2022; Zhou et al., 2021). We study three different methods, namely centring, explored in DINO (Caron et al., 2021) and iBoT (Zhou et al., 2021), Sinkhorn, explored in SwAV (Caron et al., 2020), and the ME-MAX loss, introduced in MSN (Assran et al., 2022). We refer the reader to the above-mentioned literature for the details of these collapse avoidance mechanisms, but provide a formulation of ME-MAX since we also extend it to patches. MSN studies the effect of ME-MAX only on the class token and by default combines it with Sinkhorn. ME-MAX also differs from the other methods in that it is applied as a loss on the student cluster assignments, whereas both Sinkhorn and centring can be considered a normalisation applied to the teacher target cluster assignments. Let \(\bar{\textbf{c}}\) be the average cluster assignment corresponding to the class token and \(\bar{\textbf{p}}\) the average patch-level cluster assignment, both generated from the student. We define ME-MAX at the class token level in Eq. 6 and at the patch level in Eq. 7, where \({\textbf{H}}\) is the entropy function.

$$\begin{aligned} & {\mathcal {L}}_{mc} = - {\textbf{H}}(\bar{\textbf{c}}) \end{aligned}$$
(6)
$$\begin{aligned} & {\mathcal {L}}_{mp} = - {\textbf{H}}(\bar{\textbf{p}}) \end{aligned}$$
(7)
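For reference, a minimal sketch of the ME-MAX regulariser of Eqs. 6 and 7 is given below: it maximises the entropy of the batch-averaged student assignment, pushing the network to use all clusters. The temperature and the flattening of patch tokens into the batch dimension are our own assumptions.

```python
import torch
import torch.nn.functional as F


def me_max_loss(student_logits: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Eqs. 6/7: negative entropy of the mean cluster assignment.

    student_logits: (B, K) for the class token, or (B * N, K) for patch tokens.
    """
    probs = F.softmax(student_logits / tau, dim=-1)
    mean_probs = probs.mean(dim=0)                               # average assignment over the batch
    entropy = -(mean_probs * torch.log(mean_probs + 1e-8)).sum()
    return -entropy                                              # minimising this maximises entropy
```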

3.3 Analysis

We use ViT-S as our base architecture to study the choice of pretext task and collapse avoidance mechanism on low-shot downstream tasks. We use low-shot classification on ImageNet-1K (Deng et al., 2009) to study the effect these choices have on performance. For 1-shot, 2-shot and 5-shot ImageNet-1K classification, we utilise the standard dataset made available by MSN (Assran et al., 2022), where each setting has three different splits. All the experiments are pretrained for a similar number of epochs and finetuned on the target dataset, with the accuracy reported as the mean over the three splits.

Table 2 Evaluation of pretext task on ImageNet-1K low-shot multi-class classification performance

Which pretext task to choose? We explore clustering and contrastive learning as instance discrimination pretext tasks, along with MIM for pixel reconstruction, emphasizing context. Comparative experiments, detailed in Table 2, reveal that clustering with ME-MAX loss outperforms contrastive learning. Focusing on a single semantic object is suboptimal for low-shot performance, as evidenced by inferior results in instance discrimination methods compared to masked image modeling. Clustering, unaffected by the choice of negative samples, surpasses contrastive learning. Combining clustering with MIM yields the best performance, underscoring the importance of fine-grained context and discriminative information for low-shot scenarios.

Table 3 Evaluation of collapse avoidance mechanisms on ImageNet-1K low-shot multi-class classification performance

Collapse avoidance makes a difference Collapse avoidance is performed either through centring (Caron et al., 2021), ME-MAX (Assran et al., 2022) or Sinkhorn (Caron et al., 2020). We evaluate the effect of all the above methods for collapse avoidance on low-shot evaluation performance in Table 3. We find that applying ME-MAX is better than Sinkhorn and centring. Forcing the network, through the loss, to use all the available clusters at the output helps the network in the low-shot regime.
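To contrast the teacher-side alternatives with the student-side ME-MAX loss above, the following is a loose sketch of centring and Sinkhorn normalisation as applied to teacher outputs; it is simplified for illustration (the actual DINO and SwAV implementations include temperatures and further normalisation that we omit), and the momentum and iteration counts are assumptions.

```python
import torch


def center_teacher(teacher_logits: torch.Tensor, center: torch.Tensor, momentum: float = 0.9):
    """Centring (DINO-style): subtract a running mean of the teacher logits.

    Returns the centred logits and the updated centre (an EMA over batches).
    """
    centred = teacher_logits - center
    center = momentum * center + (1 - momentum) * teacher_logits.mean(dim=0)
    return centred, center


def sinkhorn(teacher_probs: torch.Tensor, n_iters: int = 3) -> torch.Tensor:
    """Sinkhorn-Knopp normalisation (SwAV-style) of a (B, K) assignment matrix:
    alternately normalise clusters and samples so assignments spread over all clusters."""
    q = teacher_probs.clone()
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True).clamp(min=1e-8)   # equal total mass per cluster
        q = q / q.sum(dim=1, keepdim=True).clamp(min=1e-8)   # each sample sums to one
    return q
```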

Table 4 Evaluation of class level or both class and patch level clustering on ImageNet-1K low-shot multi-class classification performance

Is instance discrimination at the patch level needed? We assess the impact of patch-level instance discrimination on low-shot performance. Inspired by iBoT (Zhou et al., 2021), which demonstrates enhanced ImageNet-1K finetuning with patch clustering, we explore the potential benefits for low-shot scenarios. In Table 4, we analyze clustering applied solely at the class level, solely at the patch level, and at both the class and patch levels. Given clustering's superiority over contrastive learning (Table 2) and the effectiveness of clustering with ME-MAX (Table 3), applying clustering at both levels proves more effective for low-shot evaluation than at the class or patch level alone. Notably, patch clustering alone performs poorly due to the absence of global-level context, as it lacks the class-level information provided by class token clustering. This emphasizes the significance of local discriminative information, which enhances network performance in downstream tasks.

3.4 Simple Pretext Combination for Low-Shot

Based on the above analysis, we find that capturing information at various levels is required for low-shot learning. Thus, we introduce the MaskCluster model, which captures global and local semantics with the clustering pretext task while also obtaining local contextual information from the MIM pretext task with pixel-level reconstruction. The model has student and teacher networks, both based on the vision transformer (Dosovitskiy et al., 2020). We attach a projection head, similar to previous approaches (Ahmed et al., 2021; Zhou et al., 2021; Caron et al., 2021; Assran et al., 2022), to generate cluster assignments. The architecture overview is presented in Fig. 2. The design of the pretext tasks is discussed in the previous subsections, and our MIM pretext task closely follows GMML (Atito et al., 2022). In addition, our clustering uses the ME-MAX loss, which shows a slight improvement in the low-shot setting. Our total loss is provided in Eq. 8, which is the summation of the clustering loss at the patch and class levels, the ME-MAX loss on the patch and class cluster assignments, and finally the MIM reconstruction loss. Further implementation details are provided in the supplementary material.

$$\begin{aligned} {\mathcal {L}}_{total} = {\mathcal {L}}_{cc} + {\mathcal {L}}_{pt} + {\mathcal {L}}_{mim} + {\mathcal {L}}_{mc} + {\mathcal {L}}_{mp} \end{aligned}$$
(8)
Fig. 3 An efficient implementation of attention computation with cross and self attention components to slightly improve pretraining time. Self attention is computed between queries \(Q_{u}\), keys \(K_{u}\) and values \(V_{u}\). Cross attention is calculated between queries \(Q_{m}\), keys \(K_{u}\), and values \(V_{u}\). Cross attention and self attention values are concatenated to generate the final output \(X_{attn}\)

3.5 A Simple Trick to Improve Pretraining Time

Most of the time complexity of the transformer is due to the quadratic complexity of attention. When performing attention on masked images, unmasked tokens need not attend to masked tokens, as masked tokens lack any salient visual information, and masked tokens only need information from the unmasked tokens for reconstruction in the MIM task. Motivated by this, we introduce an efficient attention, illustrated in Fig. 3, for masked images that splits attention into two components: self-attention and cross-attention. Self attention, given by Eq. 9, is performed between unmasked tokens only; cross attention, given by Eq. 10, is performed between masked and unmasked tokens. Finally, the total attention in Eq. 11 is the concatenation of these attentions. In the above, \({\textbf{Q}}_{u}\), \({\textbf{K}}_{u}\), \({\textbf{V}}_{u}\) represent the query, key and value from the unmasked tokens. Similarly, \({\textbf{Q}}_{m}\) represents the query generated from the masked tokens. The generation of all queries, keys and values follows separate linear projection layers, similar to ViT (Dosovitskiy et al., 2020). Table 14 highlights increased throughput at various masking ratios, demonstrating enhancements even on a low-end GPU (RTX 3060). This attention is applied when processing the student masked global crops.

$$\begin{aligned} & \text {SA} = (\text {Softmax}({\textbf{Q}}_{u}{\textbf{K}}_{u}^T/\sqrt{d})){\textbf{V}}_{u} \end{aligned}$$
(9)
$$\begin{aligned} & \text {CA} = (\text {Softmax}({\textbf{Q}}_{m}{\textbf{K}}_{u}^T/\sqrt{d})){\textbf{V}}_{u} \end{aligned}$$
(10)
$$\begin{aligned} & \text {EA} = \text {Concat}(\text {SA}, \text {CA}) \end{aligned}$$
(11)
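A single-head PyTorch sketch of the split attention of Eqs. 9-11 is shown below; the separation of tokens into unmasked and masked sets, the absence of multi-head and output projections, and the variable names are our own simplifications for illustration.

```python
import math
import torch


def efficient_masked_attention(q_u, k_u, v_u, q_m):
    """Eqs. 9-11: self-attention among unmasked tokens plus cross-attention
    from masked queries to unmasked keys/values, concatenated.

    q_u, k_u, v_u: (B, N_u, d) queries, keys and values of unmasked tokens.
    q_m:           (B, N_m, d) queries of masked tokens.
    """
    scale = 1.0 / math.sqrt(q_u.shape[-1])

    # Eq. 9: unmasked tokens attend only to unmasked tokens
    sa = torch.softmax(q_u @ k_u.transpose(-2, -1) * scale, dim=-1) @ v_u

    # Eq. 10: masked tokens gather information from unmasked tokens only
    ca = torch.softmax(q_m @ k_u.transpose(-2, -1) * scale, dim=-1) @ v_u

    # Eq. 11: concatenate along the token dimension
    return torch.cat([sa, ca], dim=1)
```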

4 Experimental Results

We conduct a comprehensive evaluation of the pretrained model's performance in low-shot scenarios across three distinct downstream tasks: multi-class classification, multi-label classification, and semantic segmentation. Our assessment also employs the standard evaluation protocol using the complete datasets for the same tasks, as detailed in Sect. 4.2.

Performing both pretraining and linear evaluation on ImageNet-1K confers an unfair advantage to methods that effectively capture dominant semantic concepts, such as DINO (Caron et al., 2021), MSN (Assran et al., 2022), MoCo (He et al., 2019), and iBoT (Zhou et al., 2021). Additionally, the incorporation of pixel-level reconstruction (MIM) detrimentally affects linear evaluation performance on ImageNet-1K, as illustrated by MAE (He et al., 2021). To mitigate these biases, we opt to finetune on ImageNet-1K and perform linear evaluation on datasets encompassing multiple semantic concepts, namely PASCAL VOC, COCO, and VisualGenome.

We conduct ablation studies to dissect the individual components of our approach and analyze their respective contributions, as elaborated in Sect. 4.3. Furthermore, supplementary materials provide additional insights into training, finetuning details, and visualizations. For pretraining, we adhere to the methodology outlined in DINO (Caron et al., 2021), while for finetuning evaluation, we adopt DeiT (Touvron et al., 2020), making only necessary adjustments to learning rates and epochs. Our linear evaluation process follows the guidelines set forth by DINO (Caron et al., 2021), with tailored adjustments to learning rates for different datasets.

4.1 Pre-training Setup

We use ViT-S/16 as the base backbone architecture, pretrained on ImageNet-1K (Deng et al., 2009). Our approach follows the conventional multi-crop method, with two global crops of resolution \(224 \times 224\) and ten local crops of resolution \(96 \times 96\). The class and patch clustering layers have an output dimension of 8192. We pretrain the model for 800 epochs with a \(50\%\) masking ratio, following the GMML masking strategy (Atito et al., 2022). We use the AdamW optimiser (Loshchilov & Hutter, 2017) with a weight decay of 0.05, a learning rate of \(5 \times 10^{-4}\), a gradient clipping threshold of 3.0, and 15 warmup epochs.
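The optimiser setup described above can be sketched as follows; the stand-in model is hypothetical, and the learning-rate schedule, parameter grouping and multi-crop data pipeline (which follow DINO and GMML) are not reproduced here.

```python
import torch

# hypothetical stand-in for the ViT-S/16 student with projection and reconstruction heads
model = torch.nn.Linear(384, 8192)

# AdamW with the stated weight decay and base learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)


def training_step(loss: torch.Tensor) -> None:
    """One optimisation step with the stated gradient clipping threshold of 3.0."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=3.0)
    optimizer.step()
```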

Table 5 The low-shot performance evaluation on subsets of ImageNet-1K. All the models are pretrained on the ImageNet-1K dataset. We report the mean top-1 accuracy over three different splits

4.2 Main Results

Low-shot Multi-class Classification. To assess label efficiency, we fine-tune our model on smaller subsets of ImageNet-1K (Deng et al., 2009), utilizing MSN's (Assran et al., 2022) data subsets for 1, 2, or 5 images per label. Each of these datasets has three different splits, and we evaluate on all the splits while reporting the mean accuracy. Furthermore, we finetune on 1% of ImageNet-1K (Deng et al., 2009) based on the SimCLR (Chen et al., 2020a) split. Results (Fig. 1) highlight our approach's substantial performance lead over current state-of-the-art techniques, particularly under extremely limited data during fine-tuning. For instance, with only 1 image per label, our method attains 22.4% accuracy, surpassing the state-of-the-art by 14.4%. Surprisingly, we also perform better than MAE (He et al., 2021) with the much larger ViT-B as its encoder. The results are provided in Table 5.

Table 6 Pascal 5i low-shot segmentation results. Comparison of few-shot and self-supervised methods used to initialize ViT-S/16 in FPTrans. FPTrans (ViT init.) uses ImageNet-1K supervised weights. s0, s1, s2, s3 represent different splits

Low-shot Semantic Segmentation. To assess low-shot segmentation performance, we employ FPTrans (Zhang et al., 2022), a few-shot segmentation framework tailored for vision transformers. Table 6 presents traditional few-shot frameworks for semantic segmentation under low-shot settings, all of which initialize their backbones with ImageNet-1K supervised weights. We extend this evaluation to include self-supervised models by using their weights for initialization, including those of our proposed method. The evaluation is conducted on the Pascal 5i (Shaban et al., 2017) dataset under both one-shot and five-shot scenarios. Specifically, we initialize the ViT-S/16 backbone within FPTrans using pre-trained weights from self-supervised learning (SSL) methods and adopt the training settings specified in FPTrans. For comparative analysis, FPTrans is also trained using ImageNet-1K supervised weights for ViT-S/16. Our model surpasses other self-supervised approaches and delivers competitive performance compared to few-shot segmentation methods in both one-shot and five-shot settings (Table 6). These results highlight the remarkable effectiveness of our approach for low-shot semantic segmentation.

Table 7 Low-shot multi-label classification on Pascal VOC and mini COCO with mAP as a metric. All SSL models are pretrained on ImageNet-1K, employing ViT-S/16 backbone
Table 8 Linear evaluation of low-shot multi-label classification on Pascal VOC and mini COCO with mAP as a metric

Low-shot multi-label Classification. To assess the performance of our model in low-shot multi-label classification, we created datasets with 1, 2, and 5 images per label through random sampling with a fixed seed. Additionally, we included the Mini.COCO (Samet et al., 2020) dataset for evaluation, which contains only 20% of the MS-COCO (Lin et al., 2014) training data. All models were trained under similar conditions using a resolution of \(224 \times 224\). Results are summarized in Tables 7 and 8, where models were assessed on the full validation sets. Table 7 presents performance after fine-tuning on the low-shot target dataset, showing our method outperforming previous approaches, akin to our multi-class results. Table 8, employing the linear evaluation protocol, demonstrates similar trends to fine-tuning. Notably, linear evaluation yields superior results for 1 image per label from Pascal VOC (Everingham et al., 2010) compared to fine-tuning. As training data increases, fine-tuning performs competitively with linear evaluation. Interestingly, MSN's performance matches ours in linear evaluation with 5 images per label from Pascal VOC. Our model surpasses all SSL methods, showcasing exceptional labelling efficiency in multi-label classification tasks.

Multi-class Classification. We evaluate the performance of our method by finetuning the pretrained ImageNet-1K weights on various downstream datasets (Table 9). Our model achieves \(82.1\%\) top-1 accuracy on ImageNet-1K (Deng et al., 2009), which is comparable to state-of-the-art methods. Additionally, we showcase the effectiveness of our approach using the results obtained by finetuning on smaller datasets like Flowers (Nilsback & Zisserman, 2008), Cars (Krause et al., 2013), Pets (Parkhi et al., 2012), CIFAR-10 (Krizhevsky, 2009), and CIFAR-100 (Krizhevsky, 2009). Across all these datasets, our method consistently outperforms state-of-the-art approaches, demonstrating the strong transferability of our proposed framework.

Table 9 Transfer learning by finetuning pretrained models with the ViT-S/16 backbone on diverse datasets. We report top-1 accuracy

Multi-label Classification. Our model is fine-tuned on the Pascal VOC (Everingham et al., 2010), MS-COCO (Lin et al., 2014), and VisualGenome (Krishna et al., 2016) datasets (\(224 \times 224\) input) for 100 epochs. In Table 10, our method excels on all datasets, showcasing its ability to capture multiple semantics. Moreover, in the linear setting (training only the classification layer), our model surpasses others on the Pascal VOC and Visual Genome datasets while maintaining strong performance on COCO (Table 11, which outlines the results of the multi-label linear evaluation), further emphasizing our method's ability to capture multiple semantics even in the linear scenario.

Table 10 Transfer learning of ViT-S on multi-label datasets with mAP as the evaluation metric

Semantic Segmentation. Our model is evaluated on the Pascal VOC (Everingham et al., 2010) dataset using two distinct settings: linear mode and finetuning mode. In the linear mode, only the linear layer is trained, while keeping the backbone frozen. Conversely, in the finetuning mode, the entire network, including the UPerNet (Xiao et al., 2018) task layer, is trained. UPerNet (Xiao et al., 2018) is a decoder introduced for semantic segmentation, which has been utilized for evaluation on semantic segmentation by other self-supervised methods (Caron et al., 2021; Zhou et al., 2021; He et al., 2021).

The results, outlined in Table 12, unequivocally illustrate the exceptional performance of our approach in the task of image segmentation in both linear and finetuning settings. These findings underscore the effectiveness of our model in capturing intricate details and accurately segmenting objects within images.

Table 11 Linear evaluation on multi-label datasets with mAP as the evaluation metric. All the models are pretrained on ImageNet-1K employing the ViT-S/16 backbone
Table 12 Performance of semantic segmentation on Pascal VOC
Table 13 Ablation study on losses with ViT-S for multi-class (1Img per label ImageNet-1K), low-shot segmentation (Pascal 5i), and multi-label classification (Mini COCO, Pascal VOC 1-Shot)

4.3 Ablation Study

Table 13 summarizes our findings on the reconstruction loss and on clustering at the class level or at both the class and patch levels, assessing low-shot classification and segmentation tasks. Removing the reconstruction loss notably harms performance, especially in 1-shot ImageNet-1K, underscoring its importance for capturing fine-grained information in the low-shot regime. Combining patch-level clustering with class-level clustering further enhances performance. Additionally, we evaluate the impact of using the efficient attention, observing increased throughput with higher mask ratios in Table 14, enabling reduced pretraining time.

4.4 Visualisation

In this section, we provide visualisations to showcase the learning of the patch and class clustering layers.

Class Clustering Visualisations: Figure 4 displays the top 9 images from each class cluster, showcasing unsupervised grouping of visually similar objects. Notably, the class clusters focus on butterflies, cars, cellphones, and fruits.

Patch Clustering Visualisation: We visualize the top 9 patches with the highest confidence for the selected clusters, along with their \(5\times 5\) neighborhood, in Fig. 5. These patterns demonstrate that our network captures both semantic information, such as heads and poles, as well as texture details at patch level.

Table 14 Throughput for different percentage of masking using the proposed efficient attention on RTX 3060 GPU
Fig. 4 Visualisation of the top 9 images from different class clusters. The left-most cluster focuses on butterflies, the next one on cars, the third column shows cellphones and the last one fruits

Fig. 5 The top 9 patches from different patch clusters with their \(5\times 5\) surrounding patches. The left-most patch cluster focuses on insect legs, the next one on poles and the remaining on textures

5 Conclusion

In this study, we examine how different pretext tasks and collapse avoidance strategies impact low-shot performance. Clustering surpasses contrastive learning as an instance discrimination task. Combining reconstruction with instance discrimination, particularly clustering, improves low-shot performance. Further enhancements occur when applying clustering at both the class and patch levels. Based on these insights, we propose a multi-level architecture that uses clustering for global and local semantics, along with reconstruction of masked images. This architecture excels in low-shot downstream tasks and scales to full-dataset fine-tuning across multiple tasks. Future work aims to extend this model's low-shot performance to multi-modal settings.