Abstract
Deep neural networks may be susceptible to learning spurious correlations that hold on average but not in atypical test samples. With the recent emergence of vision transformer (ViT) models, it remains unexplored how spurious correlations manifest in such architectures. In this paper, we systematically investigate the robustness of different transformer architectures to spurious correlations on three challenging benchmark datasets. Our study reveals that for transformers, larger models and more pre-training data significantly improve robustness to spurious correlations. Key to their success is the ability to generalize better from the examples where spurious correlations do not hold. Further, we perform extensive ablations and experiments to understand the role of the self-attention mechanism in providing robustness under spuriously correlated environments. We hope that our work will inspire future research on further understanding the robustness of ViT models to spurious correlations.
Data availability
The datasets used during the current study are available at https://github.com/deeplearning-wisc/vit-spurious-robustness.
Code Availability
The code to reproduce the results in this study is available at https://github.com/deeplearning-wisc/vit-spurious-robustness.
Notes
Refer to Table 7 (Appendix) for the hyper-parameters used in different training schemes.
References
Abnar, S., & Zuidema, W. (2020). Quantifying attention flow in transformers. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 4190–4197). Association for Computational Linguistics.
Bai, Y., Mei, J., Yuille, A., & Xie C. (2021). Are transformers more robust than cnns? In Thirty-fifth conference on neural information processing systems.
Beery, S., Van Horn, G., & Perona, P. (2018). Recognition in terra incognita. In Proceedings of the European conference on computer vision (ECCV) (pp. 456–473).
Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., & Veit, A. (2021). Understanding robustness of transformers for image classification. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10231–10241).
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2019). Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 113–123).
Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2020). Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 702–703).
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 2978–2988).
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 248–255).
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT (pp. 4171–4186).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2019). Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International conference on learning representations.
Goel, K., Gu, A., Li, Y., & Ré C. (2021). Model patching: Closing the subgroup performance gap with data augmentation. In International conference on learning representations.
He, H., Zha, S., & Wang, H. (2019). Unlearn dataset bias in natural language inference by fitting the residual. In Proceedings of the 2nd workshop on deep learning approaches for low-resource NLP DeepLo (pp. 132–142).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Hendrycks, D., & Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In International conference on learning representations.
Hendrycks, D., Lee, K., & Mazeika, M. (2019). Using pre-training can improve model robustness and uncertainty. In International conference on machine learning (pp. 2712–2721). PMLR.
Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., & Song, D. (2021). Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15262–15271.
Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., & Houlsby, N. (2020). Big transfer (BiT): General visual representation learning. In Lecture Notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics).
Liu, H., HaoChen, J. Z., Gaidon, A., & Ma, T. (2021). Self-supervised learning is more robust to dataset imbalance. arXiv preprint arXiv:2110.05025.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012–10022).
Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 3730–3738).
Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11976–11986).
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Liu, W., Wang, X., Owens, J., & Li, Y. (2020). Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33.
Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., & Xue, H. (2022). Towards robust vision transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12042–12051).
McCoy, R. T., Pavlick, E., & Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007.
Ming, Y., Yin, H., & Li, Y. (2022). On the impact of spurious correlation for out-of-distribution detection. In Proceedings of the AAAI conference on artificial intelligence.
Naseer, M., Ranasinghe, K., Khan, S., Hayat, M., Khan, F., & Yang, M. H. (2021). Intriguing properties of vision transformers. In Thirty-fifth conference on neural information processing systems.
Park, N., & Kim, S. (2022). How do vision transformers work? In International conference on learning representations.
Paul, S. & Chen, P. Y. (2022). Vision transformers are robust learners. In Proceedings of the AAAI conference on artificial intelligence (pp. 2071–2081).
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., & Krueger, G. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763). PMLR.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 9.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., & Khosla, A. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211–252.
Sagawa, S., Koh, P. W., Hashimoto, T. B., & Liang P. (2020). Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International conference on learning representations (ICLR).
Sagawa, S., Raghunathan, A., Koh, P. W., & Liang, P. (2020). An investigation of why overparameterization exacerbates spurious correlations. In International conference on machine learning (pp. 8346–8356). PMLR.
Selvaraju, R. R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., & Batra, D. (2016). Grad-CAM: Why did you say that? arXiv preprint arXiv:1611.07450.
Shi, Y., Daunhawer, I., Vogt, J. E., Torr, P. H., & Sanyal, A. (2022). How robust are pre-trained models to distribution shift? arXiv preprint arXiv:2206.08871.
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., & Beyer, L. (2021). How to train your vit? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270.
Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., & Schmidt, L. (2019). When robustness doesn’t promote robustness: synthetic versus natural distribution shifts on imagenet.
Tian, R., Wu, Z., Dai, Q., Hu, H., & Jiang, Y. G. (2022). Deeper insights into vits robustness towards common corruptions. arXiv preprint arXiv:2204.12143.
Touvron, H., Cord, M., & Jégou, H. (2022). DeiT III: Revenge of the ViT. In European conference on computer vision. Springer.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In International conference on machine learning (pp. 10347–10357). PMLR.
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., & Jégou, H. (2021). Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 32–42).
Tu, L., Lalwani, G., Gella, S., & He, H. (2020). An empirical study on robustness to spurious correlations using pre-trained language models. Transactions of the Association for Computational Linguistics, 8, 621–633.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st international conference on neural information processing systems (pp. 6000–6010).
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
Wang, W., Xie, E., Li, X., Fan, D. P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 568–578).
Xue, F., Shi, Z., Wei, F., Lou, Y., Liu, Y., & You, Y. (2021). Go wider instead of deeper. arXiv preprint arXiv:2107.11817.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems.
Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z. H., Tay, F. E., Feng, J., & Yan, S. (2021). Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 558–567).
Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision.
Zeiler, M. D. & Fergus, R. (2014). Visualizing and understanding convolutional networks. In european conference on computer vision (pp. 818–833).
Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2021). Scaling vision transformers. arXiv preprint arXiv:2106.04560.
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). Mixup: Beyond empirical risk minimization. In International conference on learning representations.
Zhang, T., Wu, F., Katiyar, A., Weinberger, K. Q., & Artzi, Y. (2020). Revisiting few-sample bert fine-tuning. arXiv preprint arXiv:2006.05987.
Zhang, C., Zhang, M., Zhang, S., Jin, D., Zhou, Q., Cai, Z., Zhao, H., Liu, X., & Liu, Z. (2022). Delving deep into the generalization of vision transformers under distribution shifts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7277–7286).
Zhou, D., Yu, Z., Xie, E., Xiao, C., Anandkumar, A., Feng, J., & Alvarez J. M. (2022). Understanding the robustness in vision transformers. In International conference on machine learning (pp. 27378–27394). PMLR.
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 1452–1464.
Funding
Research is supported in part by the UL Research Institutes through the Center for Advancing Safety of Machine Intelligence; AFOSR Young Investigator Program under award number FA9550-23-1-0184; National Science Foundation (NSF) Award No. IIS-2237037 & IIS-2331669; and Office of Naval Research under grant number N00014-23-1-2643.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Additional information
Communicated by Kaiyang Zhou.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Implementation Details
1. Transformers. For ViT models, we obtain the pre-trained checkpoints from the timm library.Footnote 1 For downstream fine-tuning on the Waterbirds and CelebA datasets, we scale up the resolution to \(384 \times 384\) by adopting the 2D interpolation of pre-trained position embeddings proposed in Dosovitskiy et al. (2021). Note that for CMNIST we keep the resolution at \(224 \times 224\) during fine-tuning. We fine-tune models using SGD with momentum 0.9 and an initial learning rate of 3e-2. As described in Steiner et al. (2021), we use a fixed batch size of 512, gradient clipping at global norm 1, and a cosine decay learning rate schedule with a linear warmup (a minimal sketch of this configuration is given after this list).
2. BiT. We obtain the pre-trained checkpoints from the official repository.Footnote 2 For downstream fine-tuning, we use SGD with an initial learning rate of 0.003, momentum 0.9, and batch size 512. We fine-tune models of various capacities, including BiT-M-R50x1, BiT-M-R50x3, and BiT-M-R101x1, for 500 steps.
3. Data Augmentation Schemes. For applying the different data augmentation and regularization approaches in Sect. 1, we use the helper functions provided with the timm library. The hyper-parameters used are reported in Table 7.
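The following is a minimal sketch of the fine-tuning configuration described in item 1 above (SGD with momentum 0.9, initial learning rate 3e-2, gradient clipping at global norm 1, and a cosine decay schedule with linear warmup). The toy dataset, step counts, and small batch size are illustrative placeholders rather than the values used in the paper; see Table 7 for the actual hyper-parameters.

```python
import math

import timm
import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: a few random 384x384 images with binary labels.
# (The paper uses Waterbirds/CelebA at this resolution and a batch size of 512.)
data = TensorDataset(torch.randn(8, 3, 384, 384), torch.randint(0, 2, (8,)))
train_loader = DataLoader(data, batch_size=4)

model = timm.create_model("vit_base_patch16_384", pretrained=False, num_classes=2)
optimizer = torch.optim.SGD(model.parameters(), lr=3e-2, momentum=0.9)

num_steps, warmup_steps = 8, 2  # placeholder schedule lengths

def lr_lambda(step):
    # Linear warmup followed by cosine decay.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, num_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

model.train()
for images, labels in train_loader:
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at global norm 1
    optimizer.step()
    scheduler.step()
```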
Appendix B Extension: How Does the Size of the Pre-training Dataset Affect Robustness to Spurious Correlations?
In this section, to further validate our findings on the importance of a large-scale pre-training dataset, we show results on the CelebA (Liu et al., 2015) dataset. We report our findings in Table 8. We observe a similar trend in this setup: larger model capacity and more pre-training data yield a significant improvement in worst-group accuracy. Further, when pre-trained on a relatively small dataset such as ImageNet-1k, both transformer and CNN models perform poorly compared to their ImageNet-21k counterparts.
Also, compared to BiT models, the robustness of transformer models benefits more from a larger pre-training dataset. For example, compared to ImageNet-1k, fine-tuning DeiT-III-Base pre-trained on ImageNet-21k improves the worst-group accuracy by 6.5%. On the other hand, for BiT models, pre-training on the larger dataset yields only marginal improvement. Specifically, BiT-M-R50x3 improves worst-group accuracy by only 1.5% with ImageNet-21k.
Appendix C Spurious Out-of-Distribution Detection
In this section, we study the performance of ViT models in the out-of-distribution (OOD) setting. As introduced in Ming et al. (2022), spurious OOD data are samples that do not contain the invariant features \(\textbf{z}^\textrm{inv}\) essential for accurate classification, but do contain the spurious features \(\textbf{z}^{e}\). Hence, these samples are denoted as \(\textbf{x}_\textrm{ood} = \rho (\textbf{z}^{\bar{y}},\textbf{z}^e)\), where \(\bar{y}\) is an out-of-class label such that \(\bar{y} \not \in \mathcal {Y}\). In the problem of waterbird vs. landbird classification, an image of a person standing in a forest would be an example of spurious OOD, since it contains a different semantic class (person \(\not \in \{\texttt {waterbird}, \texttt {landbird}\}\)) yet has the environmental features of a land background. A non-robust model relying on the background feature may classify such OOD data as an in-distribution class with high confidence. Hence, we aim to understand whether self-attention-based ViT models can mitigate this problem and, if so, to what extent.
To investigate the performance of different models against spurious OOD examples, we use the setup introduced in Ming et al. (2022). Specifically, for Waterbirds (Sagawa et al., 2020) we test on a subset of land and water images sampled from the Places dataset (Zhou et al., 2017). Considering CelebA (Liu et al., 2015) as in-distribution, our test suite consists of images of bald males as spurious OOD, since they contain the environmental feature (gender) without the invariant feature (hair). For CMNIST, the in-distribution data contain digits \(\mathcal {Y}\) = \(\{0,1\}\) and background colors \(\mathcal {E}\) = {red, green, purple, pink}; we use digits {5, 6, 7, 8, 9} with red and green backgrounds as OOD test samples. We report our findings in Table 9. Clearly, ViT models achieve better OOD evaluation metrics than BiT models. Specifically, ViT-B/16 achieves \(+{32}\%\) higher AUROC than BiT-M-R50x3 when Waterbirds (Sagawa et al., 2020) is considered in-distribution.
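For illustration, the sketch below shows one common way of running such an evaluation: score each input with the maximum softmax probability (MSP) and compute the AUROC between in-distribution and spurious-OOD scores. The toy model, the random inputs, and the choice of MSP as the score are assumptions made for exposition; the exact detection score used in the paper is not restated here.

```python
import torch
from sklearn.metrics import roc_auc_score

def msp_score(model, images):
    # Higher score = more confidently in-distribution under the MSP criterion.
    with torch.no_grad():
        probs = torch.softmax(model(images), dim=-1)
    return probs.max(dim=-1).values

# Placeholder binary classifier and random stand-ins for ID / spurious-OOD images.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
id_images = torch.randn(16, 3, 32, 32)
ood_images = torch.randn(16, 3, 32, 32)

scores = torch.cat([msp_score(model, id_images), msp_score(model, ood_images)])
labels = torch.cat([torch.ones(16), torch.zeros(16)])  # 1 = in-distribution
print("AUROC:", roc_auc_score(labels.numpy(), scores.numpy()))
```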
Appendix D Extension: Color Spurious Correlation
To further validate our findings beyond natural background and gender as spurious (i.e., environmental) features, we provide additional experimental results on the ColorMNIST (CMNIST) dataset, in which digits are superimposed on colored backgrounds. Specifically, it contains a spurious correlation between the target label and the background color. Similar to the setup in Ming et al. (2022), we fix the classes \(\mathcal {Y}\) = \(\{0,1\}\) and the background colors \(\mathcal {E}\) = {red, green, purple, pink}. For this study, label \(y=0\) is spuriously correlated with background colors \(\{\texttt {red}, \texttt {purple}\}\), and similarly, label \(y=1\) is spuriously associated with background colors \(\{\texttt {green}, \texttt {pink}\}\). Formally, we have \({\mathbb {P}}(e = \texttt {red} \mid y = 0) = {\mathbb {P}}(e = \texttt {purple} \mid y = 0) = {\mathbb {P}}(e = \texttt {green} \mid y = 1) = {\mathbb {P}}(e = \texttt {pink} \mid y = 1) = 0.45\) and \({\mathbb {P}}(e = \texttt {green} \mid y = 0) = {\mathbb {P}}(e = \texttt {pink} \mid y = 0) = {\mathbb {P}}(e = \texttt {red} \mid y = 1) = {\mathbb {P}}(e = \texttt {purple} \mid y = 1) = 0.05\). Note that, while fine-tuning the models, we fix the foreground color of the digits as \(\texttt {white}\). A minimal sketch of this color-assignment scheme is given below.
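As referenced above, the following sketch implements the stated color-assignment probabilities (0.45 for each of the two correlated colors and 0.05 for each of the remaining two). The digit labels are random placeholders; loading the actual MNIST digits and rendering the backgrounds are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
correlated = {0: ["red", "purple"], 1: ["green", "pink"]}  # colors correlated with each label

def sample_background(y):
    # With probability 0.9 pick one of the two colors correlated with y (0.45 each);
    # otherwise pick one of the two colors correlated with the other label (0.05 each).
    if rng.random() < 0.9:
        return rng.choice(correlated[y])
    return rng.choice(correlated[1 - y])

labels = rng.integers(0, 2, size=10)  # placeholder binary digit labels
backgrounds = [sample_background(int(y)) for y in labels]
print(list(zip(labels.tolist(), backgrounds)))
```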
Results and insights on robustness performance. We compare model predictions on samples with the same class label but different background and foreground colors. Given a data point (\(\textbf{x}_i, y_i\)), we modify the background and foreground color of \(\textbf{x}_i\) randomly to generate a new test image \(\bar{\textbf{x}}_i\) with the constraint that it has the same semantic label. During evaluation, the background color is chosen uniformly at random from the set of colors {#ecf02b, #f06007, #0ff5f1, #573115, #857d0f, #015c24, #ab0067, #fbb7fa, #d1ed95, #0026ff} and the foreground color is selected randomly from the set \(\{\texttt {black},\texttt {white}\}\). The evaluation set consists of 2100 samples, and the reported results are averaged over 50 random runs. Figure 6 depicts the distribution of training samples in the CMNIST dataset (left) and some representative examples after transformation (right).
We report our findings in Fig. 7. Our operating hypothesis is that a robust model should predict the same class label, \(\hat{f}(\textbf{x}_i) = {\hat{f}}(\bar{\textbf{x}}_i)\), for a given pair \((\textbf{x}_i, \bar{\textbf{x}}_i)\), as the two images share exactly the same target label (i.e., the invariant feature is approximately the same). We observe from Fig. 7 that the best model, ViT-B/16, obtains consistent predictions for 100% of image pairs. After extensive experimentation over all combinations, we find that setting the foreground color as \(\texttt {black}\) and the background as \(\texttt {white}\) makes the models most vulnerable: we see a significant decline in model consistency in this setting (indicated as BW) compared to the random setup. A minimal sketch of the consistency measure is given below.
Consistency Measure. Evaluation results quantifying consistency for models of different architectures and varying capacities. We indicate the setup when the foreground color is set as \(\texttt {black}\) and the background as \(\texttt {white}\) using BW (right). Random represents setting both the foreground and background color randomly (left)
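A minimal sketch of the consistency measure referenced above: the fraction of \((\textbf{x}_i, \bar{\textbf{x}}_i)\) pairs on which the model keeps its prediction after recoloring. The toy model and the identity "recoloring" are placeholders for illustration only.

```python
import torch

def consistency(model, images, recolor):
    # Fraction of images whose predicted class is unchanged after recoloring.
    with torch.no_grad():
        preds = model(images).argmax(dim=-1)
        preds_recolored = model(recolor(images)).argmax(dim=-1)
    return (preds == preds_recolored).float().mean().item()

# Toy usage: a placeholder classifier and an identity recoloring (consistency = 1.0).
toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 28 * 28, 2))
images = torch.randn(16, 3, 28, 28)
print(consistency(toy_model, images, recolor=lambda x: x))
```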
Appendix E Visualization
1.1 E.1 Attention Map
In Fig. 8, we visualize attention maps obtained from the ViT-B/16 model for samples from the Waterbirds (Sagawa et al., 2020) and CMNIST datasets. We use Attention Rollout (Abnar & Zuidema, 2020) to obtain the attention matrix (a minimal sketch of the rollout computation is given below). We observe that the model successfully attends to spatial locations representing invariant features while making predictions.
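A minimal sketch of the rollout computation, assuming the per-layer attention maps (one tensor of shape heads × tokens × tokens per layer) have already been collected from the model, e.g., via forward hooks. The random attention maps in the usage example are placeholders with ViT-B/16 shapes (12 layers, 12 heads, 197 tokens).

```python
import torch

def attention_rollout(attentions):
    # Attention Rollout (Abnar & Zuidema, 2020): average over heads, add the residual
    # connection, re-normalize, and multiply attention matrices across layers.
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn in attentions:
        attn = attn.mean(dim=0)                          # average over heads
        attn = 0.5 * attn + 0.5 * torch.eye(num_tokens)  # account for residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)     # re-normalize rows
        rollout = attn @ rollout                         # accumulate across layers
    return rollout

# Toy usage with random attention maps shaped like ViT-B/16.
fake_attn = [torch.softmax(torch.randn(12, 197, 197), dim=-1) for _ in range(12)]
cls_attention = attention_rollout(fake_attn)[0, 1:]  # [CLS] row over the 196 patches
print(cls_attention.reshape(14, 14).shape)           # 14x14 map over image patches
```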
1.2 E.2 The Attention Matrix of CMNIST
In the main text, we provide visualizations in which each image patch, irrespective of its spatial location, provides maximum attention to the patches representing essential cues for accurately identifying the foreground object. In Fig. 9, we show visualizations for ViT-B/16 fine-tuned on the CMNIST dataset to further validate our findings.
Appendix F Extension: Pattern in Attention Matrix
In this section, we provide visualizations of the top-N patches receiving the highest attention values for ViT-Ti/16 (Fig. 10), fine-tuned on the Waterbirds dataset (Sagawa et al., 2020), on various test images.
Takeaways. Figure 10 shows the top-N patches receiving the highest attention values for the ViT-Ti/16 model fine-tuned on the Waterbirds dataset. We observe: (1) The model correctly attends to patches responsible for accurate classification in images belonging to the majority groups, i.e., waterbird on water background and landbird on land background. (2) For images belonging to minority groups (third and fourth rows in Fig. 10), such as waterbird on land background, the model provides maximum attention to the environmental features, exhibiting a lack of robustness to spurious correlations.
Visualization of the top-N patches receiving the highest attention (marked in red) for the ViT-Ti/16 model fine-tuned on the Waterbirds (Sagawa et al., 2020) dataset (Color figure online)
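For concreteness, a short sketch of how the top-N attended patches highlighted in such figures can be extracted from a [CLS]-token attention (or rollout) vector; the random scores, grid size, and N here are illustrative placeholders.

```python
import torch

cls_attention = torch.rand(196)                     # placeholder scores for a 14x14 patch grid
top_n = torch.topk(cls_attention, k=20).indices     # indices of the 20 most-attended patches
rows = torch.div(top_n, 14, rounding_mode="floor")  # patch grid coordinates to highlight
cols = top_n % 14
print(list(zip(rows.tolist(), cols.tolist())))
```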
Appendix G Extension: Results Highlighting the Importance of the Pre-training Dataset on Additional Architectures
In this section, to further validate our findings on the importance of a large-scale pre-training dataset, we show results for additional transformer architectures: RVT (Mao et al., 2022) and PVT (Wang et al., 2021).
Takeaways: We report the results of fine-tuning different model architectures on the Waterbirds (Sagawa et al., 2020) dataset in Table 10. Note that the RVT architecture is specifically modified to provide additional robustness against adversarial attacks, common corruptions, and out-of-distribution inputs. Despite these modifications, RVT models pre-trained on ImageNet-1k perform poorly when fine-tuned on Waterbirds, indicating a lack of robustness to spurious correlations. However, we do notice that RVT displays stronger robustness to spurious correlations than ViT when pre-trained on ImageNet-1k.
Appendix H Experiments on Training Models from Scratch
In Table 11, we train ViT and BiT models from scratch on the Waterbirds (Sagawa et al., 2020) dataset. We observe that without any pre-training, both ViT and BiT models severely overfit the training dataset, attaining 100% training accuracy while failing significantly on test samples. This observation indicates that without pre-training, both transformers and CNNs have a high propensity to memorize training samples (along with their inherent bias).
Appendix I How Do Different Training Configurations Affect Robustness to Spurious Correlations?
In this section, we aim to disentangle the effect of different training configurations on robustness to spurious correlations for DeiT-III (Touvron et al., 2022) models. Unlike ViT (Dosovitskiy et al., 2021), DeiT-III models are pre-trained using strong data augmentations. Further, for ImageNet (Deng et al., 2009) training, Touvron et al. (2021) have shown the positive impact of different data augmentation and regularization approaches on model accuracy. Motivated by these findings, we verify whether applying different training schemes during fine-tuning can further improve model robustness to spurious correlations.
Table 12 reports results for fine-tuning on Waterbirds (Sagawa et al., 2020) using different configurations.Footnote 3 For this experiment, we use the DeiT-III-Small architecture pre-trained on ImageNet-21k. The top row in Table 12 corresponds to the standard fine-tuning setting we use throughout the paper for consistency and fair comparison between architectures. Among data augmentation schemes, we observe that Mixup (Zhang et al., 2018), CutMix (Yun et al., 2019), and RandAugment (Cubuk et al., 2020) provide the most significant improvements in model robustness. Specifically, using RandAugment (Cubuk et al., 2020) during fine-tuning leads to a 4.98% improvement in worst-group accuracy. Surprisingly, we also observe that simultaneously applying too many data augmentation schemes can instead hamper worst-group accuracy. Among regularization schemes, we observe that reducing the weight decay penalty can also provide some improvement in worst-group accuracy.
Next, we investigate the impact of different training schemes on the robustness of convolutional networks to spurious correlations. Specifically, we fine-tune the BiT-R50x1 model on the Waterbirds dataset using different data augmentation approaches. We report results in Table 13. We observe that, irrespective of the pre-training dataset, using AutoAugment (Cubuk et al., 2019) and RandAugment (Cubuk et al., 2020) improves model robustness to spurious correlations. The sketch below illustrates how such augmentation schemes can be applied during fine-tuning.
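The sketch below shows one way such augmentation schemes can be plugged into fine-tuning using the timm helpers, assuming a recent timm version; the specific hyper-parameter values shown are illustrative and not those of Table 7.

```python
import torch
from timm.data import Mixup, create_transform
from timm.loss import SoftTargetCrossEntropy

# RandAugment applied inside the input transform (operates on PIL images in a dataset pipeline).
train_transform = create_transform(
    input_size=384, is_training=True, auto_augment="rand-m9-mstd0.5"
)

# Mixup/CutMix applied on batches of already-transformed images (batch size must be even).
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, num_classes=2)
criterion = SoftTargetCrossEntropy()

images = torch.randn(8, 3, 384, 384)   # placeholder batch
labels = torch.randint(0, 2, (8,))
mixed_images, soft_targets = mixup_fn(images, labels)
# loss = criterion(model(mixed_images), soft_targets)
```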
Appendix J GradCAM Visualizations
In Fig. 11, we show GradCAM (Selvaraju et al., 2016) visualizations of a few representative examples from the Waterbirds (Sagawa et al., 2020) dataset for a BiT-S-R50x1 model. For each input image, we show saliency maps in which warmer colors depict higher saliency. We observe that the model correctly attends to features responsible for accurate classification in images belonging to the majority groups, i.e., waterbird on water background and landbird on land background. However, for the image belonging to the minority group, i.e., waterbird on land background, the model provides maximum attention to the environmental features, exhibiting a lack of robustness to spurious correlations. This visualization directly corroborates our empirical findings in Table 3.
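For reference, a hedged sketch of the Grad-CAM computation on a torchvision ResNet-50, used here only as a stand-in for BiT-S-R50x1; the random input image and the choice of hooked layer are illustrative assumptions.

```python
import torch
import torchvision

model = torchvision.models.resnet50().eval()  # placeholder backbone, no pre-trained weights
store = {}

def fwd_hook(module, inputs, output):
    store["act"] = output                                        # feature maps (1, 2048, 7, 7)
    output.register_hook(lambda grad: store.update(grad=grad))   # their gradients

model.layer4[-1].register_forward_hook(fwd_hook)

image = torch.randn(1, 3, 224, 224)                              # placeholder input image
logits = model(image)
logits[0, logits.argmax()].backward()                            # backprop the top-class score

weights = store["grad"].mean(dim=(2, 3), keepdim=True)           # channel-importance weights
cam = torch.relu((weights * store["act"]).sum(dim=1))            # weighted combination + ReLU
cam = cam / (cam.max() + 1e-8)                                   # normalize to [0, 1]
print(cam.shape)  # (1, 7, 7) saliency map; upsample to image size for overlay
```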
Appendix K Software and Hardware
We run all experiments with Python 3.7.4 and PyTorch 1.9.0 on an Nvidia Quadro RTX 5000 GPU.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ghosal, S.S., Li, Y. Are Vision Transformers Robust to Spurious Correlations?. Int J Comput Vis 132, 689–709 (2024). https://doi.org/10.1007/s11263-023-01916-5