
Task Bias in Contrastive Vision-Language Models

International Journal of Computer Vision

Abstract

Incidental supervision from language has become a popular approach for learning generic visual representations that can be prompted to perform many recognition tasks in computer vision. We conduct an in-depth exploration of the CLIP model and show that its visual representation is often strongly biased towards solving some tasks more than others. Moreover, which task the representation will be biased towards is unpredictable, with little consistency across images. To resolve this task bias, we show how to learn a ‘task guidance token’ that can be appended to the input to prompt the representation towards features relevant to the task of interest. Our results show that this task guidance can be independent of the input image and still effectively provide a conditioning mechanism to steer visual representations towards the desired task.




Data availability

Data will be made available upon request.

References

  • Abnar, S., & Zuidema, W. (2020). Quantifying Attention Flow in Transformers. arXiv:2005.00928 [cs]. Accessed 2021-11-23.

  • Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., & Simonyan, K. (2022). Flamingo: A Visual Language Model for Few-Shot Learning. https://doi.org/10.48550/ARXIV.2204.14198.

  • Bahng, H., Jahanian, A., Sankaranarayanan, S., & Isola, P. (2022). Exploring visual prompts for adapting large-scale models. arXiv:2203.17274 [cs]. Accessed 2022-08-16

  • Benenson, R., Popov, S., & Ferrari, V. (2019). Large-scale interactive object segmentation with human annotators. arXiv:1903.10830 [cs]. Accessed 2021-11-17.

  • Bradski, G. (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools.

  • Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language models are few-shot learners. arXiv:2005.14165 [cs]. Accessed 2021-11-16.

  • Crowson, K., Biderman, S., Kornis, D., Stander, D., Hallahan, E., Castricato, L., & Raff, E. (2022). VQGAN-CLIP: Open domain image generation and editing with natural language guidance. arXiv. https://doi.org/10.48550/ARXIV.2204.08583. arXiv:2204.08583

  • Gadre, S. Y., Wortsman, M., Ilharco, G., Schmidt, L., & Song, S. (2022). CLIP on wheels: Zero-shot object navigation as object localization and exploration. https://doi.org/10.48550/ARXIV.2203.10421.

  • Gal, R., Patashnik, O., Maron, H., Chechik, G., & Cohen-Or, D. (2021). StyleGAN-NADA: CLIP-guided domain adaptation of image generators. arXiv:2108.00946 [cs]. Accessed 2021-11-17.

  • Gildenblat, J. (2021). Explainability for vision transformers (in PyTorch). original-date: 2020-12-29T11:27:52Z. https://github.com/jacobgil/vit-explain Accessed 2021-11-23

  • Goh, G., Carter, S., Petrov, M., Schubert, L., Radford, A., & Olah, C. (2021). Multimodal neurons in artificial neural networks. Distill, 6(3), 30. https://doi.org/10.23915/distill.00030

  • Ha, H., & Song, S. (2022). Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models. https://doi.org/10.48550/ARXIV.2207.11514. arXiv:2207.11514 [cs].

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv:1512.03385 [cs]. Accessed 2021-05-23.

  • He, Y., Zheng, H. S., Tay, Y., Gupta, J., Du, Y., Aribandi, V., Zhao, Z., Li, Y., Chen, Z., Metzler, D., Cheng, H.-T., & Chi, E. H. (2022). HyperPrompt: Prompt-based Task-Conditioning of Transformers. arXiv (2022). https://doi.org/10.48550/ARXIV.2203.00759. arXiv:2203.00759

  • Ilharco, G., Wortsman, M., Gadre, S. Y., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., & Schmidt, L. (2022). Patching open-vocabulary models by interpolating weights. https://doi.org/10.48550/ARXIV.2208.05592.

  • Jia, M., Tang, L., Chen, B. -C., Cardie, C., Belongie, S., Hariharan, B., & Lim, S. -N. (2022). Visual prompt tuning. arXiv. https://doi.org/10.48550/ARXIV.2203.12119. arXiv:2203.12119

  • Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q.V., Sung, Y., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. arXiv:2102.05918 [cs]. Accessed 2021-11-16

  • Khandelwal, A., Weihs, L., Mottaghi, R., & Kembhavi, A. (2021). Simple but effective: CLIP embeddings for embodied AI. CoRR arXiv:2111.09888

  • Kingma, D.P., & Ba, J. (2017). Adam: A method for stochastic optimization. arXiv:1412.6980 [cs]. Accessed 2021-11-23.

  • Krylov, I., Nosov, S., & Sovrasov, V. (2021). Open images V5 text annotation and yet another mask text spotter. arXiv:2106.12326 [cs]. Accessed 2021-11-17.

  • Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., Duerig, T., & Ferrari, V. (2020). The open images dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7), 1956–1981. https://doi.org/10.1007/s11263-020-01316-z

  • Lee, D., Kim, J., Choi, J., Kim, J., Byeon, M., Baek, W., & Kim, S. (2022) Karlo-v1.0.alpha on COYO-100M and CC15M. https://github.com/kakaobrain/karlo.

  • Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv:2107.13586 [cs]. Accessed 2021-08-27.

  • Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., & Li, T. (2021). CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval.

  • Majumdar, A., Aggarwal, G., Devnani, B., Hoffman, J., & Batra, D. (2022). ZSON: zero-shot object-goal navigation using multimodal goal embeddings. https://doi.org/10.48550/ARXIV.2206.12403. arXiv:2206.12403 [cs].

  • Materzynska, J., Torralba, A., & Bau, D. (2022). Disentangling visual and written concepts in CLIP. arXiv. https://doi.org/10.48550/ARXIV.2206.07835. arXiv:2206.07835

  • Moore, B. E., & Corso, J. J. (2020) Fiftyone. GitHub. Note: https://github.com/voxel51/fiftyone.

  • Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P. H. S., & Dokania, P. K. (2020). Calibrating Deep Neural Networks using Focal Loss. arXiv:2002.09437 [cs, stat]. Accessed 2020-11-04.

  • Nado, Z., Band, N., Collier, M., Djolonga, J., Dusenberry, M. W., Farquhar, S., Filos, A., Havasi, M., Jenatton, R., Jerfel, G., Liu, J., Mariet, Z., Nixon, J., Padhy, S., Ren, J., Rudner, T. G. J., Wen, Y., Wenzel, F., Murphy, K., Sculley, D., Lakshminarayanan, B., Snoek, J., Gal, Y., & Tran, D. (2021). Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning. arXiv:2106.04015 [cs]. Accessed 2021-11-17.

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. arXiv:2103.00020 [cs]. Accessed 2021-11-14.

  • Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. https://doi.org/10.48550/ARXIV.2204.06125. arXiv:2204.06125

  • Rudin, C., Chen, C., Chen, Z., Huang, H., Semenova, L., & Zhong, C. (2021). Interpretable machine learning: Fundamental principles and 10 Grand Challenges. arXiv:2103.11251 [cs, stat]. Accessed 2021-11-17.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv:1409.0575 [cs]. Accessed 2021-11-23.

  • Salman, H., Ilyas, A., Engstrom, L., Vemprala, S., Madry, A., & Kapoor, A. (2021). Unadversarial Examples: Designing Objects for Robust Vision. In: NeurIPS.

  • Schick, T., & Schütze, H. (2021). True few-shot learning with prompts - A real-world perspective. CoRR arXiv:2111.13440

  • Shridhar, M., Manuelli, L., & Fox, D. (2021). CLIPort: What and Where Pathways for Robotic Manipulation. arXiv:2109.12098 [cs]. Accessed 2021-11-17.

  • Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., & Kiela, D. (2021). FLAVA: A foundational language and vision alignment model. CoRR arXiv:2112.04482

  • Yuan, L., Chen, D., Chen, Y., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., Liu, C., Liu, M., Liu, Z., Lu, Y., Shi, Y., Wang, L., Wang, J., Xiao, B., Xiao, Z., Yang, J., Zeng, M., Zhou, L., & Zhang, P. (2021). Florence: A new foundation model for computer vision. CoRR arXiv:2111.11432

  • Zamir, A., Sax, A., Shen, W., Guibas, L., Malik, J., & Savarese, S. (2018). Taskonomy: disentangling task transfer learning. arXiv:1804.08328 [cs]. Accessed 2020-11-04.


Author information

Corresponding authors

Correspondence to Sachit Menon or Ishaan Preetam Chandratreya.

Ethics declarations

Conflict of interest

None

Additional information

Communicated by Kaiyang Zhou.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Task Bias Probing - Dataset Construction

We constructed the datasets for the task bias probing experiments by combining a subset of the OpenImages-V6 dataset, which contains information about objects, attributes, and relationships with other objects (Kuznetsova et al., 2020; Benenson et al., 2019), with independently annotated scene text labels associated with a subset of the OpenImages-V5 dataset (Krylov et al., 2021). To piece this dataset together, we downloaded OpenImages using the recommended FiftyOne library (Moore & Corso, 2020) and combined images from all of the splits in the FiftyOne installation. We report results on four different comparisons: Objects v. Scene Text, People v. Actions, Objects v. Actions, and Scene Text v. Actions. Below, we outline the pairwise dataset creation process for each one of these.

Objects v. Scene Text: We consider the intersection of images with scene text labels and images with object detection labels. Since the set of images with scene text labels is much smaller than OpenImages as a whole, we first filter for images that have an associated scene text label from the independent scene text dataset, and then filter this smaller set for images that contain object detection labels, retaining all possible object labels. Images often contain more than one object, but the goal of image classification for object recognition is to identify the most significant one, constraining the task to a single label. As in the ImageNet dataset (Russakovsky et al., 2015), we take this label to be the most salient object in the scene. For every object with a detection label in an image, we therefore calculate the area of the associated bounding box and keep only the object label with the maximum area for a given image ID as the label for the object recognition task. (We consider detection labels instead of categorization labels for this reason; categorization labels do not give us any sense of how significant a given label is.) Finally, we replace all gendered object labels with the generic human label ‘Person’. This leaves us with 175,335 images, each paired with a single scene text label and a single object label.
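To make this selection procedure concrete, below is a minimal sketch of the max-area label selection using the FiftyOne OpenImages loader; the scene text lookup table, the download arguments, and the matching of image IDs by filename are illustrative assumptions rather than our released code.

    import os
    import fiftyone.zoo as foz

    GENDERED = {"Man", "Woman", "Boy", "Girl"}  # remapped to the generic label "Person"

    # Load OpenImages detections via FiftyOne (large download); splits are combined as above.
    dataset = foz.load_zoo_dataset(
        "open-images-v6", splits=["train", "validation", "test"], label_types=["detections"]
    )

    # Hypothetical lookup: OpenImages image ID -> scene text string, from the
    # independent scene text annotations (Krylov et al., 2021), loaded separately.
    scene_text_labels = {}

    def most_salient_label(sample):
        """Return the detection label whose bounding box covers the largest area."""
        if sample["detections"] is None:
            return None
        best_label, best_area = None, -1.0
        for det in sample["detections"].detections:
            # bounding_box is [top-left-x, top-left-y, width, height] in relative coordinates
            _, _, w, h = det.bounding_box
            if w * h > best_area:
                best_label, best_area = det.label, w * h
        return "Person" if best_label in GENDERED else best_label

    object_labels = {}
    for sample in dataset:
        image_id = os.path.splitext(os.path.basename(sample.filepath))[0]
        if image_id not in scene_text_labels:
            continue  # keep only images that also carry a scene text label
        label = most_salient_label(sample)
        if label is not None:
            object_labels[image_id] = label  # single object label per image ID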

People v. Actions: For this comparison, we consider all images with action annotations in OpenImages. (All such images must contain people by definition of the action recognition task, so we do not need to consider an intersection.) As of V6, OpenImages does have human action annotations (Kuznetsova et al., 2020); unfortunately, there is no field for ‘actions’ in the dataset.

Instead, actions are distributed across a couple of different fields. Images that contain actions form a subset of those labelled with ‘relationships’ in OpenImages. Every relationship instance is defined as a 3-tuple of the form (first label, relationship label, second label). The relationship label is a general term for the word that connects the first and second labels and can take various forms. For example, if ‘is’ is the relationship label, the second label can be an attribute of the first label. Alternatively, the relationship label could be an unconjugated verb, in which case the second label is another object. For all the relationships available, we first filter the entire dataset to only include those where the first label refers to a person-related class (in OpenImages, these are ‘Boy’, ‘Girl’, ‘Man’, ‘Woman’ and ‘Person’) to ensure action recognition is a valid task on the datapoints. We then isolate those pairs where the relationship label directly defines an action (e.g. ‘read’, ‘dance’) and those where the relationship label is generic (e.g. ‘is’) and followed by a verb-like attribute (e.g. ‘Cry’, ‘Jump’). In the former case, we take the relationship label as the action label; in the latter case, we use the second label as the action. We conjugate all verbs into their present continuous form. This process leaves us with the first label (which is guaranteed to be a person) and an associated action label for a set of images. We then remove duplicates and sample from each action to correct for label imbalance. Our final dataset contains 89,626 distinct images.
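The triple-to-action-label logic described above can be sketched as follows; the person classes are those listed in the text, while the sets of generic relationships and verb-like attributes and the conjugation helper are simplified placeholders.

    # Turn one OpenImages relationship triple into an action label (or None).
    PERSON_CLASSES = {"Boy", "Girl", "Man", "Woman", "Person"}
    GENERIC_RELATIONSHIPS = {"is"}          # relationship labels that do not name an action
    VERB_LIKE_ATTRIBUTES = {"Cry", "Jump"}  # verb-like second labels (subset shown)

    def conjugate_present_continuous(verb):
        # Crude heuristic; the real mapping is a small hand-written lookup table.
        verb = verb.lower()
        if verb.endswith("e") and not verb.endswith("ee"):
            return verb[:-1] + "ing"
        return verb + "ing"

    def action_label(first_label, relationship_label, second_label):
        if first_label not in PERSON_CLASSES:
            return None  # action recognition requires a person as the subject
        if relationship_label not in GENERIC_RELATIONSHIPS:
            # e.g. ("Man", "read", "Book") -> "reading"
            return conjugate_present_continuous(relationship_label)
        if second_label in VERB_LIKE_ATTRIBUTES:
            # e.g. ("Woman", "is", "Jump") -> "jumping"
            return conjugate_present_continuous(second_label)
        return None  # attribute relationships such as ("Table", "is", "Wooden")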

Objects v. Actions: We obtain the set of images containing actions per the previous section. Among these, we consider images that also have object detection labels, in order to obtain images where people interact with another object through the action. We further discard any cases where the second label in the relationship 3-tuple is also a person, leaving us with 8,611 images with paired action and inanimate object labels.

Scene Text v. Actions: We again consider the subset of images with valid action labels, this time taking the subset of images that also contain scene text. Our final dataset contains 56,027 images, each paired with both a scene text and an action label.

When the datasets are used for training task guidance tokens and evaluation, 90% of the data is used for training and the remaining 10% is used as a held-out set.

All further dataset details, including the relevant percentages of each label, can be found with the code release. Labels and label distributions vary per paired dataset considered. The superset of our object labels across experiments includes all OpenImages objects. The scene text is extracted with OCR, meaning it comprises an open-set vocabulary with unique text per pair. The full set of actions we extract from the original OpenImages data is:

‘singing’, ‘walking’, ‘riding’, ‘laying’, ‘talking’, ‘surfing’, ‘holding hands’, ‘reading’, ‘dancing’, ‘running’, ‘hugging’, ‘skiing’, ‘playing’, ‘sitting’, ‘shaking hands’, ‘standing’, ‘skateboarding’, ‘kissing’, ‘crying’, ‘jumping’, ‘eating’, ‘drinking’, ‘hitting’, ‘kicking’, ‘cutting’, ‘high-fiving’, ‘catch’, ‘throwing’, ‘snowboarding’, ‘talking on phone’

Appendix B: Task Bias - Task-Directed Text Prompt Modification Details

This section provides details for the experiment from Section 3.3. The goal of the experiment is to determine whether a text prompt, added as a prefix to the text choices, can guide the zero-shot classification procedure towards solving a desired task. In Table 2, we list the prompts used and the associated intended task. We arrived at these prompts starting from the original prompts shown to improve baseline performance in CLIP (Radford et al., 2021), which were reused by ALIGN (Jia et al., 2021) and subsequent models. The primary consideration for designing these prompts was that they must provide sufficient information for a human to understand which task is intended. We experimented with various prompts fulfilling this criterion, choosing the best among them. (This makes it especially surprising that, for some experiments, the clarifying additional text information actually results in substantially worse performance.)

Table 2 Tasks we investigated and the associated prefixes we attach to form our text prompts
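For concreteness, below is a minimal sketch of zero-shot classification with a task-directed prefix using the OpenAI clip package; the prefix shown is a generic placeholder, and the actual prefixes we evaluate are those in Table 2.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def zero_shot_predict(image_path, candidate_labels, prefix="a photo of "):
        """Score each (prefix + label) text against the image and return the best label."""
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        text = clip.tokenize([prefix + label for label in candidate_labels]).to(device)
        with torch.no_grad():
            image_features = model.encode_image(image)
            text_features = model.encode_text(text)
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)
            sims = (image_features @ text_features.T).squeeze(0)
        return candidate_labels[int(sims.argmax())], sims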

Appendix C: Task Bias – Task Disambiguation Baselines

A question preliminary to whether we can resolve task bias is whether we can detect the direction of the task bias in the visual representation solely from the image embedding. In this section, we investigate the effectiveness of using the full attention mask and the input image for guessing the task bias on an input. We provide preliminary baselines for the difficult task of predicting which task a zero-shot model is solving for the Objects v. Scene Text and Actions v. Scene Text pairs.

Dataset: The results from our task bias probing experiments give us the per label task preference for every image in our dataset. We use these pseudo-labels as indicators of CLIP’s task bias on the particular image considered. Therefore, given an image that is part of a pairwise dataset as an input, we repurpose the index of CLIP’s preferred task for that image as a label for training the classifier, ensuring that the final test set that we report results on is near-balanced.
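The labeling rule can be sketched for a single image as follows, assuming L2-normalized CLIP embeddings and that the preference is read off as whichever task contributes the highest-scoring label; this illustrates the rule rather than reproducing the exact probing code.

    import torch

    def task_preference(image_feat, task_a_text_feats, task_b_text_feats):
        """image_feat: [D]; *_text_feats: [N_a, D] and [N_b, D], all L2-normalized.

        Returns 0 if CLIP's best-scoring label comes from task A, 1 if from task B.
        """
        best_a = (task_a_text_feats @ image_feat).max()
        best_b = (task_b_text_feats @ image_feat).max()
        return 0 if best_a >= best_b else 1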

Table 3 Results for various classifiers on the Objects v. Scene Text task bias classification task
Table 4 Results for various classifiers on the Actions v. Scene Text task bias classification task
Table 5 Percent change in decisions made towards the intended task (resolving task bias) on the test splits of each of our datasets using CoOp tokens
Table 6 Percent change in decisions made towards the intended task (resolving task bias) on the test splits of each of our datasets with the prompted visual token added at different locations in the input visual token sequence. We try two randomly selected locations and see no discernible difference in performance

Methods and Results: We train four different kinds of classifiers for the two paired datasets. These classifiers differ in their architecture and input space, but in each case the set of labels remains the same. Our results are summarized in Tables 3 and 4, reporting the accuracy of the best trained classifier on the test set.

Frequent refers to the typical baseline in binary classification that always predicts the label which occurs more frequently inside the test set.

Image refers to using the 3-channel input RGB image directly as the input for the classifier. The model used is a ResNet-18 (He et al., 2015) without pre-training.

Image+Attention Overlay refers to overlaying the scaled self-attention map from CLIP’s image encoder on the image and pre-processing the result as a new image. We use the method from Abnar and Zuidema (2020) and Gildenblat (2021), termed ‘attention rollout’, to calculate the self-attention maps, modifying it for CLIP’s architecture. We also experiment with other forms of self-attention maps, including using the final layer only, and observe similar results for those. Once we have the attention map, we normalize it by its maximum value, scale it to the image shape, and colorize it using OpenCV’s JET colormap (Bradski, 2000) to turn it into an RGB image. Finally, we add this to the RGB image and use the result as the input. The model used is also a ResNet-18 without pre-training.
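The overlay step can be sketched with OpenCV as follows; the rollout computation itself is omitted, and the saturating add used to combine image and heatmap is an assumption about the exact blending.

    import cv2
    import numpy as np

    def overlay_attention(image_bgr, attn_map):
        """image_bgr: HxWx3 uint8 image; attn_map: hxw float attention-rollout map."""
        attn = attn_map.astype(np.float32)
        attn = attn / (attn.max() + 1e-8)                                  # normalize by maximum
        attn = cv2.resize(attn, (image_bgr.shape[1], image_bgr.shape[0]))  # scale to image shape
        heatmap = cv2.applyColorMap((attn * 255).astype(np.uint8), cv2.COLORMAP_JET)
        return cv2.add(image_bgr, heatmap)                                 # add heatmap to the image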

Embedding refers to classifying directly on top of CLIP’s 512-dimensional representation of the image. Note that this is a function of the self-attention and the input image, given the structure of transformers. The model used is a shallow 4-layer MLP with layer sizes [256, 128, 64, 2].

Embedding+Image+Attention refers to classifying from both the input image and the embedding. Note that we do not necessarily expect this to work better, as it gives the classifier redundant information in different forms: the embedding itself is a function of the self-attention and the original image. For the model, we use a ResNet-18 backbone with a single linear layer to produce a 256-dimensional representation. We further use a single linear layer to project CLIP’s image embedding to 256 dimensions. These are then fused and passed through the same MLP used for the Embedding-only classifier (a sketch of these classifiers follows the training details below).

All models are trained end-to-end with the Adam optimizer (Kingma & Ba, 2017), with learning rate 0.0001 and other hyperparameters set to their default values.
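Below is a sketch of the Embedding and Embedding+Image+Attention classifiers and their training setup; the layer sizes and learning rate follow the text, while the activations and fusion by concatenation are assumptions.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class EmbeddingMLP(nn.Module):
        """Shallow 4-layer MLP with layer sizes [256, 128, 64, 2] on a 512-d input."""
        def __init__(self, in_dim=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 256), nn.ReLU(),
                nn.Linear(256, 128), nn.ReLU(),
                nn.Linear(128, 64), nn.ReLU(),
                nn.Linear(64, 2),
            )

        def forward(self, emb):
            return self.net(emb)

    class FusionClassifier(nn.Module):
        """ResNet-18 on the (overlaid) image, fused with a projected CLIP embedding."""
        def __init__(self):
            super().__init__()
            backbone = resnet18(weights=None)                 # no pre-training
            backbone.fc = nn.Linear(backbone.fc.in_features, 256)
            self.backbone = backbone
            self.project = nn.Linear(512, 256)                # CLIP embedding -> 256-d
            self.head = EmbeddingMLP(in_dim=512)              # same MLP as the Embedding classifier
            # Fusion by concatenation (256 + 256 = 512) is an assumption; the text says "fused".

        def forward(self, image, clip_emb):
            fused = torch.cat([self.backbone(image), self.project(clip_emb)], dim=1)
            return self.head(fused)

    model = FusionClassifier()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr 0.0001, other defaults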

Fig. 10 Difference between attention maps under the two task-directed prompts on in-the-wild images. Best viewed with zoom. One prompt results in more attention on the people and the other in more attention on the objects around them

The takeaway from these experiments provides some of the motivation for our method: while predicting the direction of task bias solely from the image is close to chance, it is much easier to predict it from the embedding.

See Tables 3 and 4.

Appendix D: Text-Guided Prompting (CoOp)

Much like visual prompting, text-guided prompting towards the same objective in Equation 1 can also successfully perform task disambiguation by our given metric. We choose to focus on visual prompting in the main paper, as our goal is to disambiguate the visual representations (which may also be used for tasks other than recognition). Here we show results for training towards our objective with learned text prompting via CoOp. The results are similar to those for visual prompting. It is also interesting to contrast these results with those for manual text prompting: learned text prompting with the objective we define can perform task disambiguation much better than manual prompting.

See Tables 5 and 6.
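For reference, below is a simplified sketch of CoOp-style learned text prompting with a frozen CLIP text encoder: a few continuous context vectors replace placeholder “X” tokens in front of each class name and are optimized through the frozen encoder towards our objective. The class names and initialization are placeholders, and this is an illustration of the mechanism rather than the exact implementation.

    import torch
    import torch.nn as nn
    import clip

    device = "cpu"
    model, _ = clip.load("ViT-B/32", device=device)
    for p in model.parameters():
        p.requires_grad_(False)                               # text encoder stays frozen

    n_ctx = 4
    classnames = ["dog", "stop sign"]                         # placeholder label set
    prompts = [" ".join(["X"] * n_ctx) + " " + c + "." for c in classnames]
    tokenized = torch.cat([clip.tokenize(p) for p in prompts]).to(device)

    with torch.no_grad():
        embedded = model.token_embedding(tokenized)           # [n_cls, 77, width]

    ctx = nn.Parameter(torch.randn(n_ctx, embedded.shape[-1]) * 0.02)  # learned context vectors

    def encode_text_with_ctx():
        prefix = embedded[:, :1, :]                           # start-of-text embedding
        suffix = embedded[:, 1 + n_ctx:, :]                   # class tokens, ".", EOT, padding
        x = torch.cat([prefix, ctx.unsqueeze(0).expand(len(classnames), -1, -1), suffix], dim=1)
        x = x + model.positional_embedding
        x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
        x = model.ln_final(x)
        eot = tokenized.argmax(dim=-1)                        # end-of-text token position
        x = x[torch.arange(x.shape[0]), eot] @ model.text_projection
        return x / x.norm(dim=-1, keepdim=True)               # text features trained against images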

Appendix E: Position of Learned Token

Here we examine whether the position of the learned visual token has any effect on the ultimate performance for task disambiguation. We find no difference between different placements of the token.
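Concretely, placing the learned token at an arbitrary position in the visual token sequence amounts to a single concatenation, sketched below; handling of positional embeddings and the rest of the ViT forward pass is omitted.

    import torch
    import torch.nn as nn

    def insert_task_token(patch_tokens, task_token, position):
        """patch_tokens: [B, N, D] visual token sequence; task_token: [D] learned parameter."""
        B, _, D = patch_tokens.shape
        tok = task_token.view(1, 1, D).expand(B, -1, -1)
        return torch.cat([patch_tokens[:, :position], tok, patch_tokens[:, position:]], dim=1)

    # Example with arbitrary positions (ViT-B/32: 1 class token + 49 patch tokens, width 768)
    task_token = nn.Parameter(torch.zeros(768))
    x = torch.randn(4, 50, 768)
    assert insert_task_token(x, task_token, 5).shape == (4, 51, 768)
    assert insert_task_token(x, task_token, 30).shape == (4, 51, 768)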

Appendix F: Further Attention Examples

Here we provide some additional examples of the differences between attention maps for prompting in the directions of different tasks, displayed in Fig. 10. In particular, here we highlight actions, which we did not have room for in the main paper. Note that these examples are all ‘in-the-wild’, not from our dataset.

Appendix G: Hyperparameters and Training Details

We follow Jia et al. (2022) and Bahng et al. (2022) for all hyperparameter settings, and find we do not need to modify any values. In particular, from Bahng et al. (2022) we employ SGD with a learning rate of 40 and momentum of 0.9, without weight decay or gradient scaling. While this learning rate seems high relative to standard neural network training values, we adopt it directly from Bahng et al. (2022), and find, similarly to Jia et al. (2022, Table 6), that a higher learning rate succeeds in this setting. We use a cosine scheduler with a warmup of 1000 steps. We use the original CLIP models from OpenAI (Radford et al., 2021) with the backbones reported in Table 1.
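A sketch of this optimization setup follows; the token shape and total step count are placeholders, and the warmup-then-cosine lambda is one common way to implement such a schedule rather than necessarily the exact one used.

    import math
    import torch

    prompt_params = [torch.nn.Parameter(torch.zeros(768))]   # learned visual token (placeholder shape)
    total_steps = 50_000                                      # placeholder; depends on dataset size
    warmup_steps = 1000

    optimizer = torch.optim.SGD(prompt_params, lr=40, momentum=0.9, weight_decay=0)

    def warmup_cosine(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))     # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
    # In the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()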

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Menon, S., Chandratreya, I.P. & Vondrick, C. Task Bias in Contrastive Vision-Language Models. Int J Comput Vis 132, 2026–2040 (2024). https://doi.org/10.1007/s11263-023-01945-0
