Abstract
Existing test-time prompt tuning (TPT) methods focus on single-modality data, primarily augmenting images and using confidence scores to filter out unreliable augmented views. However, while image generation models can produce visually diverse images, single-modality augmentation still fails to capture the complementary knowledge offered by different modalities. Moreover, we observe that the performance of TPT-based methods drops sharply when the number of augmented images is limited, which is not unusual given the computational cost of generative augmentation. To address these issues, we introduce \(\text {IT}^{3}\text {A}\), a novel test-time adaptation method that uses pre-trained generative models for multi-modal augmentation of each test sample from an unknown new domain. By combining augmented data from pre-trained vision and language models, we strengthen the model's ability to adapt to unseen test data. To ensure that key semantics are retained when generating diverse visual and textual augmentations, we apply cosine similarity filtering between the logits of the augmented images and texts and those of the original test data, which removes spurious augmentations and inadequate combinations. To exploit the diverse augmentations that generative models provide across modalities, we further replace prompt tuning with an adapter for greater flexibility in utilizing text templates. Our experiments on test datasets with distribution shifts and domain gaps show that, in a zero-shot setting, \(\text {IT}^{3}\text {A}\) outperforms state-of-the-art test-time prompt tuning methods with a 5.50% increase in accuracy.
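The cosine similarity filtering described above can be pictured with a minimal sketch, assuming a frozen CLIP-style encoder whose image and class-text features have already been extracted; the function name, tensor shapes, and threshold value below are illustrative assumptions, not the authors' released \(\text {IT}^{3}\text {A}\) implementation.

```python
# Minimal sketch (not the authors' released code): keep only those augmented
# views whose class logits remain close to the logits of the original test image.
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_augmented_views(orig_image_feat, aug_image_feats, class_text_feats,
                           sim_threshold=0.8):
    """orig_image_feat: (D,), aug_image_feats: (N, D), class_text_feats: (C, D)."""
    # Normalize so that dot products are cosine similarities (CLIP-style logits).
    orig = F.normalize(orig_image_feat, dim=-1)
    augs = F.normalize(aug_image_feats, dim=-1)
    texts = F.normalize(class_text_feats, dim=-1)

    orig_logits = orig @ texts.t()        # (C,)  logits of the original test image
    aug_logits = augs @ texts.t()         # (N, C) logits of the augmented views

    # Cosine similarity between each augmented view's logits and the original logits.
    sims = F.cosine_similarity(aug_logits, orig_logits.unsqueeze(0), dim=-1)  # (N,)
    keep = sims >= sim_threshold          # discard spurious augmentations
    return aug_logits[keep], sims
```

In such a scheme, the surviving logits would then be aggregated (e.g. averaged) to adapt the lightweight adapter at test time; the threshold trades augmentation diversity against semantic fidelity to the original sample.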
Data availability
The authors declare that the data supporting the experiments in this study are available within the paper. The code is available at https://github.com/chunmeifeng/DiffTPT.
Acknowledgements
This work was supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-TC-2021-003); the Agency for Science, Technology and Research (A*STAR) through its AME Programmatic Funding Scheme under Project A20H4b0141; the A*STAR Central Research Fund "A Secure and Privacy Preserving AI Platform for Digital Health"; and A*STAR through its RIE2020 Health and Biomedical Sciences (HBMS) Industry Alignment Fund Pre-Positioning (IAF-PP) (grant no. H20C6a0032).
Additional information
Communicated by Long Yang.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Feng, CM., He, Y., Zou, J. et al. Diffusion-Enhanced Test-Time Adaptation with Text and Image Augmentation. Int J Comput Vis 133, 5083–5098 (2025). https://doi.org/10.1007/s11263-025-02412-8