Abstract
Image fusion is a crucial technique in computer vision whose goal is to generate high-quality fused images and to improve the performance of downstream tasks. However, existing fusion methods struggle to balance these two factors: pursuing high visual quality in the fused image can lower performance on downstream visual tasks, and vice versa. To address this drawback, a novel LVM (large vision model)-guided fusion framework with Object-aware and Contextual COntrastive learning, termed OCCO, is proposed. A pre-trained LVM provides semantic guidance, allowing the network to focus solely on the fusion task while emphasizing salient semantic features in the form of contrastive learning. In addition, a novel feature interaction fusion network is designed to resolve the information conflicts in fused images caused by modality differences. By learning to distinguish positive from negative samples in the latent feature space (contextual space), the integrity of target information in the fused image is improved, which in turn benefits downstream performance. Finally, comparisons with eight state-of-the-art methods on four datasets validate the effectiveness of the proposed method, which also demonstrates exceptional performance on downstream visual tasks.
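To make the contrastive-learning idea in the abstract concrete, the sketch below shows a generic InfoNCE-style loss over latent features, in which an anchor representation is pulled toward a positive sample and pushed away from negative samples. This is only an illustrative sketch under assumed names and tensor shapes (`contrastive_loss`, `temperature`, and the (B, K, D) negative layout are not taken from the paper); the object-aware and contextual losses actually used in OCCO are defined in the main text and may differ.

```python
# Minimal, illustrative sketch of an InfoNCE-style contrastive loss over
# latent features. Function and argument names are hypothetical and are not
# taken from the OCCO paper.
import torch
import torch.nn.functional as F


def contrastive_loss(anchor: torch.Tensor,
                     positive: torch.Tensor,
                     negatives: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull the anchor toward its positive sample and away from negatives.

    anchor:    (B, D)    latent features of the fused-image regions
    positive:  (B, D)    features of the corresponding positive samples
    negatives: (B, K, D) features of K negative samples per anchor
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarity between each anchor and its positive: (B, 1)
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)
    # Cosine similarity between each anchor and its K negatives: (B, K)
    neg_sim = torch.einsum('bd,bkd->bk', anchor, negatives)

    # InfoNCE objective: the positive (index 0) should score highest.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```

As a usage example, calling `contrastive_loss` on features extracted from the fused image (anchor), the semantically guided target (positive), and degraded or mismatched variants (negatives) would encourage the fused representation to stay close to the intended semantics, which is the role the contrastive term plays in the abstract's description.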
Data Availability
Data will be made available on request.
Change history
27 June 2025
The acknowledgement has been corrected.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (62202205), the National Key Research and Development Program of China under Grant (2023YFF1105102, 2023YFF1105105), the Fundamental Research Funds for the Central Universities (JUSRP123030), and the National Science Foundation for Distinguished Young Scholars under Grant (62225605).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Communicated by Limin Wang.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, H., Bian, C., Zhang, Z. et al. OCCO: LVM-Guided Infrared and Visible Image Fusion Framework Based on Object-Aware and Contextual Contrastive Learning. Int J Comput Vis 133, 6611–6635 (2025). https://doi.org/10.1007/s11263-025-02507-2
Received:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1007/s11263-025-02507-2