Abstract
Image fusion is a crucial technique in computer vision whose goal is to generate high-quality fused images and to improve the performance of downstream tasks. However, existing fusion methods struggle to balance these two factors: pursuing high visual quality in the fused image can lower performance on downstream visual tasks, and vice versa. To address this drawback, a novel LVM (large vision model)-guided fusion framework with Object-aware and Contextual COntrastive learning, termed OCCO, is proposed. A pre-trained LVM provides semantic guidance, allowing the network to focus solely on the fusion task while emphasizing salient semantic features in the form of contrastive learning. In addition, a novel feature interaction fusion network is designed to resolve the information conflicts in fused images caused by modality differences. By learning to distinguish positive from negative samples in the latent feature space (contextual space), the integrity of target information in the fused image is improved, which in turn benefits downstream performance. Finally, comparisons with eight state-of-the-art methods on four datasets validate the effectiveness of the proposed method, which also demonstrates exceptional performance on downstream visual tasks.
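To make the contrastive-learning idea in the abstract concrete, the sketch below shows a generic InfoNCE-style loss over latent features, in which an anchor representation is pulled toward a positive sample and pushed away from negative samples. This is only an illustrative sketch under assumed names and tensor shapes (`contrastive_loss`, `temperature`, and the (B, K, D) negative layout are not taken from the paper); the object-aware and contextual losses actually used in OCCO are defined in the main text and may differ.

```python
# Minimal, illustrative sketch of an InfoNCE-style contrastive loss over
# latent features. Function and argument names are hypothetical and are not
# taken from the OCCO paper.
import torch
import torch.nn.functional as F


def contrastive_loss(anchor: torch.Tensor,
                     positive: torch.Tensor,
                     negatives: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull the anchor toward its positive sample and away from negatives.

    anchor:    (B, D)    latent features of the fused-image regions
    positive:  (B, D)    features of the corresponding positive samples
    negatives: (B, K, D) features of K negative samples per anchor
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarity between each anchor and its positive: (B, 1)
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)
    # Cosine similarity between each anchor and its K negatives: (B, K)
    neg_sim = torch.einsum('bd,bkd->bk', anchor, negatives)

    # InfoNCE objective: the positive (index 0) should score highest.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```

As a usage example, calling `contrastive_loss` on features extracted from the fused image (anchor), the semantically guided target (positive), and degraded or mismatched variants (negatives) would encourage the fused representation to stay close to the intended semantics, which is the role the contrastive term plays in the abstract's description.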
Data Availability
Data will be made available on request.
Change history
27 June 2025
The acknowledgement has been corrected.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (62202205), the National Key Research and Development Program of China under Grant (2023YFF1105102, 2023YFF1105105), the Fundamental Research Funds for the Central Universities (JUSRP123030), and the National Science Foundation for Distinguished Young Scholars under Grant (62225605).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Communicated by Limin Wang.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, H., Bian, C., Zhang, Z. et al. OCCO: LVM-Guided Infrared and Visible Image Fusion Framework Based on Object-Aware and Contextual Contrastive Learning. Int J Comput Vis 133, 6611–6635 (2025). https://doi.org/10.1007/s11263-025-02507-2
Received:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1007/s11263-025-02507-2