
OCCO: LVM-Guided Infrared and Visible Image Fusion Framework Based on Object-Aware and Contextual Contrastive Learning

International Journal of Computer Vision

This article has been updated

Abstract

Image fusion is a crucial technique in computer vision whose goal is to generate high-quality fused images and to improve the performance of downstream tasks. However, existing fusion methods struggle to balance these two objectives: pursuing high visual quality in the fused image can degrade performance on downstream vision tasks, and vice versa. To address this drawback, a novel LVM (large vision model)-guided fusion framework with Object-aware and Contextual COntrastive learning, termed OCCO, is proposed. A pre-trained LVM provides semantic guidance, allowing the fusion network to focus solely on the fusion task while learning salient semantic features through contrastive learning. In addition, a novel feature interaction fusion network is designed to resolve the information conflicts in fused images caused by modality differences. By learning to distinguish positive samples from negative samples in the latent feature space (contextual space), the integrity of target information in the fused image is improved, which in turn benefits downstream performance. Finally, comparisons with eight state-of-the-art methods on four datasets validate the effectiveness of the proposed method, which also demonstrates exceptional performance on a downstream vision task.
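The contrastive idea summarized above can be illustrated with a minimal sketch. The snippet below shows a generic InfoNCE-style loss that pulls fused-image features toward a positive sample and pushes them away from negative samples in a shared latent (contextual) space; the function and tensor names (contextual_contrastive_loss, fused_feat, pos_feat, neg_feats) and the exact loss form are illustrative assumptions, not the authors' implementation.

# Minimal sketch of an InfoNCE-style contrastive loss in a latent feature space.
# NOT the OCCO implementation; names and loss form are illustrative assumptions.
import torch
import torch.nn.functional as F

def contextual_contrastive_loss(fused_feat, pos_feat, neg_feats, tau=0.1):
    """Pull fused-image features toward the positive sample and push them
    away from negatives in the shared latent (contextual) space.

    fused_feat: (B, D)    features of the fused image (anchor)
    pos_feat:   (B, D)    features of the positive sample
    neg_feats:  (B, K, D) features of K negative samples
    """
    anchor = F.normalize(fused_feat, dim=-1)
    pos = F.normalize(pos_feat, dim=-1)
    neg = F.normalize(neg_feats, dim=-1)

    # Cosine similarities scaled by a temperature tau.
    sim_pos = (anchor * pos).sum(dim=-1, keepdim=True) / tau   # (B, 1)
    sim_neg = torch.einsum('bd,bkd->bk', anchor, neg) / tau    # (B, K)

    # InfoNCE: -log( exp(s+) / (exp(s+) + sum_k exp(s-_k)) ),
    # written as cross-entropy with the positive at index 0.
    logits = torch.cat([sim_pos, sim_neg], dim=1)               # (B, 1+K)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    # Toy usage with random tensors.
    B, K, D = 4, 8, 256
    loss = contextual_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D))
    print(loss.item())

The temperature tau is a standard contrastive-learning hyperparameter; its value here is arbitrary and not taken from the paper.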



Data Availability

Data will be made available on request.

Change history

  • 27 June 2025

    The acknowledgment has been corrected


Acknowledgements

This work was supported by the National Natural Science Foundation of China (62202205), the National Key Research and Development Program of China under Grant (2023YFF1105102, 2023YFF1105105), the Fundamental Research Funds for the Central Universities (JUSRP123030), and the National Science Foundation for Distinguished Young Scholars under Grant (62225605).

Author information


Corresponding author

Correspondence to Hui Li.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Communicated by Limin Wang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, H., Bian, C., Zhang, Z. et al. OCCO: LVM-Guided Infrared and Visible Image Fusion Framework Based on Object-Aware and Contextual Contrastive Learning. Int J Comput Vis 133, 6611–6635 (2025). https://doi.org/10.1007/s11263-025-02507-2


  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • Issue date:

  • DOI: https://doi.org/10.1007/s11263-025-02507-2

Keywords
