Abstract
Current RGBT tracking research relies on complete multi-modality input, but modality information may be missing in practice due to factors such as thermal sensor self-calibration and data transmission errors, which we call the modality-missing challenge in this work. To address this challenge, we propose a novel invertible prompt learning approach for robust RGBT tracking, which integrates content-preserving prompts into a well-trained tracking model to adapt it to various modality-missing scenarios. Given a modality-missing scenario, we propose to utilize the available modality to generate the prompt of the missing modality, so that the RGBT tracking model can still be applied. However, the cross-modality gap between available and missing modalities usually causes semantic distortion and information loss in prompt generation. To handle this issue, we design an invertible prompter that incorporates full reconstruction of the available input modality from the generated prompt. To provide a comprehensive evaluation platform, we construct several high-quality benchmark datasets, in which various modality-missing scenarios are considered to simulate real-world challenges. Extensive experiments on three modality-missing benchmark datasets show that our method achieves significant performance improvements over state-of-the-art methods. We have released the code and simulation datasets at: https://github.com/mmic-lcl.
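As a rough illustration of the mechanism, the following is a minimal PyTorch sketch of an invertible prompter built from a single affine coupling transform (in the spirit of Glow, Kingma & Dhariwal, 2018, cited below): the forward pass generates a prompt from available-modality features, and the exact analytic inverse reconstructs the input, which is the content-preserving property described above. All names, shapes, and layer sizes here are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of an invertible prompter via affine coupling.
# Assumption: token features of the available modality have shape (B, N, dim),
# with an even feature dimension so it can be split into two halves.
import torch
import torch.nn as nn


class InvertiblePrompter(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        assert dim % 2 == 0, "feature dim must be even for coupling split"
        # Small nets predicting the scale/shift of the coupling transform.
        self.scale_net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim // 2), nn.Tanh(),  # Tanh keeps scales bounded
        )
        self.shift_net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim // 2),
        )

    def forward(self, x_avail: torch.Tensor) -> torch.Tensor:
        # x_avail: features of the available modality, shape (B, N, dim).
        x1, x2 = x_avail.chunk(2, dim=-1)
        # Affine coupling: x1 passes through unchanged; x2 is transformed.
        y2 = x2 * torch.exp(self.scale_net(x1)) + self.shift_net(x1)
        return torch.cat([x1, y2], dim=-1)  # prompt for the missing modality

    def inverse(self, prompt: torch.Tensor) -> torch.Tensor:
        # Exact inverse of the coupling: recovers the available-modality
        # input from the generated prompt, with no information loss.
        y1, y2 = prompt.chunk(2, dim=-1)
        x2 = (y2 - self.shift_net(y1)) * torch.exp(-self.scale_net(y1))
        return torch.cat([y1, x2], dim=-1)
```

Under these assumptions, `prompter.inverse(prompter(x))` recovers `x` up to floating-point error, e.g. `torch.allclose(prompter.inverse(prompter(x)), x, atol=1e-5)`; it is this invertibility that prevents the semantic distortion and information loss of one-way prompt generation.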
Data Availability
The data used in our submission consists of publicly available datasets, namely RGBT234 (Li et al., 2019), LasHeR (Li et al., 2021), and VTUAV (Pengyu et al., 2022), which can be accessed through their respective official websites. In addition, we have generated new data for our research, which is publicly available and can be accessed freely through our GitHub repository (https://github.com/mmic-lcl). We ensure that all data used in our study, including the publicly available datasets and the newly generated data, is accessible to the research community for further investigation and validation.
References
Lu, A., Li, C., Yan, Y., Tang, J., & Luo, B. (2021). Rgbt tracking via multi-adapter network with hierarchical divergence loss. IEEE Transactions on Image Processing, 30, 5613–5625.
Cui, Z., Zhou, L., Wang, C., Xu, C., & Yang, J. (2022). Visual micro-pattern propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 1267–1286.
Zhang, T., Guo, H., Jiao, Q., Zhang, Q. & Han, J. (2023). Efficient rgb-t tracking via cross-modality distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5404–5413
Pengyu, Z., Zhao, J., Wang, D., Lu, H. & Ruan, X. (2022). Visible-thermal uav tracking: A large-scale benchmark and new baseline. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Hui, T., Xun, Z., Peng, F., Huang, J., Wei, X., Wei, X., Dai, J., Han, J. & Liu, S. (2023). Bridging search region interaction with template for rgb-t tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13630–13639
Zhu, J., Lai, S., Chen, X., Wang, D. & Lu, H. (2023). Visual prompt multi-modal tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9516–9526
Xiao, Y., Yang, M., Li, C., Liu, L. & Tang, J. (2022). Attribute-based progressive fusion network for rgbt tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2831–2838
Lu, A., Qian, C., Li, C., Tang, J. & Wang, L. (2022). Duality-gated mutual condition network for rgbt tracking. IEEE Transactions on Neural Networks and Learning Systems
Wang, C., Xu, C., Cui, Z., Zhou, L. & Yang, J. (2020). Cross-modal pattern-propagation for rgb-t tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G. & Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Li, C., Xue, W., Jia, Y., Qu, Z., Luo, B., Tang, J., & Sun, D. (2021). Lasher: A large-scale high-diversity benchmark for rgbt tracking. IEEE Transactions on Image Processing, 31, 392–404.
Isola, P., Zhu, J.-Y., Zhou, T. & Efros, A.A. (2017). Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134
Kingma, D.P. & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems
Zhou, L., Ye, M., Zhu, X., Xiao, S., Fan, X.-Q. & Neri, F. (2023). Homeomorphism alignment for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18699–18710
Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., Carin, L. & Deng, L. (2017). Semantic compositional networks for visual captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5630–5639
Cao, Z., Long, M., Wang, J. & Jordan, M.I. (2018). Partial transfer learning with selective adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2724–2732
Li, C., Liang, X., Lu, Y., Zhao, N., & Tang, J. (2019). Rgb-t object tracking: benchmark and baseline. Pattern Recognition, 96, 106977.
Li, C., Zhao, N., Lu, Y., Zhu, C. & Tang, J. (2017). Weighted sparse representation regularized graph learning for rgb-t object tracking. In: Proceedings of ACM International Conference on Multimedia
Deng, J., Dong, W., Socher, R., Li, L., Li, K. & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Li, C., Lu, A., Zheng, A., Tu, Z. & Tang, J. (2019). Multi-adapter rgbt tracking. In: Proceedings of IEEE International Conference on Computer Vision Workshops
Li, C., Liu, L., Lu, A., Ji, Q. & Tang, J. (2020). Challenge-aware rgbt tracking. In: Proceedings of the IEEE European Conference on Computer Vision
Zhang, P., Wang, D., Lu, H., & Yang, X. (2021). Learning adaptive attribute-driven representation for real-time rgb-t tracking. International Journal of Computer Vision, 129, 2714–2729.
Zhang, L., Danelljan, M., Gonzalez-Garcia, A., Weijer, J. & Shahbaz Khan, F. (2019). Multi-modal fusion for end-to-end rgb-t tracking. In: Proceedings of the IEEE International Conference on Computer Vision Workshops
Zhang, T., Liu, X., Zhang, Q., & Han, J. (2021). Siamcda: Complementarity- and distractor-aware rgb-t tracking based on siamese network. IEEE Transactions on Circuits and Systems for Video Technology, 32(3), 1403–1417.
Yang, J., Li, Z., Zheng, F., Leonardis, A. & Song, J. (2022). Prompting for multi-modal tracking. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 3492–3500
Ma, M., Ren, J., Zhao, L., Tulyakov, S., Wu, C. & Peng, X. (2021). Smil: Multimodal learning with severely missing modality. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2302–2310
Zhao, J., Li, R. & Jin, Q. (2021). Missing modality imagination network for emotion recognition with uncertain missing modalities. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 2608–2618
Ma, M., Ren, J., Zhao, L., Testuggine, D. & Peng, X. (2022). Are multimodal transformers robust to missing modality? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18177–18186
Yin, Q., Wu, S., & Wang, L. (2017). Unified subspace learning for incomplete and unlabeled multi-view data. Pattern Recognition, 67, 313–327.
Zhang, C., Cui, Y., Han, Z., Zhou, J. T., Fu, H., & Hu, Q. (2020). Deep partial multi-view learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5), 2402–2415.
Xu, J., Li, C., Ren, Y., Peng, L., Mo, Y., Shi, X. & Zhu, X. (2022). Deep incomplete multi-view clustering via mining cluster complementarity. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8761–8769
Lin, Y., Gou, Y., Liu, Z., Li, B., Lv, J. & Peng, X. (2021). Completer: Incomplete multi-view clustering via contrastive prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11174–11183
Hu, M., Maillard, M., Zhang, Y., Ciceri, T., La Barbera, G., Bloch, I. & Gori, P. (2020). Knowledge distillation from multi-modal to mono-modal segmentation networks. In: Medical Image Computing and Computer Assisted Intervention–MICCAI, pp. 772–781
Wang, Y., Zhang, Y., Liu, Y., Lin, Z., Tian, J., Zhong, C., Shi, Z., Fan, J. & He, Z. (2021). Acn: Adversarial co-training network for brain tumor segmentation with missing modalities. In: Medical Image Computing and Computer Assisted Intervention–MICCAI, pp. 410–420
Wang, H., Chen, Y., Ma, C., Avery, J., Hull, L. & Carneiro, G. (2023). Multi-modal learning with missing modality via shared-specific feature modelling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15878–15887
Lee, Y.-L., Tsai, Y.-H., Chiu, W.-C. & Lee, C.-Y. (2023). Multimodal prompting with missing modalities for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14943–14952
Lester, B., Al-Rfou, R. & Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691
Li, X.L. & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190
Liu, X., Ji, K., Fu, Y., Tam, W.L., Du, Z., Yang, Z. & Tang, J. (2021). P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602
Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B. & Lim, S.-N. (2022). Visual prompt tuning. In: European Conference on Computer Vision, pp. 709–727
Jie, S. & Deng, Z.-H. (2022). Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv:2207.07039
Cao, B., Guo, J., Zhu, P. & Hu, Q. (2024). Bi-directional adapter for multi-modal tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence
Ye, B., Chang, H., Ma, B., Shan, S. & Chen, X. (2022). Joint feature learning and relation modeling for tracking: A one-stream framework. In: European Conference on Computer Vision, pp. 341–357. Springer
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. & Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685
Yin, D., Yang, Y., Wang, Z., Yu, H., Wei, K. & Sun, X. (2023). 1% vs 100%: Parameter-efficient low rank adapter for dense predictions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20116–20126
Van Erven, T., & Harremos, P. (2014). Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7), 3797–3820.
Li, C., Zhu, T., Liu, L., Si, X., Fan, Z. & Zhai, S. (2022). Cross-modal object tracking: Modality-aware representations and a unified benchmark. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1289–1296
Li, C., Cheng, H., Hu, S., Liu, X., Tang, J., & Lin, L. (2016). Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Transactions on Image Processing, 25(12), 5743–5756.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32
Loshchilov, I. & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101
Hou, X., Xing, J., Qian, Y., Guo, Y., Xin, S., Chen, J., Tang, K., Wang, M., Jiang, Z. & Liu, L. (2024). Sdstrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
Acknowledgements
This work was supported in part by the Major Project for the National Natural Science Foundation of China under Grant 62376004; in part by the National Natural Science Foundation of China under Grant 62406002; in part by the Natural Science Foundation of Anhui Province under Grant 2208085J18; and in part by the Natural Science Foundation of Anhui Higher Education Institution under Grant 2022AH040014.
Additional information
Communicated by Matej Kristan.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lu, A., Li, C., Zhao, J. et al. Modality-missing RGBT Tracking: Invertible Prompt Learning and High-quality Benchmarks. Int J Comput Vis 133, 2599–2619 (2025). https://doi.org/10.1007/s11263-024-02311-4