
Modality-missing RGBT Tracking: Invertible Prompt Learning and High-quality Benchmarks

International Journal of Computer Vision

Abstract

Current RGBT tracking research relies on complete multi-modality input, but modal information may be missing due to factors such as thermal sensor self-calibration and data transmission errors, which we call the modality-missing challenge in this work. To address this challenge, we propose a novel invertible prompt learning approach for robust RGBT tracking, which integrates content-preserving prompts into a well-trained tracking model to adapt it to various modality-missing scenarios. Given a modality-missing scenario, we propose to utilize the available modality to generate the prompt of the missing modality and thereby adapt the RGBT tracking model. However, the cross-modality gap between the available and missing modalities usually causes semantic distortion and information loss in prompt generation. To handle this issue, we design an invertible prompter that incorporates full reconstruction of the input available modality from the generated prompt. To provide a comprehensive evaluation platform, we construct several high-quality benchmark datasets in which various modality-missing scenarios are considered to simulate real-world challenges. Extensive experiments on three modality-missing benchmark datasets show that our method achieves significant performance improvements compared with state-of-the-art methods. We have released the code and simulation datasets at: https://github.com/mmic-lcl.
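The key mechanism can be illustrated with a minimal sketch (not the authors' released code, which is available at the repository above): an invertible prompter built from Glow-style affine coupling layers (Kingma & Dhariwal, 2018) maps features of the available modality to a prompt standing in for the missing modality, and running the same layers in reverse reconstructs the input exactly, so prompt generation discards no information. All module names, channel sizes, and tensor shapes below are illustrative assumptions.

```python
# Hedged sketch of an invertible prompter using affine coupling layers.
# Module names, channel sizes, and shapes are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """One invertible coupling block: half the channels pass through unchanged
    and condition an affine transform of the other half, so the mapping can be
    inverted in closed form."""

    def __init__(self, channels: int, flip: bool = False):
        super().__init__()
        assert channels % 2 == 0, "use an even channel count"
        self.half = channels // 2
        self.flip = flip  # alternate which half is transformed across blocks
        # Small conv net predicting per-position log-scale and shift.
        self.net = nn.Sequential(
            nn.Conv2d(self.half, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 2 * self.half, 3, padding=1),
        )

    def forward(self, x, reverse: bool = False):
        xa, xb = x[:, :self.half], x[:, self.half:]
        if self.flip:
            xa, xb = xb, xa
        log_s, t = self.net(xa).chunk(2, dim=1)
        s = torch.sigmoid(log_s + 2.0)  # bounded, strictly positive scales
        xb = (xb - t) / s if reverse else xb * s + t
        if self.flip:
            xa, xb = xb, xa
        return torch.cat([xa, xb], dim=1)


class InvertiblePrompter(nn.Module):
    """Generates a prompt for the missing modality from available-modality
    features; running the same weights in reverse reconstructs the input,
    so no information is lost during prompt generation."""

    def __init__(self, channels: int = 8, num_blocks: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            AffineCoupling(channels, flip=(i % 2 == 1)) for i in range(num_blocks)
        )

    def forward(self, feat_available):
        prompt = feat_available
        for blk in self.blocks:
            prompt = blk(prompt)
        return prompt

    def inverse(self, prompt):
        feat = prompt
        for blk in reversed(self.blocks):
            feat = blk(feat, reverse=True)
        return feat


if __name__ == "__main__":
    prompter = InvertiblePrompter(channels=8)
    rgb_feat = torch.randn(2, 8, 16, 16)      # available (RGB) modality features
    thermal_prompt = prompter(rgb_feat)       # prompt standing in for the missing modality
    recon = prompter.inverse(thermal_prompt)  # reconstruction of the available modality
    print(torch.allclose(recon, rgb_feat, atol=1e-4))  # True up to numerical error
```

Because each coupling block is bijective by construction, the reconstruction of the available modality from the generated prompt can be enforced exactly rather than only approximately through a separate decoder; this is one way to realize the content-preserving property described above.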

Data Availability

The data used in our submission consists of publicly available datasets, namely RGBT234 (Li et al., 2019), LasHeR (Li et al., 2021), and VTUAV (Pengyu et al., 2022), which are published and can be accessed through their respective official websites. In addition, we have generated new data for our research, which is publicly available and can be accessed freely through our GitHub repository [https://github.com/mmic-lcl]. We ensure that all the data used in our study, including the publicly available datasets and the newly generated data, is accessible to the research community for further investigation and validation.

References

  • Lu, A., Li, C., Yan, Y., Tang, J., & Luo, B. (2021). RGBT tracking via multi-adapter network with hierarchical divergence loss. IEEE Transactions on Image Processing, 30, 5613–5625.

  • Cui, Z., Zhou, L., Wang, C., Xu, C., & Yang, J. (2022). Visual micro-pattern propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 1267–1286.

  • Zhang, T., Guo, H., Jiao, Q., Zhang, Q. & Han, J. (2023). Efficient RGB-T tracking via cross-modality distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5404–5413

  • Pengyu, Z., Zhao, J., Wang, D., Lu, H. & Ruan, X. (2022). Visible-thermal UAV tracking: A large-scale benchmark and new baseline. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

  • Hui, T., Xun, Z., Peng, F., Huang, J., Wei, X., Wei, X., Dai, J., Han, J. & Liu, S. (2023). Bridging search region interaction with template for RGB-T tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13630–13639

  • Zhu, J., Lai, S., Chen, X., Wang, D. & Lu, H. (2023). Visual prompt multi-modal tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9516–9526

  • Xiao, Y., Yang, M., Li, C., Liu, L. & Tang, J. (2022). Attribute-based progressive fusion network for RGBT tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2831–2838

  • Lu, A., Qian, C., Li, C., Tang, J. & Wang, L. (2022). Duality-gated mutual condition network for RGBT tracking. IEEE Transactions on Neural Networks and Learning Systems

  • Wang, C., Xu, C., Cui, Z., Zhou, L. & Yang, J. (2020). Cross-modal pattern-propagation for RGB-T tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G. & Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  • Li, C., Xue, W., Jia, Y., Qu, Z., Luo, B., Tang, J., & Sun, D. (2021). LasHeR: A large-scale high-diversity benchmark for RGBT tracking. IEEE Transactions on Image Processing, 31, 392–404.

  • Isola, P., Zhu, J.-Y., Zhou, T. & Efros, A.A. (2017). Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134

  • Kingma, D.P. & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems

  • Zhou, L., Ye, M., Zhu, X., Xiao, S., Fan, X.-Q. & Neri, F. (2023). Homeomorphism alignment for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18699–18710

  • Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., Carin, L. & Deng, L. (2017). Semantic compositional networks for visual captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5630–5639

  • Cao, Z., Long, M., Wang, J. & Jordan, M.I. (2018). Partial transfer learning with selective adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2724–2732

  • Li, C., Liang, X., Lu, Y., Zhao, N., & Tang, J. (2019). RGB-T object tracking: Benchmark and baseline. Pattern Recognition, 96, 106977.

  • Li, C., Zhao, N., Lu, Y., Zhu, C. & Tang, J. (2017). Weighted sparse representation regularized graph learning for RGB-T object tracking. In: Proceedings of ACM International Conference on Multimedia

  • Deng, J., Dong, W., Socher, R., Li, L., Li, K. & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

  • Li, C., Lu, A., Zheng, A., Tu, Z. & Tang, J. (2019). Multi-adapter RGBT tracking. In: Proceedings of IEEE International Conference on Computer Vision Workshops

  • Li, C., Liu, L., Lu, A., Ji, Q. & Tang, J. (2020). Challenge-aware RGBT tracking. In: Proceedings of the IEEE European Conference on Computer Vision

  • Zhang, P., Wang, D., Lu, H., & Yang, X. (2021). Learning adaptive attribute-driven representation for real-time RGB-T tracking. International Journal of Computer Vision, 129, 2714–2729.

  • Zhang, L., Danelljan, M., Gonzalez-Garcia, A., Weijer, J. & Shahbaz Khan, F. (2019). Multi-modal fusion for end-to-end RGB-T tracking. In: Proceedings of the IEEE International Conference on Computer Vision Workshops

  • Zhang, T., Liu, X., Zhang, Q., & Han, J. (2021). SiamCDA: Complementarity- and distractor-aware RGB-T tracking based on Siamese network. IEEE Transactions on Circuits and Systems for Video Technology, 32(3), 1403–1417.

  • Yang, J., Li, Z., Zheng, F., Leonardis, A. & Song, J. (2022). Prompting for multi-modal tracking. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 3492–3500

  • Ma, M., Ren, J., Zhao, L., Tulyakov, S., Wu, C. & Peng, X. (2021). SMIL: Multimodal learning with severely missing modality. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2302–2310

  • Zhao, J., Li, R. & Jin, Q. (2021). Missing modality imagination network for emotion recognition with uncertain missing modalities. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 2608–2618

  • Ma, M., Ren, J., Zhao, L., Testuggine, D. & Peng, X. (2022). Are multimodal transformers robust to missing modality? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18177–18186

  • Yin, Q., Wu, S., & Wang, L. (2017). Unified subspace learning for incomplete and unlabeled multi-view data. Pattern Recognition, 67, 313–327.

  • Zhang, C., Cui, Y., Han, Z., Zhou, J. T., Fu, H., & Hu, Q. (2020). Deep partial multi-view learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5), 2402–2415.

  • Xu, J., Li, C., Ren, Y., Peng, L., Mo, Y., Shi, X. & Zhu, X. (2022). Deep incomplete multi-view clustering via mining cluster complementarity. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8761–8769

  • Lin, Y., Gou, Y., Liu, Z., Li, B., Lv, J. & Peng, X. (2021). COMPLETER: Incomplete multi-view clustering via contrastive prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11174–11183

  • Hu, M., Maillard, M., Zhang, Y., Ciceri, T., La Barbera, G., Bloch, I. & Gori, P. (2020). Knowledge distillation from multi-modal to mono-modal segmentation networks. In: Medical Image Computing and Computer Assisted Intervention–MICCAI, pp. 772–781

  • Wang, Y., Zhang, Y., Liu, Y., Lin, Z., Tian, J., Zhong, C., Shi, Z., Fan, J. & He, Z. (2021). ACN: Adversarial co-training network for brain tumor segmentation with missing modalities. In: Medical Image Computing and Computer Assisted Intervention–MICCAI, pp. 410–420

  • Wang, H., Chen, Y., Ma, C., Avery, J., Hull, L. & Carneiro, G. (2023). Multi-modal learning with missing modality via shared-specific feature modelling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15878–15887

  • Lee, Y.-L., Tsai, Y.-H., Chiu, W.-C. & Lee, C.-Y. (2023). Multimodal prompting with missing modalities for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14943–14952

  • Lester, B., Al-Rfou, R. & Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691

  • Li, X.L. & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190

  • Liu, X., Ji, K., Fu, Y., Tam, W.L., Du, Z., Yang, Z. & Tang, J. (2021). P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602

  • Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B. & Lim, S.-N. (2022). Visual prompt tuning. In: European Conference on Computer Vision, pp. 709–727

  • Jie, S. & Deng, Z.-H. (2022). Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv:2207.07039

  • Cao, B., Guo, J., Zhu, P. & Hu, Q. (2024). Bi-directional adapter for multi-modal tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence

  • Ye, B., Chang, H., Ma, B., Shan, S. & Chen, X. (2022). Joint feature learning and relation modeling for tracking: A one-stream framework. In: European Conference on Computer Vision, pp. 341–357. Springer

  • Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685

  • Yin, D., Yang, Y., Wang, Z., Yu, H., Wei, K. & Sun, X. (2023). 1% vs 100%: Parameter-efficient low rank adapter for dense predictions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20116–20126

  • Van Erven, T., & Harremos, P. (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7), 3797–3820.

  • Li, C., Zhu, T., Liu, L., Si, X., Fan, Z. & Zhai, S. (2022). Cross-modal object tracking: Modality-aware representations and a unified benchmark. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1289–1296

  • Li, C., Cheng, H., Hu, S., Liu, X., Tang, J., & Lin, L. (2016). Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Transactions on Image Processing, 25(12), 5743–5756.

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L. & et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32

  • Loshchilov, I. & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

  • Hou, X., Xing, J., Qian, Y., Guo, Y., Xin, S., Chen, J., Tang, K., Wang, M., Jiang, Z. & Liu, L. (2024). SDSTrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Acknowledgements

This work was supported in part by the Major Project for the National Natural Science Foundation of China under Grant 62376004; in part by the National Natural Science Foundation of China under Grant 62406002; in part by the Natural Science Foundation of Anhui Province under Grant 2208085J18; and in part by the Natural Science Foundation of Anhui Higher Education Institution under Grant 2022AH040014.

Author information

Corresponding author

Correspondence to Chenglong Li.

Additional information

Communicated by Matej Kristan.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lu, A., Li, C., Zhao, J. et al. Modality-missing RGBT Tracking: Invertible Prompt Learning and High-quality Benchmarks. Int J Comput Vis 133, 2599–2619 (2025). https://doi.org/10.1007/s11263-024-02311-4

  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • Issue date:

  • DOI: https://doi.org/10.1007/s11263-024-02311-4

Keywords