Abstract
Current RGBT tracking research relies on complete multi-modality input, but modality information may be missing in practice due to factors such as thermal sensor self-calibration and data transmission errors, which we call the modality-missing challenge in this work. To address this challenge, we propose a novel invertible prompt learning approach for robust RGBT tracking, which integrates content-preserving prompts into a well-trained tracking model to adapt it to various modality-missing scenarios. Given a modality-missing scenario, we propose to utilize the available modality to generate the prompt of the missing modality, so that the RGBT tracking model can still be applied. However, the cross-modality gap between available and missing modalities usually causes semantic distortion and information loss in prompt generation. To handle this issue, we design an invertible prompter that incorporates full reconstruction of the available input modality from the generated prompt. To provide a comprehensive evaluation platform, we construct several high-quality benchmark datasets, in which various modality-missing scenarios are considered to simulate real-world challenges. Extensive experiments on three modality-missing benchmark datasets show that our method achieves significant performance improvements over state-of-the-art methods. We have released the code and simulation datasets at: https://github.com/mmic-lcl.
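As a rough illustration of the mechanism, the following is a minimal PyTorch sketch of an invertible prompter built from a single affine coupling transform (in the spirit of Glow, Kingma & Dhariwal, 2018, cited below): the forward pass generates a prompt from available-modality features, and the exact analytic inverse reconstructs the input, which is the content-preserving property described above. All names, shapes, and layer sizes here are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of an invertible prompter via affine coupling.
# Assumption: token features of the available modality have shape (B, N, dim),
# with an even feature dimension so it can be split into two halves.
import torch
import torch.nn as nn


class InvertiblePrompter(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        assert dim % 2 == 0, "feature dim must be even for coupling split"
        # Small nets predicting the scale/shift of the coupling transform.
        self.scale_net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim // 2), nn.Tanh(),  # Tanh keeps scales bounded
        )
        self.shift_net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim // 2),
        )

    def forward(self, x_avail: torch.Tensor) -> torch.Tensor:
        # x_avail: features of the available modality, shape (B, N, dim).
        x1, x2 = x_avail.chunk(2, dim=-1)
        # Affine coupling: x1 passes through unchanged; x2 is transformed.
        y2 = x2 * torch.exp(self.scale_net(x1)) + self.shift_net(x1)
        return torch.cat([x1, y2], dim=-1)  # prompt for the missing modality

    def inverse(self, prompt: torch.Tensor) -> torch.Tensor:
        # Exact inverse of the coupling: recovers the available-modality
        # input from the generated prompt, with no information loss.
        y1, y2 = prompt.chunk(2, dim=-1)
        x2 = (y2 - self.shift_net(y1)) * torch.exp(-self.scale_net(y1))
        return torch.cat([y1, x2], dim=-1)
```

Under these assumptions, `prompter.inverse(prompter(x))` recovers `x` up to floating-point error, e.g. `torch.allclose(prompter.inverse(prompter(x)), x, atol=1e-5)`; it is this invertibility that prevents the semantic distortion and information loss of one-way prompt generation.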
Data Availability
The data used in our submission consists of publicly available datasets, namely RGBT234 (Li et al., 2019), LasHeR (Li et al., 2021), and VTUAV (Pengyu et al., 2022), which can be accessed through their respective official websites. In addition, we have generated new data for our research, which is publicly available and can be accessed freely through our GitHub repository (https://github.com/mmic-lcl). We ensure that all data used in our study, including the publicly available datasets and the newly generated data, is accessible to the research community for further investigation and validation.
References
Lu, A., Li, C., Yan, Y., Tang, J., & Luo, B. (2021). Rgbt tracking via multi-adapter network with hierarchical divergence loss. IEEE Transactions on Image Processing, 30, 5613–5625.
Cui, Z., Zhou, L., Wang, C., Xu, C., & Yang, J. (2022). Visual micro-pattern propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1), 1267–1286.
Zhang, T., Guo, H., Jiao, Q., Zhang, Q. & Han, J. (2023). Efficient rgb-t tracking via cross-modality distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5404–5413
Pengyu, Z., Zhao, J., Wang, D., Lu, H. & Ruan, X. (2022). Visible-thermal uav tracking: A large-scale benchmark and new baseline. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Hui, T., Xun, Z., Peng, F., Huang, J., Wei, X., Wei, X., Dai, J., Han, J. & Liu, S. (2023). Bridging search region interaction with template for rgb-t tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13630–13639
Zhu, J., Lai, S., Chen, X., Wang, D. & Lu, H. (2023). Visual prompt multi-modal tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9516–9526
Xiao, Y., Yang, M., Li, C., Liu, L. & Tang, J. (2022). Attribute-based progressive fusion network for rgbt tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2831–2838
Lu, A., Qian, C., Li, C., Tang, J. & Wang, L. (2022). Duality-gated mutual condition network for rgbt tracking. IEEE Transactions on Neural Networks and Learning Systems
Wang, C., Xu, C., Cui, Z., Zhou, L. & Yang, J. (2020). Cross-modal pattern-propagation for rgb-t tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G. & Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Li, C., Xue, W., Jia, Y., Qu, Z., Luo, B., Tang, J., & Sun, D. (2021). Lasher: A large-scale high-diversity benchmark for rgbt tracking. IEEE Transactions on Image Processing, 31, 392–404.
Isola, P., Zhu, J.-Y., Zhou, T. & Efros, A.A. (2017). Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134
Kingma, D.P. & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems
Zhou, L., Ye, M., Zhu, X., Xiao, S., Fan, X.-Q. & Neri, F. (2023). Homeomorphism alignment for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18699–18710
Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., Carin, L. & Deng, L. (2017). Semantic compositional networks for visual captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5630–5639
Cao, Z., Long, M., Wang, J. & Jordan, M.I. (2018). Partial transfer learning with selective adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2724–2732
Li, C., Liang, X., Lu, Y., Zhao, N., & Tang, J. (2019). Rgb-t object tracking: benchmark and baseline. Pattern Recognition, 96, 106977.
Li, C., Zhao, N., Lu, Y., Zhu, C. & Tang, J. (2017). Weighted sparse representation regularized graph learning for rgb-t object tracking. In: Proceedings of ACM International Conference on Multimedia
Deng, J., Dong, W., Socher, R., Li, L., Li, K. & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Li, C., Lu, A., Zheng, A., Tu, Z. & Tang, J. (2019). Multi-adapter rgbt tracking. In: Proceedings of IEEE International Conference on Computer Vision Workshops
Li, C., Liu, L., Lu, A., Ji, Q. & Tang, J. (2020). Challenge-aware rgbt tracking. In: Proceedings of the IEEE European Conference on Computer Vision
Zhang, P., Wang, D., Lu, H., & Yang, X. (2021). Learning adaptive attribute-driven representation for real-time rgb-t tracking. International Journal of Computer Vision, 129, 2714–2729.
Zhang, L., Danelljan, M., Gonzalez-Garcia, A., Weijer, J. & Shahbaz Khan, F. (2019). Multi-modal fusion for end-to-end rgb-t tracking. In: Proceedings of the IEEE International Conference on Computer Vision Workshops
Zhang, T., Liu, X., Zhang, Q., & Han, J. (2021). Siamcda: Complementarity- and distractor-aware rgb-t tracking based on siamese network. IEEE Transactions on Circuits and Systems for Video Technology, 32(3), 1403–1417.
Yang, J., Li, Z., Zheng, F., Leonardis, A. & Song, J. (2022). Prompting for multi-modal tracking. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 3492–3500
Ma, M., Ren, J., Zhao, L., Tulyakov, S., Wu, C. & Peng, X. (2021). Smil: Multimodal learning with severely missing modality. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2302–2310
Zhao, J., Li, R. & Jin, Q. (2021). Missing modality imagination network for emotion recognition with uncertain missing modalities. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 2608–2618
Ma, M., Ren, J., Zhao, L., Testuggine, D. & Peng, X. (2022). Are multimodal transformers robust to missing modality? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18177–18186
Yin, Q., Wu, S., & Wang, L. (2017). Unified subspace learning for incomplete and unlabeled multi-view data. Pattern Recognition, 67, 313–327.
Zhang, C., Cui, Y., Han, Z., Zhou, J. T., Fu, H., & Hu, Q. (2020). Deep partial multi-view learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5), 2402–2415.
Xu, J., Li, C., Ren, Y., Peng, L., Mo, Y., Shi, X. & Zhu, X. (2022). Deep incomplete multi-view clustering via mining cluster complementarity. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8761–8769
Lin, Y., Gou, Y., Liu, Z., Li, B., Lv, J. & Peng, X. (2021). Completer: Incomplete multi-view clustering via contrastive prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11174–11183
Hu, M., Maillard, M., Zhang, Y., Ciceri, T., La Barbera, G., Bloch, I. & Gori, P. (2020). Knowledge distillation from multi-modal to mono-modal segmentation networks. In: Medical Image Computing and Computer Assisted Intervention–MICCAI, pp. 772–781
Wang, Y., Zhang, Y., Liu, Y., Lin, Z., Tian, J., Zhong, C., Shi, Z., Fan, J. & He, Z. (2021). Acn: Adversarial co-training network for brain tumor segmentation with missing modalities. In: Medical Image Computing and Computer Assisted Intervention–MICCAI, pp. 410–420
Wang, H., Chen, Y., Ma, C., Avery, J., Hull, L. & Carneiro, G. (2023). Multi-modal learning with missing modality via shared-specific feature modelling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15878–15887
Lee, Y.-L., Tsai, Y.-H., Chiu, W.-C. & Lee, C.-Y. (2023). Multimodal prompting with missing modalities for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14943–14952
Lester, B., Al-Rfou, R. & Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691
Li, X.L. & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190
Liu, X., Ji, K., Fu, Y., Tam, W.L., Du, Z., Yang, Z. & Tang, J. (2021). P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602
Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B. & Lim, S.-N. (2022). Visual prompt tuning. In: European Conference on Computer Vision, pp. 709–727
Jie, S. & Deng, Z.-H. (2022). Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv:2207.07039
Cao, B., Guo, J., Zhu, P. & Hu, Q. (2024). Bi-directional adapter for multi-modal tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence
Ye, B., Chang, H., Ma, B., Shan, S. & Chen, X. (2022). Joint feature learning and relation modeling for tracking: A one-stream framework. In: European Conference on Computer Vision, pp. 341–357. Springer
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. & Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685
Yin, D., Yang, Y., Wang, Z., Yu, H., Wei, K. & Sun, X. (2023). 1% vs 100%: Parameter-efficient low rank adapter for dense predictions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20116–20126
Van Erven, T., & Harremos, P. (2014). Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7), 3797–3820.
Li, C., Zhu, T., Liu, L., Si, X., Fan, Z. & Zhai, S. (2022). Cross-modal object tracking: Modality-aware representations and a unified benchmark. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1289–1296
Li, C., Cheng, H., Hu, S., Liu, X., Tang, J., & Lin, L. (2016). Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Transactions on Image Processing, 25(12), 5743–5756.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32
Loshchilov, I. & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101
Hou, X., Xing, J., Qian, Y., Guo, Y., Xin, S., Chen, J., Tang, K., Wang, M., Jiang, Z. & Liu, L. (2024). Sdstrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
Acknowledgements
This work was supported in part by the Major Project for the National Natural Science Foundation of China under Grant 62376004; in part by the National Natural Science Foundation of China under Grant 62406002; in part by the Natural Science Foundation of Anhui Province under Grant 2208085J18; and in part by the Natural Science Foundation of Anhui Higher Education Institution under Grant 2022AH040014.
Additional information
Communicated by Matej Kristan.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lu, A., Li, C., Zhao, J. et al. Modality-missing RGBT Tracking: Invertible Prompt Learning and High-quality Benchmarks. Int J Comput Vis 133, 2599–2619 (2025). https://doi.org/10.1007/s11263-024-02311-4