Abstract
Visual tracking aims to automatically estimate the state of an object throughout a video sequence, which is challenging, especially in complex scenarios. Recent Transformer-based trackers enable interaction between the target template and the search region during feature extraction for target-aware feature learning, and have achieved superior performance. However, visual tracking is essentially a task of discriminating the specified target from the background. These trackers commonly ignore the role of the background in feature learning, which may cause background regions to be mistakenly enhanced in complex scenarios, degrading temporal robustness and spatial discriminability. To address these limitations, we propose a scenario-aware tracker (SATrack) built on a specifically designed scenario-aware Vision Transformer, which integrates a scenario knowledge extractor and a scenario knowledge modulator. The proposed SATrack enjoys several merits. Firstly, we design a novel scenario-aware Vision Transformer for visual tracking, which decouples historical scenarios into explicit target and background knowledge to guide discriminative feature learning. Secondly, the scenario knowledge extractor dynamically acquires decoupled and compact scenario knowledge from video contexts, and the scenario knowledge modulator embeds this knowledge into the attention mechanism for scenario-aware feature learning. Extensive experimental results on nine tracking benchmarks demonstrate that SATrack achieves new state-of-the-art performance while running at a high frame rate.
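The abstract describes the scenario knowledge modulator only at a high level. As a rough illustration, and not the authors' implementation, the sketch below shows one plausible way decoupled target and background knowledge tokens could be appended to the keys and values of a ViT attention layer so that search-region features are modulated by both. The class name, token shapes, and hyperparameters are hypothetical.

```python
# Minimal, hypothetical sketch of scenario-aware attention, assuming the
# decoupled target/background knowledge is given as small sets of tokens.
# This is NOT the paper's implementation; names and shapes are illustrative.
import torch
import torch.nn as nn


class ScenarioModulatedAttention(nn.Module):
    """Attention over search-region tokens, with scenario knowledge tokens
    (target + background) appended to the keys/values so the layer can both
    enhance target-like features and account for background distractors."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_tokens, target_knowledge, background_knowledge):
        # search_tokens:        (B, N_s, C) tokens of the current search region
        # target_knowledge:     (B, N_t, C) compact target tokens from past frames
        # background_knowledge: (B, N_b, C) compact background tokens from past frames
        kv = torch.cat([search_tokens, target_knowledge, background_knowledge], dim=1)
        out, _ = self.attn(query=search_tokens, key=kv, value=kv)
        return self.norm(search_tokens + out)  # residual connection


if __name__ == "__main__":
    B, N_s, N_t, N_b, C = 2, 256, 4, 4, 256
    layer = ScenarioModulatedAttention(dim=C)
    x = torch.randn(B, N_s, C)
    tgt, bg = torch.randn(B, N_t, C), torch.randn(B, N_b, C)
    print(layer(x, tgt, bg).shape)  # torch.Size([2, 256, 256])
```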
Data Availability
The GOT-10k dataset can be downloaded at http://got-10k.aitestunion.com/downloads. The TNL2K dataset can be downloaded at https://github.com/wangxiao5791509/TNL2K_evaluation_toolkit. The LaSOT and LaSOText datasets can be downloaded at https://github.com/HengLan/LaSOT_Evaluation_Toolkit. The TrackingNet dataset can be downloaded at https://github.com/SilvioGiancola/TrackingNet-devkit. The NFS dataset can be downloaded at https://ci2cv.net/nfs/index.html. The UAV123 dataset can be downloaded at https://cemse.kaust.edu.sa/ivul/uav123. The AVisT dataset can be downloaded at https://sites.google.com/view/avist-benchmark.
Author information
Contributions
Data collection and analysis were performed by YM and QY. The SATrack model was originally proposed by YM and QY. WY further improved the SATrack model. TZ and JZ led this project, discussed its feasibility in depth, and polished the manuscript. TZ also participated in part of the experimental design and in paper revision. The first draft of the manuscript was written by YM, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by Matej Kristan.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ma, Y., Yu, Q., Yang, W. et al. Learning Discriminative Features for Visual Tracking via Scenario Decoupling. Int J Comput Vis 133, 2950–2966 (2025). https://doi.org/10.1007/s11263-024-02307-0