Abstract
Visual tracking aims to automatically estimate the state of an object throughout a video sequence, which is challenging, especially in complex scenarios. Recent Transformer-based trackers enable interaction between the target template and the search region during feature extraction for target-aware feature learning, and have achieved superior performance. However, visual tracking is essentially a task of discriminating the specified target from the background. These trackers commonly ignore the role of the background in feature learning, which may cause background regions to be mistakenly enhanced in complex scenarios, degrading temporal robustness and spatial discriminability. To address these limitations, we propose a scenario-aware tracker (SATrack) built on a specifically designed scenario-aware Vision Transformer, which integrates a scenario knowledge extractor and a scenario knowledge modulator. The proposed SATrack enjoys several merits. Firstly, we design a novel scenario-aware Vision Transformer for visual tracking, which decouples historical scenarios into explicit target and background knowledge to guide discriminative feature learning. Secondly, the scenario knowledge extractor dynamically acquires decoupled and compact scenario knowledge from video contexts, and the scenario knowledge modulator embeds this knowledge into the attention mechanism for scenario-aware feature learning. Extensive experimental results on nine tracking benchmarks demonstrate that SATrack achieves new state-of-the-art performance while running at a high frame rate.
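The abstract describes the scenario knowledge modulator only at a high level. As a rough illustration, and not the authors' implementation, the sketch below shows one plausible way decoupled target and background knowledge tokens could be appended to the keys and values of a ViT attention layer so that search-region features are modulated by both. The class name, token shapes, and hyperparameters are hypothetical.

```python
# Minimal, hypothetical sketch of scenario-aware attention, assuming the
# decoupled target/background knowledge is given as small sets of tokens.
# This is NOT the paper's implementation; names and shapes are illustrative.
import torch
import torch.nn as nn


class ScenarioModulatedAttention(nn.Module):
    """Attention over search-region tokens, with scenario knowledge tokens
    (target + background) appended to the keys/values so the layer can both
    enhance target-like features and account for background distractors."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_tokens, target_knowledge, background_knowledge):
        # search_tokens:        (B, N_s, C) tokens of the current search region
        # target_knowledge:     (B, N_t, C) compact target tokens from past frames
        # background_knowledge: (B, N_b, C) compact background tokens from past frames
        kv = torch.cat([search_tokens, target_knowledge, background_knowledge], dim=1)
        out, _ = self.attn(query=search_tokens, key=kv, value=kv)
        return self.norm(search_tokens + out)  # residual connection


if __name__ == "__main__":
    B, N_s, N_t, N_b, C = 2, 256, 4, 4, 256
    layer = ScenarioModulatedAttention(dim=C)
    x = torch.randn(B, N_s, C)
    tgt, bg = torch.randn(B, N_t, C), torch.randn(B, N_b, C)
    print(layer(x, tgt, bg).shape)  # torch.Size([2, 256, 256])
```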
Data Availability
The GOT-10k dataset can be downloaded at http://got-10k.aitestunion.com/downloads. The TNL2K dataset can be downloaded at https://github.com/wangxiao5791509/TNL2K_evaluation_toolkit. The LaSOT and LaSOText datasets can be downloaded at https://github.com/HengLan/LaSOT_Evaluation_Toolkit. The TrackingNet dataset can be downloaded at https://github.com/SilvioGiancola/TrackingNet-devkit. The NFS dataset can be downloaded at https://ci2cv.net/nfs/index.html. The UAV123 dataset can be downloaded at https://cemse.kaust.edu.sa/ivul/uav123. The AVisT dataset can be downloaded at https://sites.google.com/view/avist-benchmark.
Author information
Contributions
Data collection and analysis were performed by YM and QY. The SATrack model was originally proposed by YM and QY. WY further improved the SATrack model. TZ and JZ led this project, discussed its feasibility in depth, and polished the manuscript. TZ also participated in part of the experimental design and in paper revision. The first draft of the manuscript was written by YM, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by Matej Kristan.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ma, Y., Yu, Q., Yang, W. et al. Learning Discriminative Features for Visual Tracking via Scenario Decoupling. Int J Comput Vis 133, 2950–2966 (2025). https://doi.org/10.1007/s11263-024-02307-0