
Learning Discriminative Features for Visual Tracking via Scenario Decoupling

Published in: International Journal of Computer Vision

Abstract

Visual tracking aims to automatically estimate the state of an object in a video sequence, which is challenging especially in complex scenarios. Recent Transformer-based trackers enable interaction between the target template and the search region during feature extraction for target-aware feature learning, and have achieved superior performance. However, visual tracking is essentially a task of discriminating the specified target from the background. These trackers commonly ignore the role of the background in feature learning, which may cause background regions to be mistakenly enhanced in complex scenarios, harming temporal robustness and spatial discriminability. To address these limitations, we propose a scenario-aware tracker (SATrack) built on a specifically designed scenario-aware Vision Transformer, which integrates a scenario knowledge extractor and a scenario knowledge modulator. The proposed SATrack enjoys several merits. First, we design a novel scenario-aware Vision Transformer for visual tracking that can decouple historical scenarios into explicit target and background knowledge to guide discriminative feature learning. Second, the scenario knowledge extractor is designed to dynamically acquire decoupled and compact scenario knowledge from video contexts, and the scenario knowledge modulator is designed to embed this knowledge into attention mechanisms for scenario-aware feature learning. Extensive experiments on nine tracking benchmarks demonstrate that SATrack achieves new state-of-the-art performance at high FPS.
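
The attention-modulation idea described above can be illustrated with a brief, hypothetical sketch (not the authors' released implementation). It assumes the decoupled scenario knowledge is carried by two small sets of tokens, one summarizing the target and one summarizing background distractors, and approximates the scenario knowledge modulator by appending these tokens to the keys and values of a standard attention layer so that template and search-region tokens attend to both. The class name, token counts, and shapes below are illustrative assumptions.

import torch
import torch.nn as nn

class ScenarioAwareAttention(nn.Module):
    """Hypothetical sketch: attention modulated by decoupled scenario tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens, target_knowledge, background_knowledge):
        # tokens:               (B, N, C)  template + search-region tokens
        # target_knowledge:     (B, Kt, C) compact tokens summarizing the target
        # background_knowledge: (B, Kb, C) compact tokens summarizing distractors
        # Append the scenario tokens to the keys/values so every token can
        # attend to both target knowledge and background knowledge.
        context = torch.cat([tokens, target_knowledge, background_knowledge], dim=1)
        out, _ = self.attn(query=tokens, key=context, value=context)
        return out

if __name__ == "__main__":
    layer = ScenarioAwareAttention(dim=256)
    x = torch.randn(2, 320, 256)    # template + search tokens
    tgt = torch.randn(2, 4, 256)    # pooled target tokens (assumed size)
    bg = torch.randn(2, 4, 256)     # pooled background tokens (assumed size)
    print(layer(x, tgt, bg).shape)  # torch.Size([2, 320, 256])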

Data Availability

The GOT-10k dataset can be downloaded at http://got-10k.aitestunion.com/downloads. The TNL2K dataset can be downloaded at https://github.com/wangxiao5791509/TNL2K_evaluation_toolkit. The LaSOT and LaSOText datasets can be downloaded at https://github.com/HengLan/LaSOT_Evaluation_Toolkit. The TrackingNet dataset can be downloaded at https://github.com/SilvioGiancola/TrackingNet-devkit. The NFS dataset can be downloaded at https://ci2cv.net/nfs/index.html. The UAV123 dataset can be downloaded at https://cemse.kaust.edu.sa/ivul/uav123. The AVisT dataset can be downloaded at https://sites.google.com/view/avist-benchmark.

Author information

Contributions

Data collection and analysis were performed by YM and QY. The SATrack model was originally proposed by YM and QY, and WY further improved it. TZ and JZ led the project, discussed its feasibility in depth, and polished the manuscript. TZ also participated in part of the experimental design and in paper revision. The first draft of the manuscript was written by YM, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Tianzhu Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by Matej Kristan.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ma, Y., Yu, Q., Yang, W. et al. Learning Discriminative Features for Visual Tracking via Scenario Decoupling. Int J Comput Vis 133, 2950–2966 (2025). https://doi.org/10.1007/s11263-024-02307-0
