A Spatio-Temporal Robust Tracker with Spatial-Channel Transformer and Jitter Suppression

Abstract

The robustness of visual object tracking is reflected not only in the accuracy of target localisation in every single frame, but also in the smoothness of the predicted motion of the tracked object across consecutive frames. From the perspective of appearance modelling, the success of state-of-the-art Transformer-based trackers derives from their ability to adaptively associate the representations of related spatial regions. However, the absence of attention in the channel dimension hinders the realisation of their full tracking capacity. To cope with the commonly occurring misalignment of spatial scale between the template and a search patch, we propose a novel cross-channel correlation mechanism. Accordingly, the relevance of multi-channel features in the channel Transformer is modelled using two different sources of information. The result is a novel spatial-channel Transformer, which integrates information conveyed by features along both the spatial and channel directions. For temporal modelling, to quantify temporal smoothness, we propose a jitter metric that measures the cross-frame variation of the predicted bounding boxes in terms of parameters such as centre displacement, area, and aspect ratio. As the change of an object between consecutive frames is limited, the proposed jitter loss can be used to monitor the temporal behaviour of the tracking results and penalise erroneous predictions during the training stage, thus enhancing the temporal stability of an appearance-based tracker. Extensive experiments on several well-known benchmark datasets demonstrate the robustness of the proposed tracker.
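
To make the cross-channel idea more concrete, the sketch below illustrates one way attention over the channel dimension can relate two sources of information, the template and the search region, even when their spatial scales are misaligned. The function name, the fixed-size pooling used to make the per-channel descriptors comparable, and the affinity normalisation are our assumptions for illustration; the paper's actual spatial-channel Transformer module may differ.

```python
import torch
import torch.nn.functional as F

def cross_channel_correlation(search: torch.Tensor,
                              template: torch.Tensor,
                              d: int = 64) -> torch.Tensor:
    """Attention over the channel dimension between search-region
    features (B, C, Hs, Ws) and template features (B, C, Ht, Wt).
    Each channel's spatial map is pooled to a fixed-length descriptor,
    so channels remain comparable even when the spatial scales of the
    two patches differ. Illustrative sketch only."""
    side = int(d ** 0.5)
    # Per-channel descriptors of equal length d for both inputs.
    q = F.adaptive_avg_pool2d(search, side).flatten(2)    # (B, C, d)
    k = F.adaptive_avg_pool2d(template, side).flatten(2)  # (B, C, d)
    # Channel-to-channel affinity: how strongly each search channel
    # correlates with each template channel.
    attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
    # Reweight the search features by the template-guided affinity.
    v = search.flatten(2)                                 # (B, C, Hs*Ws)
    return (attn @ v).view_as(search)

s = torch.randn(2, 256, 18, 18)  # search-region features
t = torch.randn(2, 256, 8, 8)    # template features
print(cross_channel_correlation(s, t).shape)  # torch.Size([2, 256, 18, 18])
```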
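
Similarly, here is a minimal sketch of a jitter measurement, assuming centre-parameterised boxes (x, y, w, h); it returns the three cross-frame variation terms the abstract names. The function name and the normalisations are hypothetical, and the paper's metric and loss may combine and weight these terms differently.

```python
import math

def jitter_terms(prev_box, cur_box):
    """Cross-frame variation of two boxes given as (x, y, w, h), with
    (x, y) the box centre. Returns the three terms named in the
    abstract: normalised centre displacement, relative area change,
    and relative aspect-ratio change. Illustrative sketch only."""
    (px, py, pw, ph), (cx, cy, cw, ch) = prev_box, cur_box
    # Centre displacement, normalised by the previous box diagonal so
    # the term is invariant to object scale.
    centre = math.hypot(cx - px, cy - py) / math.hypot(pw, ph)
    # Relative change in box area.
    area = abs(cw * ch - pw * ph) / (pw * ph)
    # Relative change in aspect ratio.
    aspect = abs(cw / ch - pw / ph) / (pw / ph)
    return centre, area, aspect

# Accumulating these terms over consecutive predictions gives a
# sequence-level smoothness score that can be monitored, or penalised
# as a loss term during training.
print(jitter_terms((50.0, 50.0, 20.0, 40.0), (52.0, 51.0, 22.0, 40.0)))
```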

Availability of Data and Materials

The datasets supporting the conclusions of this article are included within the article.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (U1836218, 62020106012, 62106089, 62202205), in part by the UK EPSRC programme grant FACER2VM (EP/N007743/1), and in part by the EPSRC/dstl/MURI project (EP/R018456/1).

Author information

Corresponding author

Correspondence to Xiao-Jun Wu.

Ethics declarations

Competing Interests

The authors declare that they have no conflict of interest.

Additional information

Communicated by Oliver Zendel.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhao, S., Xu, T., Wu, X.-J., et al. A Spatio-Temporal Robust Tracker with Spatial-Channel Transformer and Jitter Suppression. International Journal of Computer Vision, 132, 1645–1658 (2024). https://doi.org/10.1007/s11263-023-01902-x
