Abstract
The robustness of visual object tracking is reflected not only in the accuracy of target localisation in every single frame, but also in the smoothness of the predicted motion of the tracked object across consecutive frames. From the perspective of appearance modelling, the success of state-of-the-art Transformer-based trackers derives from their ability to adaptively associate the representations of related spatial regions. However, the absence of attention along the channel dimension prevents these trackers from realising their full tracking capacity. To cope with the commonly occurring misalignment of spatial scale between the template and a search patch, we propose a novel cross-channel correlation mechanism, in which the relevance of multi-channel features in the channel Transformer is modelled using two different sources of information. The result is a novel spatial-channel Transformer, which integrates information conveyed by features along both the spatial and channel directions. For temporal modelling, we propose a jitter metric that quantifies temporal smoothness by measuring the cross-frame variation of the predicted bounding boxes as a function of parameters such as centre displacement, area, and aspect ratio. As the changes an object undergoes between consecutive frames are limited, the proposed jitter loss can be used to monitor the temporal behaviour of the tracking results and penalise erroneous predictions during the training stage, thus enhancing the temporal stability of an appearance-based tracker. Extensive experiments on several well-known benchmark datasets demonstrate the robustness of the proposed tracker.
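The abstract does not give the jitter metric's exact formulation, so the following is only a minimal sketch of the idea it describes: assuming predicted boxes in (cx, cy, w, h) format, a hypothetical `jitter` function accumulates the normalised frame-to-frame variation of centre position, area, and aspect ratio.

```python
import numpy as np

def jitter(boxes):
    """Hypothetical sketch of a cross-frame jitter measure.

    `boxes` is an (N, 4) array of per-frame predictions in
    (cx, cy, w, h) format. The exact formulation in the paper is
    not stated in the abstract; here jitter is simply the mean
    frame-to-frame variation of centre position, area, and aspect
    ratio, each normalised to be scale-invariant.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    cx, cy, w, h = boxes.T
    area = w * h
    aspect = w / h

    # Centre displacement between consecutive frames, normalised by
    # the box diagonal so the measure does not depend on object size.
    diag = np.sqrt(w ** 2 + h ** 2)
    centre_shift = np.hypot(np.diff(cx), np.diff(cy)) / diag[:-1]

    # Relative change of area and aspect ratio across frames.
    area_change = np.abs(np.diff(area)) / area[:-1]
    aspect_change = np.abs(np.diff(aspect)) / aspect[:-1]

    return centre_shift.mean() + area_change.mean() + aspect_change.mean()
```

Under this reading, a smooth trajectory yields a value near zero, and the quantity can be penalised directly as a training loss on sequences of predicted boxes.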
Availability of Data and Materials
The datasets supporting the conclusions of this article are included within the article.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (U1836218, 62020106012, 62106089, 62202205), in part by the UK EPSRC programme grant EP/N007743/1 (FACER2VM), and in part by the EPSRC/dstl/MURI project EP/R018456/1.
Ethics declarations
Competing Interests
The authors declare that they have no conflict of interest.
Additional information
Communicated by Oliver Zendel.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, S., Xu, T., Wu, XJ. et al. A Spatio-Temporal Robust Tracker with Spatial-Channel Transformer and Jitter Suppression. Int J Comput Vis 132, 1645–1658 (2024). https://doi.org/10.1007/s11263-023-01902-x