Abstract
The robustness of visual object tracking is reflected not only in the accuracy of target localisation in every single frame, but also in the smoothness of the predicted motion of the tracked object across consecutive frames. From the perspective of appearance modelling, the success of state-of-the-art Transformer-based trackers derives from their ability to adaptively associate the representations of related spatial regions. However, the absence of attention along the channel dimension prevents these trackers from realising their full tracking capacity. To cope with the commonly occurring misalignment of spatial scale between the template and a search patch, we propose a novel cross-channel correlation mechanism, in which the relevance of multi-channel features in the channel Transformer is modelled using two different sources of information. The result is a novel spatial-channel Transformer, which integrates information conveyed by features along both the spatial and channel directions. For temporal modelling, we propose a jitter metric that quantifies temporal smoothness by measuring the cross-frame variation of the predicted bounding boxes as a function of parameters such as centre displacement, area, and aspect ratio. As the changes an object undergoes between consecutive frames are limited, the proposed jitter loss can be used to monitor the temporal behaviour of the tracking results and penalise erroneous predictions during the training stage, thus enhancing the temporal stability of an appearance-based tracker. Extensive experiments on several well-known benchmark datasets demonstrate the robustness of the proposed tracker.
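The abstract does not give the jitter metric's exact formulation, so the following is only a minimal sketch of the idea it describes: assuming predicted boxes in (cx, cy, w, h) format, a hypothetical `jitter` function accumulates the normalised frame-to-frame variation of centre position, area, and aspect ratio.

```python
import numpy as np

def jitter(boxes):
    """Hypothetical sketch of a cross-frame jitter measure.

    `boxes` is an (N, 4) array of per-frame predictions in
    (cx, cy, w, h) format. The exact formulation in the paper is
    not stated in the abstract; here jitter is simply the mean
    frame-to-frame variation of centre position, area, and aspect
    ratio, each normalised to be scale-invariant.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    cx, cy, w, h = boxes.T
    area = w * h
    aspect = w / h

    # Centre displacement between consecutive frames, normalised by
    # the box diagonal so the measure does not depend on object size.
    diag = np.sqrt(w ** 2 + h ** 2)
    centre_shift = np.hypot(np.diff(cx), np.diff(cy)) / diag[:-1]

    # Relative change of area and aspect ratio across frames.
    area_change = np.abs(np.diff(area)) / area[:-1]
    aspect_change = np.abs(np.diff(aspect)) / aspect[:-1]

    return centre_shift.mean() + area_change.mean() + aspect_change.mean()
```

Under this reading, a smooth trajectory yields a value near zero, and the quantity can be penalised directly as a training loss on sequences of predicted boxes.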
Availability of Data and Materials
The datasets supporting the conclusions of this article are included within the article.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (U1836218, 62020106012, 62106089, 62202205), in part by the UK EPSRC programme grant EP/N007743/1 (FACER2VM), and in part by the EPSRC/dstl/MURI project EP/R018456/1.
Ethics declarations
Competing Interests
The authors declare that they have no conflict of interest.
Additional information
Communicated by Oliver Zendel.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, S., Xu, T., Wu, XJ. et al. A Spatio-Temporal Robust Tracker with Spatial-Channel Transformer and Jitter Suppression. Int J Comput Vis 132, 1645–1658 (2024). https://doi.org/10.1007/s11263-023-01902-x