Abstract
We propose a simple and effective method that casts synchronous tracking and reconstruction of non-rigid dynamic objects from a single RGB-D camera as an aligned sequential point cloud prediction problem. Our method requires no additional data transformations (e.g., truncated signed distance functions or deformation graphs), alignment constraints (e.g., handcrafted features or optical flow), or prior regularizers (e.g., as-rigid-as-possible or embedded deformation). We propose TR4TR, an end-to-end TRansformer architecture for synchronous Tracking and Reconstruction of non-rigid dynamic targets from monocular RGB-D images. It uses a combined spatial-temporal 2D image encoder that encodes features directly from RGB-D image sequences, and a 3D point decoder that generates an aligned sequential point cloud containing the tracking and reconstruction results. TR4TR outperforms the baselines on the DeepDeform non-rigid dataset, improving on the state-of-the-art method by 8.82% on the deformation error metric, and it is more robust when the target undergoes large inter-frame deformation. The code is available at https://github.com/xfliu1998/tr4tr-main.
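To make the encoder-decoder idea in the abstract concrete, the following PyTorch sketch shows one plausible shape of such a pipeline: spatial-temporal tokens from two RGB-D frames are encoded by a transformer, and a set of learned queries decodes an aligned 3D point cloud. This is a minimal illustration only; the module names, token sizes, query count, and output shapes are assumptions and do not reproduce the released TR4TR implementation linked above.

```python
# Minimal sketch (NOT the released TR4TR code): transformer encoder over
# spatial-temporal patch tokens from two RGB-D frames, plus a point decoder
# that regresses an aligned point cloud. All sizes are illustrative.
import torch
import torch.nn as nn

class TR4TRSketch(nn.Module):
    def __init__(self, patch=16, dim=256, depth=4, heads=8, points=1024):
        super().__init__()
        # Each RGB-D frame has 4 channels (RGB + depth); both frames are
        # patchified with the same embedding so tokens carry spatial and
        # temporal context. Positional/temporal embeddings are omitted here.
        self.patch_embed = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        # Point decoder: learned queries attend to the image tokens and each
        # query regresses one 3D point of the aligned output cloud.
        self.queries = nn.Parameter(torch.randn(points, dim))
        dec_layer = nn.TransformerDecoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, depth)
        self.to_xyz = nn.Linear(dim, 3)

    def forward(self, rgbd_t0, rgbd_t1):
        # rgbd_*: (B, 4, H, W) RGB-D frames at consecutive time steps.
        tokens = torch.cat(
            [self.patch_embed(x).flatten(2).transpose(1, 2) for x in (rgbd_t0, rgbd_t1)],
            dim=1,
        )                                            # (B, 2*N_patches, dim)
        memory = self.encoder(tokens)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.to_xyz(self.decoder(q, memory))  # (B, points, 3)

# Usage: predict an aligned point cloud for a 480x640 RGB-D frame pair.
model = TR4TRSketch()
cloud = model(torch.randn(1, 4, 480, 640), torch.randn(1, 4, 480, 640))
```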
Ethics declarations
Conflict of Interest Statement
This work was supported in part by the National Natural Science Foundation of China (NSFC)/Research Grants Council (RGC) of Hong Kong Joint Research Scheme under Grant 62361166630; in part by NSFC under Grant 62273323; and in part by the Shenzhen Key Basic Research Project (Grant No. JCYJ20241202124427037). The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Communicated by Yasuyuki Matsushita.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file 1 (mp4 62884 KB)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, X., Yi, Z., Wu, X. et al. Spatial-Temporal Transformer for Single RGB-D Camera Synchronous Tracking and Reconstruction of Non-rigid Dynamic Objects. Int J Comput Vis 133, 6015–6024 (2025). https://doi.org/10.1007/s11263-025-02469-5