Spatial-Temporal Transformer for Single RGB-D Camera Synchronous Tracking and Reconstruction of Non-rigid Dynamic Objects

International Journal of Computer Vision

Abstract

We propose a simple and effective method that views single RGB-D camera synchronous tracking and reconstruction of non-rigid dynamic objects as an aligned sequential point cloud prediction problem. Our method requires no additional data transformations (truncated signed distance functions, deformation graphs, etc.), alignment constraints (handcrafted features, optical flow, etc.), or prior regularizers (as-rigid-as-possible, embedded deformation, etc.). We propose an end-to-end architecture, called TR4TR, a TRansformer for synchronous Tracking and Reconstruction of non-rigid dynamic targets from monocular RGB-D images. A spatial-temporal combined 2D image encoder directly encodes features from the RGB-D image sequence, and a 3D point decoder generates an aligned sequential point cloud containing the tracking and reconstruction results. TR4TR outperforms the baselines on the DeepDeform non-rigid dataset and improves on the state-of-the-art method by 8.82% on the deformation error metric. In addition, TR4TR is more robust when the target undergoes large inter-frame deformation. The code is available at https://github.com/xfliu1998/tr4tr-main.
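
The encoder-decoder design described above can be made concrete with a short sketch. Below is a minimal PyTorch sketch of a TR4TR-style pipeline: a spatial-temporal transformer encoder over patch tokens drawn jointly from a short RGB-D frame sequence, followed by a transformer point decoder that regresses an aligned sequential point cloud end to end. All module names, patch/token sizes, query counts, and layer depths are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Hypothetical TR4TR-style sketch: joint space-time encoding of RGB-D frames,
# then query-based regression of per-frame, row-aligned point clouds.
import torch
import torch.nn as nn


class SpatialTemporalEncoder(nn.Module):
    """Jointly encodes T RGB-D frames (4 channels each) as one token sequence."""

    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6, frames=2):
        super().__init__()
        self.patch_embed = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
        n_tokens = frames * (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, rgbd_seq):                            # (B, T, 4, H, W)
        b = rgbd_seq.shape[0]
        tokens = self.patch_embed(rgbd_seq.flatten(0, 1))   # (B*T, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)           # (B*T, N, dim)
        tokens = tokens.reshape(b, -1, tokens.shape[-1])     # joint space-time tokens
        return self.encoder(tokens + self.pos_embed)         # (B, T*N, dim)


class PointDecoder(nn.Module):
    """Regresses an aligned point cloud for every frame from the encoded tokens."""

    def __init__(self, dim=384, points_per_frame=1024, frames=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, frames * points_per_frame, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, 6, dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.to_xyz = nn.Linear(dim, 3)

    def forward(self, memory):                               # (B, T*N, dim)
        q = self.queries.expand(memory.shape[0], -1, -1)
        return self.to_xyz(self.decoder(q, memory))          # (B, T*P, 3)


# Usage: two consecutive RGB-D frames in, two row-aligned point clouds out.
encoder, decoder = SpatialTemporalEncoder(), PointDecoder()
rgbd = torch.randn(1, 2, 4, 224, 224)                        # one 2-frame RGB-D clip
points = decoder(encoder(rgbd))                              # shape (1, 2*1024, 3)
```

Because the decoder's output rows are in fixed correspondence across frames, a single output tensor of this form can simultaneously provide each frame's reconstruction and the per-point motion between frames, which is the sense in which tracking and reconstruction are produced synchronously.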


References

  • Alcantarilla, P. F., Bartoli, A., & Davison, A. J. (2012). KAZE features. In: Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI, Springer (pp. 214–227).

  • Amberg, B., Romdhani, S., & Vetter, T. (2007). Optimal step nonrigid ICP algorithms for surface registration. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE (pp. 1–8).

  • Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? In: ICML (pp. 4).

  • Besl, P. J., & McKay, N. D. (1992). A method for registration of 3-D shapes. In: Sensor Fusion IV: Control Paradigms and Data Structures, SPIE (pp. 586–606).

  • Bozic, A., Palafox, P., Zollhöfer, M., et al. (2020a). Neural non-rigid tracking. Advances in Neural Information Processing Systems, 33, 18727–18737.

  • Bozic, A., Zollhofer, M., Theobalt, C., et al. (2020b). DeepDeform: Learning non-rigid RGB-D reconstruction with semi-supervised data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7002–7012).

  • Bozic, A., Palafox, P., Thies, J., et al. (2021). TransformerFusion: Monocular RGB scene reconstruction using transformers. Advances in Neural Information Processing Systems, 34, 1403–1414.

  • Brown, T., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

  • Cai, H., Feng, W., Feng, X., et al. (2022). Neural surface reconstruction of dynamic scenes with monocular RGB-D camera. arXiv preprint arXiv:2206.15258.

  • Carion, N., Massa, F., Synnaeve, G., et al. (2020). End-to-end object detection with transformers. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I, Springer (pp. 213–229).

  • Curless, B., & Levoy, M. (1996). A volumetric method for building complex models from range images. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (pp. 303–312).

  • Devlin, J., Chang, M. W., Lee, K., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  • Dosovitskiy, A., Fischer, P., Ilg, E., et al. (2015). FlowNet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision (pp. 2758–2766).

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

  • Gu, X., Wang, Y., Wu, C., et al. (2019). HPLFlowNet: Hierarchical permutohedral lattice FlowNet for scene flow estimation on large-scale point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3254–3263).

  • Hao, Y., Song, H., Dong, L., et al. (2022). Language models are general-purpose interfaces. arXiv preprint arXiv:2206.06336.

  • He, K., Chen, X., Xie, S., et al. (2022). Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000–16009).

  • Innmann, M., Zollhöfer, M., Nießner, M., et al. (2016). VolumeDeform: Real-time volumetric non-rigid reconstruction. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, Springer (pp. 362–379).

  • Kazhdan, M., & Hoppe, H. (2013). Screened Poisson surface reconstruction. ACM Transactions on Graphics, 32(3). https://doi.org/10.1145/2487228.2487237.

  • Li, J., Li, D., Xiong, C., et al. (2022). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, PMLR (pp. 12888–12900).

  • Li, Y., Bozic, A., Zhang, T., et al. (2020). Learning to optimize non-rigid tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4910–4918).

  • Li, Y., Takehara, H., Taketomi, T., et al. (2021). 4DComplete: Non-rigid motion estimation beyond the observable surface. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 12706–12716).

  • Lin, W., Zheng, C., Yong, J. H., et al. (2022). OcclusionFusion: Occlusion-aware motion estimation for real-time dynamic 3D reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1736–1745).

  • Liu, X., Qi, C. R., & Guibas, L. J. (2019). FlowNet3D: Learning scene flow in 3D point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 529–537).

  • Liu, Z., Lin, Y., Cao, Y., et al. (2021). Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).

  • Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91–110.

  • Lu, J., Clark, C., Zellers, R., et al. (2022). Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.

  • Ma, W. C., Wang, S., Hu, R., et al. (2019). Deep rigid instance scene flow. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3614–3622).

  • Mildenhall, B., Srinivasan, P. P., Tancik, M., et al. (2021). NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1), 99–106.

  • Newcombe, R. A., Fox, D., & Seitz, S. M. (2015). DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 343–352).

  • Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

  • Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, PMLR (pp. 8748–8763).

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, Springer (pp. 234–241).

  • Slavcheva, M., Baust, M., Cremers, D., et al. (2017). KillingFusion: Non-rigid 3D reconstruction without correspondences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1386–1395).

  • Slavcheva, M., Baust, M., & Ilic, S. (2018). SobolevFusion: 3D reconstruction of scenes undergoing free non-rigid motion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2646–2655).

  • Sorkine, O., & Alexa, M. (2007). As-rigid-as-possible surface modeling. In: Symposium on Geometry Processing (pp. 109–116).

  • Sumner, R. W., Schmid, J., & Pauly, M. (2007). Embedded deformation for shape manipulation. In: ACM SIGGRAPH 2007 Papers.

  • Tombari, F., Salti, S., & Di Stefano, L. (2010). Unique signatures of histograms for local surface description. In: Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part III, Springer (pp. 356–369).

  • Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

  • Vo, K., Truong, S., Yamazaki, K., et al. (2023). AOE-Net: Entities interactions modeling with adaptive attention mechanism for temporal action proposals generation. International Journal of Computer Vision, 131(1), 302–323.

  • Wang, D., Cui, X., Chen, X., et al. (2021). Multi-view 3D reconstruction with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5722–5731).

  • Wang, W., Bao, H., Dong, L., et al. (2022). Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442.

  • Yahui, L., Sangineto, E., Wei, B., et al. (2021). Efficient training of visual transformers with small-size datasets. Advances in Neural Information Processing Systems, 29, 23818–23830.

  • Zhao, H., Jiang, L., Jia, J., et al. (2021). Point Transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16259–16268).


Author information

Corresponding author

Correspondence to Wanfeng Shang.

Ethics declarations

Conflict of Interest Statement

This work was supported in part by the National Natural Science Foundation of China (NSFC)/Research Grants Council (RGC) of Hong Kong Joint Research Scheme under Grant 62361166630; in part by NSFC under Grant 62273323; and in part by the Shenzhen Key Basic Research Project (Grant No. JCYJ20241202124427037). The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Communicated by Yasuyuki Matsushita.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (mp4 62884 KB)

Supplementary file 2 (pdf 4583 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, X., Yi, Z., Wu, X. et al. Spatial-Temporal Transformer for Single RGB-D Camera Synchronous Tracking and Reconstruction of Non-rigid Dynamic Objects. Int J Comput Vis 133, 6015–6024 (2025). https://doi.org/10.1007/s11263-025-02469-5


  • DOI: https://doi.org/10.1007/s11263-025-02469-5

Keywords