Abstract
We propose a simple and effective method that casts synchronous tracking and reconstruction of non-rigid dynamic objects from a single RGB-D camera as an aligned sequential point cloud prediction problem. Our method requires no additional data transformations (e.g., truncated signed distance functions or deformation graphs), alignment constraints (e.g., handcrafted features or optical flow), or prior regularizers (e.g., as-rigid-as-possible or embedded deformation). We propose TR4TR, an end-to-end TRansformer architecture for synchronous Tracking and Reconstruction of non-rigid dynamic targets from monocular RGB-D images. It uses a combined spatial-temporal 2D image encoder that encodes features directly from RGB-D image sequences, and a 3D point decoder that generates an aligned sequential point cloud containing the tracking and reconstruction results. TR4TR outperforms the baselines on the DeepDeform non-rigid dataset, improving on the state-of-the-art method by 8.82% on the deformation error metric, and it is more robust when the target undergoes large inter-frame deformation. The code is available at https://github.com/xfliu1998/tr4tr-main.
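To make the encoder-decoder idea in the abstract concrete, the following PyTorch sketch shows one plausible shape of such a pipeline: spatial-temporal tokens from two RGB-D frames are encoded by a transformer, and a set of learned queries decodes an aligned 3D point cloud. This is a minimal illustration only; the module names, token sizes, query count, and output shapes are assumptions and do not reproduce the released TR4TR implementation linked above.

```python
# Minimal sketch (NOT the released TR4TR code): transformer encoder over
# spatial-temporal patch tokens from two RGB-D frames, plus a point decoder
# that regresses an aligned point cloud. All sizes are illustrative.
import torch
import torch.nn as nn

class TR4TRSketch(nn.Module):
    def __init__(self, patch=16, dim=256, depth=4, heads=8, points=1024):
        super().__init__()
        # Each RGB-D frame has 4 channels (RGB + depth); both frames are
        # patchified with the same embedding so tokens carry spatial and
        # temporal context. Positional/temporal embeddings are omitted here.
        self.patch_embed = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        # Point decoder: learned queries attend to the image tokens and each
        # query regresses one 3D point of the aligned output cloud.
        self.queries = nn.Parameter(torch.randn(points, dim))
        dec_layer = nn.TransformerDecoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, depth)
        self.to_xyz = nn.Linear(dim, 3)

    def forward(self, rgbd_t0, rgbd_t1):
        # rgbd_*: (B, 4, H, W) RGB-D frames at consecutive time steps.
        tokens = torch.cat(
            [self.patch_embed(x).flatten(2).transpose(1, 2) for x in (rgbd_t0, rgbd_t1)],
            dim=1,
        )                                            # (B, 2*N_patches, dim)
        memory = self.encoder(tokens)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.to_xyz(self.decoder(q, memory))  # (B, points, 3)

# Usage: predict an aligned point cloud for a 480x640 RGB-D frame pair.
model = TR4TRSketch()
cloud = model(torch.randn(1, 4, 480, 640), torch.randn(1, 4, 480, 640))
```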
Ethics declarations
Conflict of Interest Statement
This work was supported in part by the National Natural Science Foundation of China (NSFC)/Research Grants Council (RGC) of Hong Kong Joint Research Scheme under Grant 62361166630; in part by NSFC under Grant 62273323; and in part by the Shenzhen Key Basic Research Project (Grant No. JCYJ20241202124427037). The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Communicated by Yasuyuki Matsushita.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file 1 (mp4 62884 KB)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, X., Yi, Z., Wu, X. et al. Spatial-Temporal Transformer for Single RGB-D Camera Synchronous Tracking and Reconstruction of Non-rigid Dynamic Objects. Int J Comput Vis 133, 6015–6024 (2025). https://doi.org/10.1007/s11263-025-02469-5