Abstract
In this paper, we focus on the challenges of modeling deformable 3D objects from casual videos. With the popularity of NeRF, many works extend it to dynamic scenes with a canonical NeRF and a deformation model that transforms 3D points between the observation space and the canonical space. Recent works rely on linear blend skinning (LBS) to achieve this canonical-observation transformation. However, a linearly weighted combination of rigid transformation matrices is not guaranteed to be rigid; in practice, unexpected scale and shear factors appear, and using LBS as the deformation model often leads to skin-collapsing artifacts for bending or twisting motions. To solve this problem, we propose neural dual quaternion blend skinning (NeuDBS) for 3D point deformation, which performs rigid transformation without skin-collapsing artifacts. To register 2D pixels across different frames, we establish a correspondence between canonical feature embeddings that encode 3D points within the canonical space and 2D image features by solving an optimal transport problem. In addition, we introduce a texture filtering approach for texture rendering that effectively minimizes the impact of noisy colors outside the target deformable objects.
Availability of Data and Materials
Code Availability
Will be available upon acceptance.
References
Badger, M., Wang, Y., Modh, A., Perkes, A., Kolotouros, N., Pfrommer, B. G., Schmidt, M. F., & Daniilidis, K. (2020). 3D bird reconstruction: A dataset, model, and shape recovery from a single view. In European Conference on Computer Vision (pp. 1–17).
Biggs, B., Boyne, O., Charles, J., Fitzgibbon, A., & Cipolla, R. (2020). Who left the dogs out? 3D animal reconstruction with expectation maximization in the loop. In European Conference on Computer Vision (pp. 195–211).
Chen, A., Xu, Z., Zhao, F., Zhang, X., Xiang, F., Yu, J., & Su, H. (2021). MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 14124–14133).
Chen, C., Yang, X., Yang, F., Feng, C., Fu, Z., Foo, C.S., Lin, G., & Liu, F. (2024). Sculpt3D: Multi-view consistent text-to-3D generation with sparse 3D prior. arXiv Preprint arXiv:2403.09140.
Fan, H., Su, H., & Guibas, L. J. (2017). A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 605–613).
Goel, S., Kanazawa, A., & Malik, J. (2020). Shape and viewpoint without keypoints. In European Conference on Computer Vision (pp. 88–104).
Gropp, A., Yariv, L., Haim, N., Atzmon, M., & Lipman, Y. (2020). Implicit geometric regularization for learning shapes. arXiv Preprint arXiv:2002.10099.
Hejl, J. (2004). Hardware skinning with quaternions. In A. Kirmse (Ed.), Game Programming Gems 4 (pp. 487–495). Charles River Media.
Henzler, P., Reizenstein, J., Labatut, P., Shapovalov, R., Ritschel, T., Vedaldi, A., & Novotny, D. (2021). Unsupervised learning of 3D object categories from videos in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4700–4709).
Jacobson, A., Deng, Z., Kavan, L., & Lewis, J. P. (2014). Skinning: Real-time shape deformation. In ACM SIGGRAPH Courses (p. 1).
Jiang, W., Yi, K. M., Samei, G., Tuzel, O., & Ranjan, A. (2022). NeuMan: Neural human radiance field from a single video. arXiv Preprint arXiv:2203.12575.
Kanazawa, A., Tulsiani, S., Efros, A. A., & Malik, J. (2018). Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 371–386).
Kavan, L., Collins, S., Žára, J., & O’Sullivan, C. (2007). Skinning with dual quaternions. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games (pp. 39–46).
Kavan, L., Collins, S., Žára, J., & O’Sullivan, C. (2008). Geometric skinning with approximate dual quaternion blending. ACM Transactions on Graphics (TOG), 27(4), 1–23.
Kirillov, A., Wu, Y., He, K., & Girshick, R. (2020). PointRend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9799–9808).
Kocabas, M., Athanasiou, N., & Black, M. J. (2020). Vibe: Video inference for human body pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5253–5263).
Kulkarni, N., Gupta, A., Fouhey, D. F., & Tulsiani, S. (2020). Articulation-aware canonical surface mapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 452–461).
Lewis, J. P., Cordner, M., & Fong, N. (2000). Pose space deformation: A unified approach to shape interpolation and skeleton-driven deformation. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (pp. 165–172).
Li, R., Lin, G., & Xie, L. (2021). Self-point-flow: Self-supervised scene flow estimation from point clouds with optimal transport and random walk. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 15577–15586).
Li, X., Liu, S., Kim, K., Mello, S. D., Jampani, V., Yang, M. H., & Kautz, J. (2020). Self-supervised single-view 3d reconstruction via semantic consistency. In European Conference on Computer Vision (pp. 677–693).
Li, Z., Niklaus, S., Snavely, N., & Wang, O. (2021). Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6498–6508).
Li, R., Tanke, J., Vo, M., Zollhöfer, M., Gall, J., Kanazawa, A., & Lassner, C. (2022). TAVA: Template-free animatable volumetric actors. In European Conference on Computer Vision (pp. 419–436).
Li, X., Liu, S., De Mello, S., Kim, K., Wang, X., Yang, M. H., & Kautz, J. (2020). Online adaptation for consistent mesh reconstruction in the wild. Advances in Neural Information Processing Systems, 33, 15009–15019.
Li, L., Shen, Z., Shen, L., Tan, P., et al. (2022). Streaming radiance fields for 3d video synthesis. Advances in Neural Information Processing Systems, 35, 13485–13498.
Liu, L., Habermann, M., Rudnev, V., Sarkar, K., Gu, J., & Theobalt, C. (2021). Neural actor: Neural free-view synthesis of human actors with pose control. ACM Transactions on Graphics (TOG), 40(6), 1–16.
Liu, W., Zhang, C., Ding, H., Hung, T. Y., & Lin, G. (2022). Few-shot segmentation with optimal transport matching and message flow. IEEE Transactions on Multimedia, 25, 5130–5141.
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., & Black, M. J. (2015). SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6), 1–16.
Mahmood, N., Ghorbani, N., Troje, N. F., Pons-Moll, G., & Black, M. J. (2019). AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5442–5451).
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2021). NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1), 99–106.
Neverova, N., Sanakoyeu, A., Labatut, P., Novotny, D., & Vedaldi, A. (2021). Discovering relationships between object categories via universal canonical maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 404–413).
Neverova, N., Novotny, D., Szafraniec, M., Khalidov, V., Labatut, P., & Vedaldi, A. (2020). Continuous surface embeddings. Advances in Neural Information Processing Systems, 33, 17258–17270.
Noguchi, A., Sun, X., Lin, S., & Harada, T. (2021). Neural articulated radiance field. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5762–5772).
Novotny, D., Larlus, D., & Vedaldi, A. (2017). Learning 3D object categories by looking around them. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5218–5227).
Oechsle, M., Peng, S., & Geiger, A. (2021). Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5589–5599).
Park, K., Sinha, U., Barron, J. T., Bouaziz, S., Goldman, D. B., Seitz, S. M., & Martin-Brualla, R. (2021a). Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5865–5874).
Park, K., Sinha, U., Hedman, P., Barron, J. T., Bouaziz, S., Goldman, D. B., Martin-Brualla, R., & Seitz, S. M. (2021b). HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields. ACM Transactions on Graphics (TOG), 40(6), 1–12.
Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A. A., Tzionas, D., & Black, M. J. (2019). Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10975–10985).
Peng, S., Dong, J., Wang, Q., Zhang, S., Shuai, Q., Zhou, X., & Bao, H. (2021). Animatable neural radiance fields for modeling dynamic human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 14314–14323).
Peng, S., Zhang, Y., Xu, Y., Wang, Q., Shuai, Q., Bao, H., & Zhou, X. (2021). Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9054–9063).
Pumarola, A., Corona, E., Pons-Moll, G., & Moreno-Noguer, F. (2021). D-NeRF: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10318–10327).
Puy, G., Boulch, A., & Marlet, R. (2020). FLOT: Scene flow on point clouds guided by optimal transport. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII (pp. 527–544).
Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., & Li, H. (2019). PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2304–2314).
Saito, S., Simon, T., Saragih, J., & Joo, H. (2020). PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 84–93).
Shi, Y., Rong, D., Ni, B., Chen, C., & Zhang, W. (2022). GARF: Geometry-aware generalized neural radiance field. arXiv Preprint arXiv:2212.02280.
Shi, H., Wei, J., Li, R., Liu, F., & Lin, G. (2022). Weakly supervised segmentation on outdoor 4d point clouds with temporal matching and spatial graph propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 11840–11849).
Shi, Y., Xiong, Y., Ni, B., & Zhang, W. (2023). USR: Unsupervised separated 3d garment and human reconstruction via geometry and semantic consistency. arXiv Preprint arXiv:2302.10518.
Sinkhorn, R. (1967). Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4), 402–405.
Song, L., Chen, A., Li, Z., Chen, Z., Chen, L., Yuan, J., Xu, Y., & Geiger, A. (2022). NeRFPlayer: A streamable dynamic scene representation with decomposed neural radiance fields. arXiv Preprint arXiv:2210.15947.
Song, C., Wei, J., Li, R., Liu, F., & Lin, G. (2021). 3D pose transfer with correspondence learning and mesh refinement. Advances in Neural Information Processing Systems, 34, 3108–3120.
Song, C., Wei, J., Li, R., Liu, F., & Lin, G. (2023). Unsupervised 3D pose transfer with cross consistency and dual reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8), 1–13. https://doi.org/10.1109/TPAMI.2023.3259059
Su, S. Y., Yu, F., Zollhöfer, M., & Rhodin, H. (2021). A-nerf: Articulated neural radiance fields for learning human shape, appearance, and pose. Advances in Neural Information Processing Systems, 34, 12278–12291.
Tretschk, E., Tewari, A., Golyanik, V., Zollhöfer, M., Lassner, C., & Theobalt, C. (2021). Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 12959–12970).
Vlasic, D., Baran, I., Matusik, W., & Popović, J. (2008). Articulated mesh animation from multi-view silhouettes. In ACM SIGGRAPH 2008 Papers (pp. 1–9).
Vo, M. P., Sheikh, Y. A., & Narasimhan, S. G. (2020). Spatiotemporal bundle adjustment for dynamic 3D human reconstruction in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2), 1066–1080.
Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., & Wang, W. (2021). NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv Preprint arXiv:2106.10689.
Wei, J., Wang, H., Feng, J., Lin, G., & Yap, K. H. (2023). TAPS3D: Text-guided 3d textured shape generation from pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 16805–16815).
Weng, C. Y., Curless, B., Srinivasan, P. P., Barron, J. T., & Kemelmacher-Shlizerman, I. (2022). HumanNerf: Free-viewpoint rendering of moving people from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16210–16220).
Wu, S., Jakab, T., Rupprecht, C., & Vedaldi, A. (2021). DOVE: Learning deformable 3D objects by watching videos. arXiv Preprint arXiv:2107.10844.
Wu, Q., Liu, X., Chen, Y., Li, K., Zheng, C., Cai, J., & Zheng, J. (2022). Object-compositional neural implicit surfaces. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII (pp. 197–213).
Xiang, D., Joo, H., & Sheikh, Y. (2019). Monocular total capture: Posing face, body, and hands in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10965–10974).
Yang, F., & Lin, G. (2021). CT-Net: Complementary transfering network for garment transfer with arbitrary geometric changes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 9899–9908).
Yang, G., & Ramanan, D. (2019). Volumetric correspondence networks for optical flow. Advances in Neural Information Processing Systems, 32.
Yang, F., Chen, T., He, X., Cai, Z., Yang, L., Wu, S., & Lin, G. (2023). Attrihuman-3D: Editable 3D human avatar generation with attribute decomposition and indexing. arXiv Preprint arXiv:2312.02209.
Yang, G., Sun, D., Jampani, V., Vlasic, D., Cole, F., Chang, H., Ramanan, D., Freeman, W. T., & Liu, C. (2021). LASR: Learning articulated shape reconstruction from a monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15980–15989).
Yang, G., Vo, M., Neverova, N., Ramanan, D., Vedaldi, A., & Joo, H. (2022). BANMo: Building animatable 3d neural models from many casual videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2863–2873).
Yang, G., Wang, C., Reddy, N. D., & Ramanan, D. (2023). Reconstructing animatable categories from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16995–17005).
Yang, G., Sun, D., Jampani, V., Vlasic, D., Cole, F., Liu, C., & Ramanan, D. (2021). Viser: Video-specific surface embeddings for articulated 3d shape reconstruction. Advances in Neural Information Processing Systems, 34, 19326–19338.
Yariv, L., Gu, J., Kasten, Y., & Lipman, Y. (2021). Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems, 34, 4805–4815.
Yariv, L., Kasten, Y., Moran, D., Galun, M., Atzmon, M., Ronen, B., & Lipman, Y. (2020). Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2492–2502.
Ye, Y., Tulsiani, S., & Gupta, A. (2021). Shelf-supervised mesh prediction in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8843–8852).
Yu, A., Ye, V., Tancik, M., & Kanazawa, A. (2021). pixelNeRF: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4578–4587).
Zhang, J., Yang, G., Tulsiani, S., & Ramanan, D. (2021). NERS: Neural reflectance surfaces for sparse-view 3d reconstruction in the wild. Advances in Neural Information Processing Systems, 34, 29835–29847.
Zhi, S., Laidlow, T., Leutenegger, S., & Davison, A. J. (2021). In-place scene labelling and understanding with implicit scene representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15838–15847).
Zuffi, S., Kanazawa, A., & Black, M. J. (2018). Lions and tigers and bears: Capturing non-rigid, 3d, articulated shape from images. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 3955–3963).
Zuffi, S., Kanazawa, A., Berger-Wolf, T., & Black, M. J. (2019). Three-D safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5359–5368).
Zuffi, S., Kanazawa, A., Jacobs, D. W., & Black, M. J. (2017). 3D menagerie: Modeling the 3D shape and pose of animals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6365–6373).
Funding
This research is supported by the Agency for Science, Technology and Research (A*STAR) under its MTC Programmatic Funds (Grant No. M23L7b0021).
Author information
Authors and Affiliations
Contributions
Conceptualization: [Chaoyue Song]; Methodology: [Chaoyue Song]; Formal analysis and investigation: [Chaoyue Song]; Experiments: [Chaoyue Song], [Tianyi Chen], [Yiwen Chen]; Writing—original draft preparation: [Chaoyue Song]; Discussion: [Jiacheng Wei], [Chuan-Sheng Foo], [Fayao Liu], [Guosheng Lin]; Supervision: [Fayao Liu], [Guosheng Lin].
Corresponding author
Ethics declarations
Conflict of interest
Author Chaoyue Song has received research support from A*STAR. Authors Fayao Liu and Chuan-Sheng Foo receive salaries from A*STAR. Author Guosheng Lin receives a salary from Nanyang Technological University.
Ethical Approval
Not applicable.
Consent to Participate
Yes.
Consent for Publication
Yes.
Additional information
Communicated by Xiaowei Zhou.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file 1 (mp4 23599 KB)
Supplementary file 2 (mp4 6341 KB)
Supplementary file 3 (mp4 15116 KB)
Appendices
Appendix A: More Details of MoDA
A.1 Skinning Weights for NeuDBS
We define the skinning weights for NeuDBS as \(\textbf{W} = \{W_{1},..., W_{J}\} \in \mathbb {R}^{J}\), where J is the number of joints. Skinning weights learned purely by a neural network are difficult to optimize. To obtain the skinning weights for the proposed NeuDBS, we therefore first compute Gaussian skinning weights and then learn residual skinning weights with an MLP network, following Yang et al. (2022).
First, we compute the Gaussian skinning weights based on the Mahalanobis distance between 3D points and the Gaussian ellipsoids,
where \(\textbf{O} \in \mathbb {R}^{J \times 3}\) are the joint center locations, \(\textbf{V} \in \mathbb {R}^{J \times 3 \times 3}\) are joint orientations, and \(\varvec{\Lambda }^{0} \in \mathbb {R}^{J \times 3 \times 3}\) are diagonal scale matrices. Each joint is represented by an explicit 3D Gaussian ellipsoid composed of these three elements: center, orientation, and scale. To learn better skinning weights for 3D deformation, we predict residual skinning weights with an MLP network,
then we have the final skinning weights,
To be specific, the skinning weights \(\textbf{W}^{t}_{o\rightarrow c}\) are learned from 3D points in the observation space and the body pose code \(\varvec{\psi }_{b}^{t}\) at time t, and \(\textbf{W}^{t}_{c\rightarrow o}\) are learned from 3D points in the canonical space and the rest pose code \(\varvec{\psi }_{b}^{*}\).
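The pipeline described above can be sketched as follows in PyTorch. This is a minimal illustration, not the released implementation: the tensor shapes, the `delta_mlp` interface, and the softmax normalization of the final weights are our assumptions.

```python
import torch
import torch.nn.functional as F

def neudbs_skinning_weights(x, centers, orients, scales, delta_mlp, pose_code):
    """Sketch of NeuDBS skinning weights (shapes and interfaces are assumed).

    x:         (N, 3)     query points (observation or canonical space)
    centers:   (J, 3)     Gaussian ellipsoid centers O
    orients:   (J, 3, 3)  joint orientations V
    scales:    (J, 3, 3)  diagonal scale matrices Lambda^0
    delta_mlp: MLP predicting (N, J) residual weights from points and a pose code
    pose_code: (D,)       body pose code psi_b^t (or rest pose code psi_b^*)
    """
    # Mahalanobis distance between each 3D point and each Gaussian ellipsoid
    diff = x[:, None, :] - centers[None, :, :]                       # (N, J, 3)
    local = torch.einsum('jab,njb->nja', orients, diff)              # rotate into joint frames
    mdist = torch.einsum('nja,jab,njb->nj', local, scales, local)    # (N, J)

    # Residual skinning weights predicted by an MLP from the point and pose code
    pose = pose_code.expand(x.shape[0], -1)                          # (N, D)
    residual = delta_mlp(torch.cat([x, pose], dim=-1))               # (N, J)

    # Final skinning weights: Gaussian term plus residual, normalized per point
    return F.softmax(-mdist + residual, dim=-1)                      # (N, J)
```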
A.2 Loss Functions
Optical flow loss. We render 2D flow to compute the optical flow loss. Specifically, we deform the canonical points to another time \(t^{\prime }\) and obtain their 2D re-projections,
where \(\textbf{P}^{t^{\prime }}\) is the projection matrix of a pinhole camera. Then we can compute the 2D flow,
and the optical flow loss \(\mathcal {L}_{of}\) is defined as
where \(\widetilde{\textbf{f}}\) is the observed optical flow extracted from an off-the-shelf method (Yang & Ramanan, 2019).
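As a rough illustration of this loss, the sketch below re-projects the same canonical points at times t and t′ and penalizes the difference from the observed flow. It is a simplified per-point version that omits the volume-rendering aggregation used to actually render the 2D flow, and all function names are placeholders.

```python
import torch

def optical_flow_loss(x_canonical, deform_to_t, deform_to_tp, P_t, P_tp, flow_obs):
    """Simplified sketch of the flow loss (placeholder interfaces, no volume rendering).

    deform_to_t / deform_to_tp: canonical -> observation warps at times t and t'
    P_t, P_tp: (3, 4) pinhole projection matrices
    flow_obs:  (N, 2) observed optical flow from an off-the-shelf estimator
    """
    def project(X, P):
        Xh = torch.cat([X, torch.ones_like(X[:, :1])], dim=-1)  # homogeneous coordinates
        uvw = Xh @ P.T
        return uvw[:, :2] / uvw[:, 2:3]                          # perspective divide

    uv_t  = project(deform_to_t(x_canonical),  P_t)    # 2D re-projection at time t
    uv_tp = project(deform_to_tp(x_canonical), P_tp)   # 2D re-projection at time t'
    flow_pred = uv_tp - uv_t                            # predicted 2D flow
    return ((flow_pred - flow_obs) ** 2).sum(-1).mean()
```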
3D cycle consistency loss. Similar to Li et al. (2021b) and Yang et al. (2022), we introduce a 3D cycle consistency loss to learn better deformations. We deform the sampled points in the observation space to the canonical space and then deform them back to their original coordinates,
where \(\tau _{k}\) weights the sampled points so that points closer to the surface receive stronger regularization.
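A minimal sketch of this cycle loss, assuming generic backward (observation-to-canonical) and forward (canonical-to-observation) warps and precomputed per-point weights:

```python
import torch

def cycle_consistency_loss(x_obs, warp_o2c, warp_c2o, tau):
    """Sketch of the 3D cycle consistency loss (warp interfaces are assumed).

    x_obs:    (N, 3) sampled points in the observation space at time t
    warp_o2c: observation -> canonical deformation at time t
    warp_c2o: canonical -> observation deformation at time t
    tau:      (N,) weights that are larger for points closer to the surface
    """
    x_cycled = warp_c2o(warp_o2c(x_obs))                        # warp forth and back
    return (tau * ((x_cycled - x_obs) ** 2).sum(-1)).mean()     # weighted squared error
```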
Eikonal loss. Following Yang et al. (2022, 2023), we also adopt the implicit geometric regularization term (Gropp et al., 2020) as:
We observe that the eikonal loss often leads to instability and training failures. Following RAC (Yang et al., 2023), we address this in the first stage by forcing the norm of the first-order derivative of the signed distance d to be close to its mean norm rather than to 1, which helps stabilize training. In subsequent stages, the target is changed from the mean norm to 1, ultimately contributing to a smoother geometry.
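The staged regularizer can be sketched as follows; the `sdf_net` interface is an assumption, and the detached mean gradient norm is used as the first-stage target as described above.

```python
import torch

def eikonal_loss(sdf_net, x, target=None):
    """Sketch of the staged eikonal regularization (assumed interface).

    Stage 1: target = mean gradient norm (detached), which stabilizes training.
    Later stages: target = 1.0, the standard eikonal constraint.
    """
    x = x.requires_grad_(True)
    d = sdf_net(x)                                                 # signed distances
    grad = torch.autograd.grad(d.sum(), x, create_graph=True)[0]  # gradient of d w.r.t. x
    norm = grad.norm(dim=-1)
    if target is None:                                             # first training stage
        target = norm.mean().detach()
    return ((norm - target) ** 2).mean()
```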
A.3 Implementation Details
Training strategy. The optimization of MoDA proceeds in three stages. First, we optimize all losses and parameters; at this stage, MoDA already reconstructs good shape and deformation. We then refine the articulated motions, updating only the parameters related to the deformation model while keeping the shape parameters fixed. Finally, we improve the details of the reconstructions through importance sampling while freezing the camera poses. The design of the MLP networks in MoDA is similar to BANMo (Yang et al., 2022). The hyperparameters \(\gamma \) and \(\lambda \) in the texture filtering function are set to 1.5 and 10, respectively (see Fig. 10 for the function). Our code will be available on GitHub once the paper is accepted.
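A hedged sketch of how such a three-stage schedule could be set up in PyTorch; the parameter-group accessors (`shape_params`, `camera_params`), the optimizer choice, and the learning rate are placeholders rather than the actual training code.

```python
import torch

def configure_stage(model, stage):
    """Illustrative three-stage schedule (parameter-group accessors are assumed)."""
    for p in model.parameters():
        p.requires_grad_(True)               # stage 1: optimize all losses and parameters
    if stage == 2:                           # stage 2: refine articulated motion only
        for p in model.shape_params():       # assumed accessor for shape parameters
            p.requires_grad_(False)
    elif stage == 3:                         # stage 3: refine details, cameras frozen
        for p in model.camera_params():      # assumed accessor for camera poses
            p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=5e-4)   # learning rate is a placeholder
```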
Sampling details for 2D–3D matching. In optimal transport, the sampling of 3D points is similar to BANMo (Yang et al., 2022). We establish a canonical 3D grid \(\textbf{V}^{*} \in \mathbb {R}^{20\times 20\times 20}\) (\(N_{point}= 8000\)) to build correspondences between pixels and canonical points. The grid is centered at the origin, axis-aligned with bounds \([x_{min}, x_{max}], [y_{min}, y_{max}]\), and \([z_{min}, z_{max}]\), and is iteratively refined during optimization: every 200 iterations, we update the bounds of the grid by approximating the object's bounds, which we obtain by applying marching cubes on a \(64^{3}\) grid to extract a surface mesh.
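A minimal sketch of the grid construction and the periodic bounds update, assuming the bounds are stored as per-axis (min, max) pairs; the padding margin is an illustrative placeholder.

```python
import numpy as np

def build_canonical_grid(bounds, res=20):
    """Axis-aligned canonical grid V* with res^3 points (8000 for res=20)."""
    axes = [np.linspace(lo, hi, res) for lo, hi in bounds]
    xs, ys, zs = np.meshgrid(*axes, indexing='ij')
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)    # (res^3, 3)

def update_bounds(mesh_vertices, margin=0.05):
    """Approximate object bounds from a marching-cubes mesh (run every 200 iterations)."""
    lo, hi = mesh_vertices.min(axis=0), mesh_vertices.max(axis=0)
    pad = margin * (hi - lo)                                  # small padding (assumed)
    return tuple(zip(lo - pad, hi + pad))                     # ((x_min, x_max), ...)
```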
Qualitative comparison of different methods on casual-human and casual-adult (multiple-video setups). We show 2 views of the reconstruction results based on the reference images. ViSER (Yang et al., 2021) fails to learn detailed 3D shapes and accurate poses from the videos. BANMo (Yang et al., 2022) has obvious skin-collapsing artifacts (in the red circles) for motions with large joint rotations while our method performs well. The rest poses of casual-human and casual-adult are shown in the lower right corner
A.4 Dataset
We use 6 datasets in this work, where casual-cat, casual-human, eagle, and hands were collected by BANMo (Yang et al., 2022); we have obtained permission to use these datasets. The AMA dataset is a published dataset collected by Vlasic et al. (2008). We have also obtained consent for the use of casual-adult.
Appendix B: More Experimental Results
In this section, we show more experimental results.
Correspondence. In Fig. 11, we illustrate the correspondence between different videos in both the casual-human and casual-cat datasets, with distinct colors indicating corresponding regions. In addition, to compare the matching results of optimal transport with the soft argmax method used in BANMo (Yang et al., 2022), we visualize the 2D–2D matching results in Fig. 12. Optimal transport tends to produce more accurate matches; mismatches produced by soft argmax are highlighted with red boxes.
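To make the contrast concrete, the sketch below turns a pixel-to-point similarity matrix into a matching in two ways: a per-pixel soft argmax (each row normalized independently) and entropy-regularized optimal transport solved with Sinkhorn normalization (Sinkhorn, 1967), which couples all rows and columns. The cost, marginals, and regularization weight are illustrative assumptions, not the exact formulation used in MoDA.

```python
import torch

def soft_argmax_match(sim):
    """Per-pixel soft argmax: each pixel's row is normalized independently."""
    return torch.softmax(sim, dim=-1)                 # (num_pixels, num_points)

def sinkhorn_match(sim, n_iters=50, eps=0.05):
    """Entropy-regularized OT matching via Sinkhorn iterations (uniform marginals assumed)."""
    K = torch.exp(sim / eps)                          # (num_pixels, num_points) kernel
    u = torch.full((K.shape[0],), 1.0 / K.shape[0], device=K.device)   # row marginals
    v = torch.full((K.shape[1],), 1.0 / K.shape[1], device=K.device)   # column marginals
    a, b = torch.ones_like(u), torch.ones_like(v)
    for _ in range(n_iters):                          # alternating row/column scaling
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]                # doubly-normalized transport plan
```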
Texture rendering results. Here, we compare the texture rendering results of BANMo (Yang et al., 2022) and our method on casual-cat. As shown in Fig. 13, we also provide a screenshot of the example on BANMo's website. The result of BANMo is consistent with the example on their website; it appears unrealistic and is potentially influenced by noise. In contrast, our method produces results that more closely resemble the reference image.
Novel view synthesis. As the results reported in Figs. 9 and 13 are all rendered from textured meshes, we also test the influence of texture filtering on volume rendering under novel views. We find that texture filtering is beneficial here as well: as shown in Fig. 14, the results with texture filtering exhibit less noise than those without it.
More results on multiple-video setups. We also show more results comparing MoDA with BANMo (Yang et al., 2022) and ViSER (Yang et al., 2021) on casual-human and casual-adult in Fig. 15.
Appendix C: Training Time
We compare the training times of our method and BANMo (Yang et al., 2022) on different datasets. We train the models on two RTX 3090 GPUs. As shown in Table 4, both MoDA and BANMo train quickly on the different datasets: BANMo takes around 8–10 h, and MoDA takes about one hour more. MoDA requires more computational time than BANMo for two main reasons. First, the DBS employed in MoDA is inherently more time-consuming than linear blend skinning (LBS), as reported in Fig. 18 of Kavan et al. (2008). Second, the optimization process for solving the optimal transport problem adds computational overhead. However, the increased training time is acceptable considering MoDA's superior performance.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Song, C., Wei, J., Chen, T. et al. MoDA: Modeling Deformable 3D Objects from Casual Videos. Int J Comput Vis 133, 2825–2844 (2025). https://doi.org/10.1007/s11263-024-02310-5