Abstract
In this paper, we focus on the challenges of modeling deformable 3D objects from casual videos. With the popularity of NeRF, many works extend it to dynamic scenes with a canonical NeRF and a deformation model that transforms 3D points between the observation space and the canonical space. Recent works rely on linear blend skinning (LBS) to achieve this canonical-observation transformation. However, a linearly weighted combination of rigid transformation matrices is not guaranteed to be rigid; in practice, unexpected scale and shear factors appear, and using LBS as the deformation model often leads to skin-collapsing artifacts for bending or twisting motions. To solve this problem, we propose neural dual quaternion blend skinning (NeuDBS) for 3D point deformation, which performs rigid transformation without skin-collapsing artifacts. To register 2D pixels across different frames, we establish a correspondence between canonical feature embeddings that encode 3D points within the canonical space and 2D image features by solving an optimal transport problem. In addition, we introduce a texture filtering approach for texture rendering that effectively minimizes the impact of noisy colors outside the target deformable objects.
Availability of Data and Materials
Code Availability
Will be available upon acceptance.
References
Badger, M., Wang, Y., Modh, A., Perkes, A., Kolotouros, N., Pfrommer, B. G., Schmidt, M. F., & Daniilidis, K. (2020). 3D bird reconstruction: A dataset, model, and shape recovery from a single view. In European Conference on Computer Vision (pp. 1–17).
Biggs, B., Boyne, O., Charles, J., Fitzgibbon, A., & Cipolla, R. (2020). Who left the dogs out? 3D animal reconstruction with expectation maximization in the loop. In European Conference on Computer Vision (pp. 195–211).
Chen, A., Xu, Z., Zhao, F., Zhang, X., Xiang, F., Yu, J., & Su, H. (2021). MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 14124–14133).
Chen, C., Yang, X., Yang, F., Feng, C., Fu, Z., Foo, C.S., Lin, G., & Liu, F. (2024). Sculpt3D: Multi-view consistent text-to-3D generation with sparse 3D prior. arXiv Preprint arXiv:2403.09140.
Fan, H., Su, H., & Guibas, L. J. (2017). A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 605–613).
Goel, S., Kanazawa, A., & Malik, J. (2020). Shape and viewpoint without keypoints. In European Conference on Computer Vision (pp. 88–104).
Gropp, A., Yariv, L., Haim, N., Atzmon, M., & Lipman, Y. (2020). Implicit geometric regularization for learning shapes. arXiv Preprint arXiv:2002.10099.
Hejl, J. (2004). Hardware skinning with quaternions. In A. Kirmse (Ed.), Game Programming Gems 4 (pp. 487–495). Charles River Media.
Henzler, P., Reizenstein, J., Labatut, P., Shapovalov, R., Ritschel, T., Vedaldi, A., & Novotny, D. (2021). Unsupervised learning of 3D object categories from videos in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4700–4709).
Jacobson, A., Deng, Z., Kavan, L., & Lewis, J. P. (2014). Skinning: Real-time shape deformation. In ACM SIGGRAPH Courses (p. 1).
Jiang, W., Yi, K. M., Samei, G., Tuzel, O., & Ranjan, A. (2022). NeuMan: Neural human radiance field from a single video. arXiv Preprint arXiv:2203.12575.
Kanazawa, A., Tulsiani, S., Efros, A. A., & Malik, J. (2018). Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 371–386).
Kavan, L., Collins, S., Žára, J., & O’Sullivan, C. (2007). Skinning with dual quaternions. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games (pp. 39–46).
Kavan, L., Collins, S., Žára, J., & O’Sullivan, C. (2008). Geometric skinning with approximate dual quaternion blending. ACM Transactions on Graphics (TOG), 27(4), 1–23.
Kirillov, A., Wu, Y., He, K., & Girshick, R. (2020). PointRend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9799–9808).
Kocabas, M., Athanasiou, N., & Black, M. J. (2020). Vibe: Video inference for human body pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5253–5263).
Kulkarni, N., Gupta, A., Fouhey, D. F., & Tulsiani, S. (2020). Articulation-aware canonical surface mapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 452–461).
Lewis, J. P., Cordner, M., & Fong, N. (2000). Pose space deformation: A unified approach to shape interpolation and skeleton-driven deformation. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (pp. 165–172).
Li, R., Lin, G., & Xie, L. (2021). Self-point-flow: Self-supervised scene flow estimation from point clouds with optimal transport and random walk. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 15577–15586).
Li, X., Liu, S., Kim, K., Mello, S. D., Jampani, V., Yang, M. H., & Kautz, J. (2020). Self-supervised single-view 3d reconstruction via semantic consistency. In European Conference on Computer Vision (pp. 677–693).
Li, Z., Niklaus, S., Snavely, N., & Wang, O. (2021). Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6498–6508).
Li, R., Tanke, J., Vo, M., Zollhöfer, M., Gall, J., Kanazawa, A., & Lassner, C. (2022). TAVA: Template-free animatable volumetric actors. In European Conference on Computer Vision (pp. 419–436).
Li, X., Liu, S., De Mello, S., Kim, K., Wang, X., Yang, M. H., & Kautz, J. (2020). Online adaptation for consistent mesh reconstruction in the wild. Advances in Neural Information Processing Systems, 33, 15009–15019.
Li, L., Shen, Z., Shen, L., Tan, P., et al. (2022). Streaming radiance fields for 3d video synthesis. Advances in Neural Information Processing Systems, 35, 13485–13498.
Liu, L., Habermann, M., Rudnev, V., Sarkar, K., Gu, J., & Theobalt, C. (2021). Neural actor: Neural free-view synthesis of human actors with pose control. ACM Transactions on Graphics (TOG), 40(6), 1–16.
Liu, W., Zhang, C., Ding, H., Hung, T. Y., & Lin, G. (2022). Few-shot segmentation with optimal transport matching and message flow. IEEE Transactions on Multimedia, 25, 5130–5141.
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., & Black, M. J. (2015). SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6), 1–16.
Mahmood, N., Ghorbani, N., Troje, N. F., Pons-Moll, G., & Black, M. J. (2019). AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5442–5451).
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2021). NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1), 99–106.
Neverova, N., Sanakoyeu, A., Labatut, P., Novotny, D., & Vedaldi, A. (2021). Discovering relationships between object categories via universal canonical maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 404–413).
Neverova, N., Novotny, D., Szafraniec, M., Khalidov, V., Labatut, P., & Vedaldi, A. (2020). Continuous surface embeddings. Advances in Neural Information Processing Systems, 33, 17258–17270.
Noguchi, A., Sun, X., Lin, S., & Harada, T. (2021). Neural articulated radiance field. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5762–5772).
Novotny, D., Larlus, D., & Vedaldi, A. (2017). Learning 3D object categories by looking around them. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5218–5227).
Oechsle, M., Peng, S., & Geiger, A. (2021). Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5589–5599).
Park, K., Sinha, U., Barron, J. T., Bouaziz, S., Goldman, D. B., Seitz, S. M., & Martin-Brualla, R. (2021a). Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5865–5874).
Park, K., Sinha, U., Hedman, P., Barron, J. T., Bouaziz, S., Goldman, D. B., Martin-Brualla, R., & Seitz, S. M. (2021b). HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields. ACM Transactions on Graphics (TOG), 40(6), 1–12.
Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A. A., Tzionas, D., & Black, M. J. (2019). Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10975–10985).
Peng, S., Dong, J., Wang, Q., Zhang, S., Shuai, Q., Zhou, X., & Bao, H. (2021). Animatable neural radiance fields for modeling dynamic human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 14314–14323).
Peng, S., Zhang, Y., Xu, Y., Wang, Q., Shuai, Q., Bao, H., & Zhou, X. (2021). Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9054–9063).
Pumarola, A., Corona, E., Pons-Moll, G., & Moreno-Noguer, F. (2021). D-NeRF: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10318–10327).
Puy, G., Boulch, A., & Marlet, R. (2020). FLOT: Scene flow on point clouds guided by optimal transport. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII (pp. 527–544).
Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., & Li, H. (2019). PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2304–2314).
Saito, S., Simon, T., Saragih, J., & Joo, H. (2020). PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 84–93).
Shi, Y., Rong, D., Ni, B., Chen, C., & Zhang, W. (2022). GARF: Geometry-aware generalized neural radiance field. arXiv Preprint arXiv:2212.02280.
Shi, H., Wei, J., Li, R., Liu, F., & Lin, G. (2022). Weakly supervised segmentation on outdoor 4d point clouds with temporal matching and spatial graph propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 11840–11849).
Shi, Y., Xiong, Y., Ni, B., & Zhang, W. (2023). USR: Unsupervised separated 3d garment and human reconstruction via geometry and semantic consistency. arXiv Preprint arXiv:2302.10518.
Sinkhorn, R. (1967). Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4), 402–405.
Song, L., Chen, A., Li, Z., Chen, Z., Chen, L., Yuan, J., Xu, Y., & Geiger, A. (2022). NeRFPlayer: A streamable dynamic scene representation with decomposed neural radiance fields. arXiv Preprint arXiv:2210.15947.
Song, C., Wei, J., Li, R., Liu, F., & Lin, G. (2021). 3D pose transfer with correspondence learning and mesh refinement. Advances in Neural Information Processing Systems, 34, 3108–3120.
Song, C., Wei, J., Li, R., Liu, F., & Lin, G. (2023). Unsupervised 3D pose transfer with cross consistency and dual reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8), 1–13. https://doi.org/10.1109/TPAMI.2023.3259059
Su, S. Y., Yu, F., Zollhöfer, M., & Rhodin, H. (2021). A-nerf: Articulated neural radiance fields for learning human shape, appearance, and pose. Advances in Neural Information Processing Systems, 34, 12278–12291.
Tretschk, E., Tewari, A., Golyanik, V., Zollhöfer, M., Lassner, C., & Theobalt, C. (2021). Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 12959–12970).
Vlasic, D., Baran, I., Matusik, W., & Popović, J. (2008). Articulated mesh animation from multi-view silhouettes. In ACM SIGGRAPH 2008 Papers (pp. 1–9).
Vo, M. P., Sheikh, Y. A., & Narasimhan, S. G. (2020). Spatiotemporal bundle adjustment for dynamic 3D human reconstruction in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2), 1066–1080.
Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., & Wang, W. (2021). NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv Preprint arXiv:2106.10689.
Wei, J., Wang, H., Feng, J., Lin, G., & Yap, K. H. (2023). TAPS3D: Text-guided 3d textured shape generation from pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 16805–16815).
Weng, C. Y., Curless, B., Srinivasan, P. P., Barron, J. T., & Kemelmacher-Shlizerman, I. (2022). HumanNerf: Free-viewpoint rendering of moving people from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16210–16220).
Wu, S., Jakab, T., Rupprecht, C., & Vedaldi, A. (2021). DOVE: Learning deformable 3D objects by watching videos. arXiv Preprint arXiv:2107.10844.
Wu, Q., Liu, X., Chen, Y., Li, K., Zheng, C., Cai, J., & Zheng, J. (2022). Object-compositional neural implicit surfaces. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII (pp. 197–213).
Xiang, D., Joo, H., & Sheikh, Y. (2019). Monocular total capture: Posing face, body, and hands in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10965–10974).
Yang, F., & Lin, G. (2021). CT-Net: Complementary transfering network for garment transfer with arbitrary geometric changes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 9899–9908).
Yang, G., & Ramanan, D. (2019). Volumetric correspondence networks for optical flow. Advances in Neural Information Processing Systems, 32.
Yang, F., Chen, T., He, X., Cai, Z., Yang, L., Wu, S., & Lin, G. (2023). Attrihuman-3D: Editable 3D human avatar generation with attribute decomposition and indexing. arXiv Preprint arXiv:2312.02209.
Yang, G., Sun, D., Jampani, V., Vlasic, D., Cole, F., Chang, H., Ramanan, D., Freeman, W. T., & Liu, C. (2021). LASR: Learning articulated shape reconstruction from a monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15980–15989).
Yang, G., Vo, M., Neverova, N., Ramanan, D., Vedaldi, A., & Joo, H. (2022). BANMo: Building animatable 3d neural models from many casual videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2863–2873).
Yang, G., Wang, C., Reddy, N. D., & Ramanan, D. (2023). Reconstructing animatable categories from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16995–17005).
Yang, G., Sun, D., Jampani, V., Vlasic, D., Cole, F., Liu, C., & Ramanan, D. (2021). Viser: Video-specific surface embeddings for articulated 3d shape reconstruction. Advances in Neural Information Processing Systems, 34, 19326–19338.
Yariv, L., Gu, J., Kasten, Y., & Lipman, Y. (2021). Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems, 34, 4805–4815.
Yariv, L., Kasten, Y., Moran, D., Galun, M., Atzmon, M., Ronen, B., & Lipman, Y. (2020). Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2492–2502.
Ye, Y., Tulsiani, S., & Gupta, A. (2021). Shelf-supervised mesh prediction in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8843–8852).
Yu, A., Ye, V., Tancik, M., & Kanazawa, A. (2021). pixelNeRF: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4578–4587).
Zhang, J., Yang, G., Tulsiani, S., & Ramanan, D. (2021). NERS: Neural reflectance surfaces for sparse-view 3d reconstruction in the wild. Advances in Neural Information Processing Systems, 34, 29835–29847.
Zhi, S., Laidlow, T., Leutenegger, S., & Davison, A. J. (2021). In-place scene labelling and understanding with implicit scene representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15838–15847).
Zuffi, S., Kanazawa, A., & Black, M. J. (2018). Lions and tigers and bears: Capturing non-rigid, 3d, articulated shape from images. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 3955–3963).
Zuffi, S., Kanazawa, A., Berger-Wolf, T., & Black, M. J. (2019). Three-D safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5359–5368).
Zuffi, S., Kanazawa, A., Jacobs, D. W., & Black, M. J. (2017). 3D menagerie: Modeling the 3D shape and pose of animals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6365–6373).
Funding
This research is supported by the Agency for Science, Technology and Research (A*STAR) under its MTC Programmatic Funds (Grant No. M23L7b0021).
Author information
Authors and Affiliations
Contributions
Conceptualization: [Chaoyue Song]; Methodology: [Chaoyue Song]; Formal analysis and investigation: [Chaoyue Song]; Experiments: [Chaoyue Song], [Tianyi Chen], [Yiwen Chen]; Writing—original draft preparation: [Chaoyue Song]; Discussion: [Jiacheng Wei], [Chuan-Sheng Foo], [Fayao Liu], [Guosheng Lin]; Supervision: [Fayao Liu], [Guosheng Lin].
Corresponding author
Ethics declarations
Conflict of interest
Author Chaoyue Song has received research support from A*STAR. Authors Fayao Liu and Chuan-Sheng Foo receive salaries from A*STAR. Author Guosheng Lin receives a salary from Nanyang Technological University.
Ethical Approval
Not applicable.
Consent to Participate
Yes.
Consent for Publication
Yes.
Additional information
Communicated by Xiaowei Zhou.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file 1 (mp4 23599 KB)
Supplementary file 2 (mp4 6341 KB)
Supplementary file 3 (mp4 15116 KB)
Appendices
Appendix A: More Details of MoDA
A.1 Skinning Weights for NeuDBS
We define the skinning weights for NeuDBS as \(\textbf{W} = \{W_{1},..., W_{J}\} \in \mathbb {R}^{J}\), where J is the number of joints. Skinning weights learned purely by a neural network are difficult to optimize. To obtain the skinning weights for the proposed NeuDBS, we therefore first compute Gaussian skinning weights and then learn residual skinning weights with an MLP network, following Yang et al. (2022).
First, we compute the Gaussian skinning weights based on the Mahalanobis distance between 3D points and the Gaussian ellipsoids,
where \(\textbf{O} \in \mathbb {R}^{J \times 3}\) are the joint center locations, \(\textbf{V} \in \mathbb {R}^{J \times 3 \times 3}\) are joint orientations, and \(\varvec{\Lambda }^{0} \in \mathbb {R}^{J \times 3 \times 3}\) are diagonal scale matrices. Each joint is represented by an explicit 3D Gaussian ellipsoid composed of these three elements: center, orientation, and scale. To learn better skinning weights for 3D deformation, we predict residual skinning weights with an MLP network,
then we have the final skinning weights,
To be specific, the skinning weights \(\textbf{W}^{t}_{o\rightarrow c}\) are learned from 3D points in the observation space and the body pose code \(\varvec{\psi }_{b}^{t}\) at time t, and \(\textbf{W}^{t}_{c\rightarrow o}\) are learned from 3D points in the canonical space and the rest pose code \(\varvec{\psi }_{b}^{*}\).
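The pipeline described above can be sketched as follows in PyTorch. This is a minimal illustration, not the released implementation: the tensor shapes, the `delta_mlp` interface, and the softmax normalization of the final weights are our assumptions.

```python
import torch
import torch.nn.functional as F

def neudbs_skinning_weights(x, centers, orients, scales, delta_mlp, pose_code):
    """Sketch of NeuDBS skinning weights (shapes and interfaces are assumed).

    x:         (N, 3)     query points (observation or canonical space)
    centers:   (J, 3)     Gaussian ellipsoid centers O
    orients:   (J, 3, 3)  joint orientations V
    scales:    (J, 3, 3)  diagonal scale matrices Lambda^0
    delta_mlp: MLP predicting (N, J) residual weights from points and a pose code
    pose_code: (D,)       body pose code psi_b^t (or rest pose code psi_b^*)
    """
    # Mahalanobis distance between each 3D point and each Gaussian ellipsoid
    diff = x[:, None, :] - centers[None, :, :]                       # (N, J, 3)
    local = torch.einsum('jab,njb->nja', orients, diff)              # rotate into joint frames
    mdist = torch.einsum('nja,jab,njb->nj', local, scales, local)    # (N, J)

    # Residual skinning weights predicted by an MLP from the point and pose code
    pose = pose_code.expand(x.shape[0], -1)                          # (N, D)
    residual = delta_mlp(torch.cat([x, pose], dim=-1))               # (N, J)

    # Final skinning weights: Gaussian term plus residual, normalized per point
    return F.softmax(-mdist + residual, dim=-1)                      # (N, J)
```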
A.2 Loss Functions
Optical flow loss. We render 2D flow to compute the optical flow loss. Specifically, we deform the canonical points to another time \(t^{\prime }\) and obtain their 2D re-projections,
where \(\textbf{P}^{t^{\prime }}\) is the projection matrix of a pinhole camera. Then we can compute the 2D flow,
and the optical flow loss \(\mathcal {L}_{of}\) is defined as
where \(\widetilde{\textbf{f}}\) is the observed optical flow extracted from an off-the-shelf method (Yang & Ramanan, 2019).
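As a rough illustration of this loss, the sketch below re-projects the same canonical points at times t and t′ and penalizes the difference from the observed flow. It is a simplified per-point version that omits the volume-rendering aggregation used to actually render the 2D flow, and all function names are placeholders.

```python
import torch

def optical_flow_loss(x_canonical, deform_to_t, deform_to_tp, P_t, P_tp, flow_obs):
    """Simplified sketch of the flow loss (placeholder interfaces, no volume rendering).

    deform_to_t / deform_to_tp: canonical -> observation warps at times t and t'
    P_t, P_tp: (3, 4) pinhole projection matrices
    flow_obs:  (N, 2) observed optical flow from an off-the-shelf estimator
    """
    def project(X, P):
        Xh = torch.cat([X, torch.ones_like(X[:, :1])], dim=-1)  # homogeneous coordinates
        uvw = Xh @ P.T
        return uvw[:, :2] / uvw[:, 2:3]                          # perspective divide

    uv_t  = project(deform_to_t(x_canonical),  P_t)    # 2D re-projection at time t
    uv_tp = project(deform_to_tp(x_canonical), P_tp)   # 2D re-projection at time t'
    flow_pred = uv_tp - uv_t                            # predicted 2D flow
    return ((flow_pred - flow_obs) ** 2).sum(-1).mean()
```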
3D cycle consistency loss. Similar to Li et al. (2021b) and Yang et al. (2022), we introduce a 3D cycle consistency loss to learn better deformations. We deform the sampled points in the observation space to the canonical space and then deform them back to their original coordinates,
where \(\tau _{k}\) weights the sampled points so that points closer to the surface receive stronger regularization.
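A minimal sketch of this cycle loss, assuming generic backward (observation-to-canonical) and forward (canonical-to-observation) warps and precomputed per-point weights:

```python
import torch

def cycle_consistency_loss(x_obs, warp_o2c, warp_c2o, tau):
    """Sketch of the 3D cycle consistency loss (warp interfaces are assumed).

    x_obs:    (N, 3) sampled points in the observation space at time t
    warp_o2c: observation -> canonical deformation at time t
    warp_c2o: canonical -> observation deformation at time t
    tau:      (N,) weights that are larger for points closer to the surface
    """
    x_cycled = warp_c2o(warp_o2c(x_obs))                        # warp forth and back
    return (tau * ((x_cycled - x_obs) ** 2).sum(-1)).mean()     # weighted squared error
```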
Eikonal loss. Following Yang et al. (2022, 2023), we also adopt the implicit geometric regularization term (Gropp et al., 2020) as:
We observe that the eikonal loss often leads to instability and training failures. Following RAC (Yang et al., 2023), we address this in the first stage by forcing the norm of the first-order derivative of the signed distance d to be close to its mean norm rather than to 1, which helps stabilize training. In subsequent stages, the target is changed from the mean norm to 1, ultimately contributing to a smoother geometry.
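The staged regularizer can be sketched as follows; the `sdf_net` interface is an assumption, and the detached mean gradient norm is used as the first-stage target as described above.

```python
import torch

def eikonal_loss(sdf_net, x, target=None):
    """Sketch of the staged eikonal regularization (assumed interface).

    Stage 1: target = mean gradient norm (detached), which stabilizes training.
    Later stages: target = 1.0, the standard eikonal constraint.
    """
    x = x.requires_grad_(True)
    d = sdf_net(x)                                                 # signed distances
    grad = torch.autograd.grad(d.sum(), x, create_graph=True)[0]  # gradient of d w.r.t. x
    norm = grad.norm(dim=-1)
    if target is None:                                             # first training stage
        target = norm.mean().detach()
    return ((norm - target) ** 2).mean()
```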
A.3 Implementation Details
Training strategy. The optimization of MoDA proceeds in three stages. First, we optimize all losses and parameters; at this stage, MoDA already reconstructs good shape and deformation. We then refine the articulated motions, updating only the parameters related to the deformation model while keeping the shape parameters fixed. Finally, we improve the details of the reconstructions through importance sampling while freezing the camera poses. The design of the MLP networks in MoDA is similar to BANMo (Yang et al., 2022). The hyperparameters \(\gamma \) and \(\lambda \) in the texture filtering function are set to 1.5 and 10, respectively (see Fig. 10 for the function). Our code will be available on GitHub once the paper is accepted.
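A hedged sketch of how such a three-stage schedule could be set up in PyTorch; the parameter-group accessors (`shape_params`, `camera_params`), the optimizer choice, and the learning rate are placeholders rather than the actual training code.

```python
import torch

def configure_stage(model, stage):
    """Illustrative three-stage schedule (parameter-group accessors are assumed)."""
    for p in model.parameters():
        p.requires_grad_(True)               # stage 1: optimize all losses and parameters
    if stage == 2:                           # stage 2: refine articulated motion only
        for p in model.shape_params():       # assumed accessor for shape parameters
            p.requires_grad_(False)
    elif stage == 3:                         # stage 3: refine details, cameras frozen
        for p in model.camera_params():      # assumed accessor for camera poses
            p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=5e-4)   # learning rate is a placeholder
```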
Sampling details for 2D–3D matching. In optimal transport, the sampling of 3D points is similar to BANMo (Yang et al., 2022). We establish a canonical 3D grid \(\textbf{V}^{*} \in \mathbb {R}^{20\times 20\times 20}\) (\(N_{point}= 8000\)) to build correspondences between pixels and canonical points. The grid is centered at the origin, axis-aligned with bounds \([x_{min}, x_{max}], [y_{min}, y_{max}]\), and \([z_{min}, z_{max}]\), and is iteratively refined during optimization: every 200 iterations, we update the bounds of the grid by approximating the object's bounds, which we obtain by applying marching cubes on a \(64^{3}\) grid to extract a surface mesh.
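A minimal sketch of the grid construction and the periodic bounds update, assuming the bounds are stored as per-axis (min, max) pairs; the padding margin is an illustrative placeholder.

```python
import numpy as np

def build_canonical_grid(bounds, res=20):
    """Axis-aligned canonical grid V* with res^3 points (8000 for res=20)."""
    axes = [np.linspace(lo, hi, res) for lo, hi in bounds]
    xs, ys, zs = np.meshgrid(*axes, indexing='ij')
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)    # (res^3, 3)

def update_bounds(mesh_vertices, margin=0.05):
    """Approximate object bounds from a marching-cubes mesh (run every 200 iterations)."""
    lo, hi = mesh_vertices.min(axis=0), mesh_vertices.max(axis=0)
    pad = margin * (hi - lo)                                  # small padding (assumed)
    return tuple(zip(lo - pad, hi + pad))                     # ((x_min, x_max), ...)
```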
Qualitative comparison of different methods on casual-human and casual-adult (multiple-video setups). We show 2 views of the reconstruction results based on the reference images. ViSER (Yang et al., 2021) fails to learn detailed 3D shapes and accurate poses from the videos. BANMo (Yang et al., 2022) has obvious skin-collapsing artifacts (in the red circles) for motions with large joint rotations while our method performs well. The rest poses of casual-human and casual-adult are shown in the lower right corner
A.4 Dataset
We use 6 datasets in this work, where casual-cat, casual-human, eagle, and hands were collected by BANMo (Yang et al., 2022); we have obtained permission to use these datasets. The AMA dataset is a published dataset collected by Vlasic et al. (2008). We have also obtained consent for the use of casual-adult.
Appendix B: More Experimental Results
In this section, we show more experimental results.
Correspondence. In Fig. 11, we illustrate the correspondence between different videos in both the casual-human and casual-cat datasets, with distinct colors indicating corresponding regions. In addition, to compare the matching results of optimal transport with the soft argmax method used in BANMo (Yang et al., 2022), we visualize the 2D–2D matching results in Fig. 12. Optimal transport tends to produce more accurate matches; mismatches produced by soft argmax are highlighted with red boxes.
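To make the contrast concrete, the sketch below turns a pixel-to-point similarity matrix into a matching in two ways: a per-pixel soft argmax (each row normalized independently) and entropy-regularized optimal transport solved with Sinkhorn normalization (Sinkhorn, 1967), which couples all rows and columns. The cost, marginals, and regularization weight are illustrative assumptions, not the exact formulation used in MoDA.

```python
import torch

def soft_argmax_match(sim):
    """Per-pixel soft argmax: each pixel's row is normalized independently."""
    return torch.softmax(sim, dim=-1)                 # (num_pixels, num_points)

def sinkhorn_match(sim, n_iters=50, eps=0.05):
    """Entropy-regularized OT matching via Sinkhorn iterations (uniform marginals assumed)."""
    K = torch.exp(sim / eps)                          # (num_pixels, num_points) kernel
    u = torch.full((K.shape[0],), 1.0 / K.shape[0], device=K.device)   # row marginals
    v = torch.full((K.shape[1],), 1.0 / K.shape[1], device=K.device)   # column marginals
    a, b = torch.ones_like(u), torch.ones_like(v)
    for _ in range(n_iters):                          # alternating row/column scaling
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]                # doubly-normalized transport plan
```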
Texture rendering results. Here, we compare the texture rendering results of BANMo (Yang et al., 2022) and our method on casual-cat. As shown in Fig. 13, we also provide a screenshot of the example on BANMo's website. The result of BANMo is consistent with the example on their website; it appears unrealistic and is potentially influenced by noise. In contrast, our method produces results that more closely resemble the reference image.
Novel view synthesis. As the results reported in Figs. 9 and 13 are all rendered from textured meshes, we also test the influence of texture filtering on volume rendering under novel views. We find that texture filtering is beneficial here as well: as shown in Fig. 14, the results with texture filtering exhibit less noise than those without it.
More results on multiple-video setups. We also show more results comparing MoDA with BANMo (Yang et al., 2022) and ViSER (Yang et al., 2021) on casual-human and casual-adult in Fig. 15.
Appendix C: Training Time
We compare the training times of our method and BANMo (Yang et al., 2022) on different datasets. We train the models on two RTX 3090 GPUs. As shown in Table 4, both MoDA and BANMo train quickly on the different datasets: BANMo takes around 8–10 h, and MoDA takes about one hour more. MoDA requires more computational time than BANMo for two main reasons. First, the DBS employed in MoDA is inherently more time-consuming than linear blend skinning (LBS), as reported in Fig. 18 of Kavan et al. (2008). Second, the optimization process for solving the optimal transport problem adds computational overhead. However, the increased training time is acceptable considering MoDA's superior performance.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Song, C., Wei, J., Chen, T. et al. MoDA: Modeling Deformable 3D Objects from Casual Videos. Int J Comput Vis 133, 2825–2844 (2025). https://doi.org/10.1007/s11263-024-02310-5