Abstract
Face reenactment and face swapping share a common pattern of identity and attribute manipulation. Our previous work, UniFace, made a preliminary attempt to unify the two tasks at the feature level, but it relies heavily on accurate feature disentanglement, and its GAN-based training is unstable. In this work, we examine the intrinsic connections between the two tasks from a more general training-paradigm perspective and introduce UniFace++, a novel diffusion-based unified method. Specifically, UniFace++ combines the advantages of each task, i.e., the stability of reconstruction training from reenactment and the simplicity and effectiveness of target-oriented processing from swapping, redefining both as target-oriented reconstruction tasks. In this way, face reenactment avoids complex deformation of source features, and face swapping mitigates unstable seesaw-style optimization. The core of our approach is a rendered face obtained from reassembled 3D facial priors, which serves as the target pivot and contains precise geometry and coarse identity textures. We feed this pivot into the proposed Texture-Geometry-aware Diffusion Model (TGDM), which performs texture transfer under reconstruction supervision for high-fidelity face synthesis. Extensive quantitative and qualitative experiments demonstrate the superiority of our method on both tasks.
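To make the paradigm concrete, the following is a minimal sketch of the target-oriented reconstruction idea described above, under our own assumptions. Every name here (estimate_3dmm, reassemble, render_pivot, diffusion_recon_loss, the toy cosine noise schedule) is an illustrative stand-in, not the authors' released API: the actual system uses a real 3DMM estimator and renderer and a trained TGDM network, whereas this sketch uses random stubs so it runs end to end.

```python
# Sketch only: hypothetical names, random stand-ins for the 3DMM fit,
# renderer, and denoiser. Illustrates the unified target-oriented
# reconstruction paradigm, not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

def estimate_3dmm(img_path):
    """Stand-in 3DMM estimator: identity, expression, and pose coefficients."""
    return {"id": rng.normal(size=80), "exp": rng.normal(size=64), "pose": rng.normal(size=6)}

def reassemble(id_coeffs, motion_coeffs):
    """Reassembled priors: identity from one face, geometry (exp/pose) from the other.

    Reenactment keeps the source identity and takes motion from the driving frame;
    swapping injects the source identity into the target's exp/pose. Either way the
    rendered result acts as the 'target pivot' and the model is trained to
    reconstruct the target frame."""
    return {"id": id_coeffs["id"],
            "exp": motion_coeffs["exp"],
            "pose": motion_coeffs["pose"]}

def render_pivot(coeffs):
    """Stand-in renderer: a coarse face image with precise geometry, rough texture."""
    return rng.normal(size=(256, 256, 3))

def diffusion_recon_loss(model, target, pivot, t):
    """Standard denoising-diffusion noise-prediction loss, conditioned on the
    rendered pivot; this is the 'reconstruction supervision' of the paradigm."""
    noise = rng.normal(size=target.shape)
    alpha_bar = np.cos(t * np.pi / 2) ** 2          # toy noise schedule
    x_t = np.sqrt(alpha_bar) * target + np.sqrt(1 - alpha_bar) * noise
    return np.mean((model(x_t, pivot, t) - noise) ** 2)

# Usage: both tasks collapse into reconstructing the target frame from the pivot.
src, tgt = estimate_3dmm("source.png"), estimate_3dmm("target.png")
pivot = render_pivot(reassemble(src, tgt))
target_img = rng.normal(size=(256, 256, 3))          # ground-truth target frame
dummy_model = lambda x_t, cond, t: np.zeros_like(x_t)
print(diffusion_recon_loss(dummy_model, target_img, pivot, t=0.3))
```

The design point this sketch captures is that reenactment and swapping differ only in which input supplies the identity coefficients; once the pivot is rendered, both reduce to the same conditional denoising objective.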
Acknowledgements
This work was supported by the Key R&D Project of Zhejiang Province under Grant 2024C01172 and the National Natural Science Foundation of China under Grant 62476224.
Additional information
Communicated by Svetlana Lazebnik.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xu, C., Qian, Y., Zhu, S. et al. UniFace++: Revisiting a Unified Framework for Face Reenactment and Swapping via 3D Priors. Int J Comput Vis 133, 4538–4554 (2025). https://doi.org/10.1007/s11263-025-02395-6
DOI: https://doi.org/10.1007/s11263-025-02395-6