
RIGID: Recurrent GAN Inversion and Editing of Real Face Videos and Beyond

International Journal of Computer Vision

Abstract

GAN inversion is essential for harnessing the editability of GANs on real images, yet existing methods that invert video frames individually often yield temporally inconsistent results. To address this issue, we present a unified recurrent framework, Recurrent vIdeo GAN Inversion and eDiting (RIGID), designed to enforce temporally coherent GAN inversion and facial editing of real videos explicitly and simultaneously. Our approach models the temporal relations between the current and previous frames in three ways: (1) by maximizing inversion fidelity and consistency through learning a temporally compensated latent code and spatial features, (2) by disentangling high-frequency incoherent noise from the latent space, and (3) by introducing an in-between frame composition constraint that eliminates inconsistency after attribute manipulation by requiring each frame to be a direct composite of its neighbors. Compared to existing video- and attribute-specific methods, RIGID eliminates the need for expensive model re-training and runs approximately 60\(\times\) faster. Furthermore, RIGID can be easily extended to other face domains, showcasing its versatility and adaptability. Extensive experiments demonstrate that RIGID outperforms state-of-the-art methods in both inversion and editing tasks, qualitatively and quantitatively.
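To make the three temporal mechanisms above concrete, the Python sketch below illustrates how a recurrent inversion loop and the in-between frame composition constraint could be wired together. This is a minimal illustration, not the authors' implementation: encoder, temporal_module, generator, and interpolator are hypothetical stand-ins for a StyleGAN encoder, the recurrent compensation module, a pretrained generator, and a frozen frame-interpolation network, respectively.

    import torch.nn.functional as F

    def invert_video(frames, encoder, generator, temporal_module):
        # Recurrent inversion sketch: each frame's latent code is refined with
        # information carried over from the previous frame before decoding.
        latents, recons, state = [], [], None
        for x_t in frames:
            w_t = encoder(x_t)                        # per-frame latent code
            w_t, state = temporal_module(w_t, state)  # temporally compensated code
            latents.append(w_t)
            recons.append(generator(w_t))             # reconstructed frame
        return latents, recons

    def inbetween_composition_loss(edited_prev, edited_cur, edited_next, interpolator):
        # In-between frame composition constraint: the edited middle frame should
        # be reproducible by interpolating its two edited neighbors, which
        # penalizes flicker introduced by attribute manipulation.
        synthesized_cur = interpolator(edited_prev, edited_next)
        return F.l1_loss(edited_cur, synthesized_cur)

In practice such a composition term would be combined with per-frame reconstruction and latent-regularization losses; the exact formulation and weighting belong to the original method and are not reproduced here.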

Notes

  1. Here, we omit the “\(+\)” in “\(w_t\)” that denotes the \(w+\) latent code, for simplicity.

Acknowledgements

This work is partially supported by the Guangdong Natural Science Funds for Distinguished Young Scholar under Grant 2023B1515020097, the National Research Foundation Singapore under the AI Singapore Programme (Grant AISG3-GV-2023-011), the National Key R&D Program of China (No. 2022ZD0161000), and the General Research Fund of Hong Kong (Nos. 17200622 and 17209324).

Author information

Corresponding authors

Correspondence to Shengfeng He or Ping Luo.

Additional information

Communicated by Ming-Hsuan Yang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (mp4 450385 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xu, Y., He, S., Wong, K.Y.K. et al. RIGID: Recurrent GAN Inversion and Editing of Real Face Videos and Beyond. Int J Comput Vis 133, 3437–3455 (2025). https://doi.org/10.1007/s11263-024-02329-8

Keywords