Abstract
Synthesizing novel views from a single-view image is a highly ill-posed problem. We present an effective way to reduce the learning ambiguity by expanding the single-view view synthesis problem into a multi-view setting. Specifically, we leverage a reliable and explicit stereo prior to generate a pseudo-stereo viewpoint, which serves as an auxiliary input for constructing the 3D space. In this way, the challenging novel view synthesis process is decoupled into two simpler problems: stereo synthesis and 3D reconstruction. To synthesize a structurally correct and detail-preserving stereo image, we propose a self-rectified stereo synthesis that amends erroneous regions in an identify-rectify manner. Hard-to-train and incorrectly warped samples are first discovered by two strategies: (1) pruning the network to reveal low-confidence predictions, and (2) bidirectionally matching between the stereo images to expose improper mappings. These regions are then inpainted to form the final pseudo-stereo. With the aid of this extra input, a preferable 3D reconstruction can be easily obtained, and our method works with arbitrary 3D representations. Extensive experiments show that our method outperforms state-of-the-art single-view view synthesis and stereo synthesis methods.
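To make the bidirectional-matching idea concrete, the following is a minimal NumPy sketch of a standard left-right disparity consistency check of the kind used to flag improper mappings between a stereo pair. The function name, threshold, and disparity conventions are illustrative assumptions for a rectified pair, not the paper's actual implementation.

```python
import numpy as np

def lr_consistency_mask(disp_left, disp_right, thresh=1.0):
    """Flag pixels whose left- and right-referenced disparities disagree
    (likely occlusions or erroneous warps). Hypothetical helper; disparities
    are in pixels for a rectified stereo pair."""
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)   # column index of each left pixel
    ys = np.arange(h)[:, None].repeat(w, axis=1)   # row index of each left pixel
    # Location each left pixel maps to in the right view (x_R = x_L - d_L).
    x_right = np.clip(np.round(xs - disp_left).astype(int), 0, w - 1)
    # Disparity predicted at that right-view location.
    disp_reproj = disp_right[ys, x_right]
    # Pixels where the two disparities disagree are candidates for rectification.
    return np.abs(disp_left - disp_reproj) > thresh
```

In a pipeline of this kind, pixels flagged by such a mask (together with low-confidence regions revealed by pruning) would be handed to an inpainting module before the pseudo-stereo view is used for 3D reconstruction.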
Acknowledgements
This project is supported by the National Natural Science Foundation of China (No. 61972162); Project of Strategic Importance in The Hong Kong Polytechnic University (project no. 1-ZE2Q); Guangdong Natural Science Foundation (No. 2021A1515012625); Guangdong Natural Science Funds for Distinguished Young Scholar (No. 2023B1515020097); Singapore Ministry of Education Academic Research Fund Tier 1 (MSS23C002).
Additional information
Communicated by Boxin Shi, Ph.D.
About this article
Cite this article
Zhou, Y., Wu, H., Liu, W. et al. Single-View View Synthesis with Self-rectified Pseudo-Stereo. Int J Comput Vis 131, 2032–2043 (2023). https://doi.org/10.1007/s11263-023-01803-z