Abstract
Contemporary diffusion models show remarkable capability in text-to-image generation, yet they remain limited to restricted resolutions (e.g., \(1024\times 1024\)). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe for tuning-free image upscaling that pivots on global-regional priors derived from the given global prompt and regional prompts estimated via a Multimodal LLM. Technically, the low-frequency component of the low-resolution image serves as a global structure prior that encourages global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen the cross-attention between the global prompt and each region during regional denoising, yielding a regional attention prior that alleviates the object-repetition issue. The estimated regional prompts, which contain rich descriptive details, further act as a regional semantic prior that fuels the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that C-Upscale generates ultra-high-resolution images (e.g., \(4096\times 4096\) and \(8192\times 8192\)) with higher visual fidelity and more creative regional details.
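To make the global-regional priors more concrete, below is a minimal, illustrative sketch of two of the core ideas; it is not the paper's implementation. It extracts a low-frequency global structure prior from the upsampled low-resolution image (here with a simple Gaussian low-pass; the paper's exact frequency decomposition may differ), enumerates overlapping regions for regional denoising, and screens cross-attention so each region attends only to text tokens deemed relevant to it. All function names, window/stride values, and the masking scheme are our assumptions for illustration.

```python
# Illustrative sketch only (hypothetical names): low-frequency global structure
# prior, overlapping regions for regional denoising, and a toy "screened"
# cross-attention that suppresses text tokens irrelevant to a region.
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def global_structure_prior(lr_image, scale, sigma=3.0):
    """Upsample the low-resolution image and keep only its low-frequency
    component, which acts as a prior for global semantic structure."""
    hr = zoom(lr_image, (scale, scale, 1), order=3)  # spline upsampling, HxWx3
    # Low frequencies carry the global layout; high frequencies are left
    # for the diffusion model to synthesize creatively.
    return gaussian_filter(hr, sigma=(sigma, sigma, 0))

def overlapping_regions(height, width, window=1024, stride=512):
    """Yield (top, left) anchors of overlapping windows, as in regional
    denoising pipelines (MultiDiffusion-style)."""
    for top in range(0, max(height - window, 0) + 1, stride):
        for left in range(0, max(width - window, 0) + 1, stride):
            yield top, left

def screened_cross_attention(q, k, v, keep_mask):
    """Toy cross-attention where text tokens flagged as irrelevant to the
    current region (keep_mask == False) are masked out, mimicking the role
    of a regional attention prior. Assumes at least one token is kept."""
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_queries, num_tokens)
    scores[:, ~keep_mask] = -np.inf           # screen out irrelevant tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    lr = np.random.rand(256, 256, 3).astype(np.float32)  # stand-in LR image
    prior = global_structure_prior(lr, scale=4)           # 1024 x 1024 prior
    regions = list(overlapping_regions(1024, 1024, window=512, stride=256))
    q = np.random.randn(16, 64)                # 16 image-patch queries
    k = np.random.randn(77, 64)                # 77 text-token keys
    v = np.random.randn(77, 64)
    keep = np.zeros(77, dtype=bool)
    keep[:20] = True                           # tokens relevant to this region
    out = screened_cross_attention(q, k, v, keep)
    print(prior.shape, len(regions), out.shape)
```

In an actual pipeline these pieces would operate in the diffusion model's latent space and inside its cross-attention layers; the NumPy version above only illustrates the data flow.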
Data Availability
The synthetic low-resolution test images for upscaling are generated with open-source pre-trained diffusion models, including SD1.5 (Rombach et al., 2022), SDXL (Podell et al., 2024), DreamShaper XL (Dreamshaper xl, 2024), and Pixart-\(\alpha \) (Chen et al., 2024). The test prompts are randomly sampled from the LAION-5B (Schuhmann et al., 2022) and MS-COCO (Lin et al., 2014) datasets. The real-world test images are the ground-truth images corresponding to 1K prompts sampled from LAION-5B.
References
Bar-Tal, O., Yariv, L., Lipman, Y., & Dekel, T. (2023) Multidiffusion: Fusing diffusion paths for controlled image generation. In International conference on machine learning (pp. 1737–1752).
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al. (2023). Improving image generation with better captions. Computer Science, 2, 8.
Chen, J., Pan, Y., Yao, T., & Mei, T. (2023) ControlStyle: Text-driven stylized image generation using diffusion priors. In Proceedings of the 31st ACM international conference on multimedia (pp. 7540–7548).
Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wang, Z., Kwok, J., Luo, P., Lu, H., & Li, Z. (2024) Pixart-\(\alpha \): Fast training of diffusion transformer for photorealistic text-to-image synthesis. In International conference on learning representations.
Chen, Y., Chen, J., Pan, Y., Li, Y., Yao, T., Chen, Z., & Mei, T. (2024) Improving text-guided object inpainting with semantic pre-inpainting. In European conference on computer vision (pp. 110–126). Springer.
Choi, J., Kim, S., Jeong, Y., Gwon, Y., & Yoon, S. (2021) ILVR: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 14347–14356).
Dai, X., Hou, J., Ma, C.Y., Tsai, S., Wang, J., Wang, R., Zhang, P., Vandenhende, S., Wang, X., & Dubey, A., et al. (2023) Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807
Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794.
Dong, C., Loy, C. C., He, K., & Tang, X. (2015). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 295–307.
Dong, J., Bai, H., Tang, J., & Pan, J. (2023) Deep unpaired blind image super-resolution using self-supervised learning and exemplar distillation. International Journal of Computer Vision, 1–14.
Dreamshaper xl (2024). https://civitai.com/models/112902?modelVersionId=351306
Du, R., Chang, D., Hospedales, T., Song, Y.Z., & Ma, Z. (2023) DemoFusion: Democratising high-resolution image generation with no $$$. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6159–6168).
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., & Boesel, F., et al. (2024) Scaling rectified flow transformers for high-resolution image synthesis. In International conference on machine learning.
Esser, P., Rombach, R., & Ommer, B. (2021) Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12873–12883).
Fei, B., Lyu, Z., Pan, L., Zhang, J., Yang, W., Luo, T., Zhang, B., & Dai, B. (2023) Generative diffusion prior for unified image restoration and enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9935–9946).
Graps, A. (1995). An introduction to wavelets. IEEE Computational Science and Engineering, 2, 50–61.
Guo, L., He, Y., Chen, H., Xia, M., Cun, X., Wang, Y., Huang, S., Zhang, Y., Wang, X., & Chen, Q., et al. (2024) Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. In European conference on computer vision (pp. 39–55). Springer.
Haji-Ali, M., Balakrishnan, G., & Ordonez, V. (2024) ElasticDiffusion: Training-free arbitrary size image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6603–6612).
He, Y., Yang, S., Chen, H., Cun, X., Xia, M., Zhang, Y., Wang, X., He, R., Chen, Q., & Shan, Y. (2023) ScaleCrafter: Tuning-free higher-resolution visual generation with diffusion models. In International conference on learning representations.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in neural information processing systems (vol. 30, pp. 6629–6640).
Ho, J., Jain, A., & Abbeel, P. (2020) Denoising diffusion probabilistic models. In Advances in neural information processing systems (vol. 33, pp. 6840–6851).
Huang, L., Fang, R., Zhang, A., Song, G., Liu, S., Liu, Y., & Li, H. (2024) FouriScale: A frequency perspective on training-free high-resolution image synthesis. In European conference on computer vision (pp. 196–212).
Ke, J., Wang, Q., Wang, Y., Milanfar, P., & Yang, F. (2021) MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 5148–5157).
Kim, Y., Hwang, G., Zhang, J., & Park, E. (2024) DiffuseHigh: Training-free progressive high-resolution image synthesis through structure guidance. In Proceedings of the AAAI conference on artificial intelligence.
Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., & Levy, O. (2024) Pick-a-pic: An open dataset of user preferences for text-to-image generation. In Advances in neural information processing systems (vol. 36, pp. 36652–36663).
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., & Stoica, I. (2023) Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th symposium on operating systems principles (pp. 611–626).
Li, D., Kamko, A., Akhgari, E., Sabet, A., Xu, L., & Doshi, S. (2024) Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245.
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014) Microsoft COCO: Common objects in context. In European conference on computer vision (pp. 740–755).
Lin, Z., Lin, M., Meng, Z., & Ji, R. (2024) AccDiffusion: An accurate method for higher-resolution image generation. In European conference on computer vision (pp. 38–53).
Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y.J. (2024) LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/
Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., & Ermon, S. (2022) SDEdit: Guided image synthesis and editing with stochastic differential equations. In International conference on learning representations.
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., & Rombach, R. (2024) SDXL: Improving latent diffusion models for high-resolution image synthesis. In International conference on learning representations.
Qian, Y., Cai, Q., Pan, Y., Li, Y., Yao, T., Sun, Q., & Mei, T. (2024) Boosting diffusion models with moving average sampling in frequency domain. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8911–8920).
Quan, W., Chen, J., Liu, Y., Yan, D. M., & Wonka, P. (2024). Deep learning-based image and video inpainting: A survey. International Journal of Computer Vision, 132, 2367–2400.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J., et al. (2021) Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763).
Ren, J., Li, W., Chen, H., Pei, R., Shao, B., Guo, Y., Peng, L., Song, F., & Zhu, L. (2024) UltraPixel: Advancing ultra high-resolution image synthesis to new peaks. In Advances in neural information processing systems.
Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P., Wang, X., & Xiao, X. (2024) Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. In Advances in neural information processing systems.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022) High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684–10695).
Ronneberger, O., Fischer, P., & Brox, T. (2015) U-Net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention (pp. 234–241).
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., & Salimans, T., et al. (2022) Photorealistic text-to-image diffusion models with deep language understanding. In Advances in neural information processing systems (vol. 35, pp. 36479–36494).
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., & Wortsman, M., et al. (2022) LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in neural information processing systems (pp. 25278–25294).
Si, C., Huang, Z., Jiang, Y., & Liu, Z. (2023) FreeU: Free lunch in diffusion U-Net. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4733–4743).
Song, J., Meng, C., & Ermon, S. (2021) Denoising diffusion implicit models. In International conference on learning representations.
Stankovic, R. S., & Falkowski, B. J. (2003). The Haar wavelet transform: Its status and achievements. Computers & Electrical Engineering, 29, 25–44.
Wan, S., Li, Y., Chen, J., Pan, Y., Yao, T., Cao, Y., & Mei, T. (2024) Improving virtual try-on with garment-focused diffusion models. In European conference on computer vision (pp. 184–199). Springer.
Wang, J., Chan, K.C., & Loy, C.C. (2023) Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI conference on artificial intelligence (vol. 37, pp. 2555–2563).
Wang, J., Yue, Z., Zhou, S., Chan, K. C., & Loy, C. C. (2024). Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 132(12), 5929–5949.
Wang, X., Xie, L., Dong, C., & Shan, Y. (2021) Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1905–1914).
Wang, Y., Yang, W., Chen, X., Wang, Y., Guo, L., Chau, L.P., Liu, Z., Qiao, Y., Kot, A.C., & Wen, B. (2024) SinSR: Diffusion-based image super-resolution in a single step. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 25796–25805).
Xin, J., Wang, N., Jiang, X., Li, J., & Gao, X. (2023). Advanced binary neural network for single image super resolution. International Journal of Computer Vision, 131, 1808–1824.
Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., & Cui, B. (2024) Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal LLMs. In International conference on machine learning (pp. 56704–56721).
Yang, S., Wu, T., Shi, S., Lao, S., Gong, Y., Cao, M., Wang, J., & Yang, Y. (2022) MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1191–1200).
Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., & Shao, L. (2021) Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14821–14831).
Zhang, K., Liang, J., Van Gool, L., & Timofte, R. (2021) Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4791–4800).
Zhang, K., Sun, M., Sun, J., Zhang, K., Sun, Z., & Tan, T. (2024). Open-vocabulary text-driven human image generation. International Journal of Computer Vision, 132(10), 4379–4397.
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., & Wang, O. (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 586–595).
Zheng, Q., Guo, Y., Deng, J., Han, J., Li, Y., Xu, S., & Xu, H. (2023) Any-size-diffusion: Toward efficient text-driven synthesis for any-size HD images. In Proceedings of the AAAI conference on artificial intelligence (vol. 38, pp. 7571–7578).
Zhu, R., Pan, Y., Li, Y., Yao, T., Sun, Z., Mei, T., & Chen, C.W. (2024) SD-DiT: Unleashing the power of self-supervised discrimination in diffusion transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8435–8445).
Acknowledgements
This work was supported in part by the Beijing Municipal Science and Technology Project No. Z241100001324002 and the Beijing Nova Program No. 20240484681.
Additional information
Communicated by Stephen Lin.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qian, Y., Cai, Q., Pan, Y. et al. Creatively Upscaling Images with Global-Regional Priors. Int J Comput Vis 133, 5197–5215 (2025). https://doi.org/10.1007/s11263-025-02424-4