Creatively Upscaling Images with Global-Regional Priors

International Journal of Computer Vision

Abstract

Contemporary diffusion models show remarkable capability in text-to-image generation, yet remain limited to restricted resolutions (e.g., \(1024\times 1024\)). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe for tuning-free image upscaling that pivots on global-regional priors derived from the given global prompt and regional prompts estimated via a Multimodal LLM. Technically, the low-frequency component of the low-resolution image is taken as a global structure prior that encourages global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between the global prompt and each region during regional denoising, yielding a regional attention prior that alleviates the object-repetition issue. The estimated regional prompts, which contain rich descriptive details, further act as a regional semantic prior that fuels the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that C-Upscale generates ultra-high-resolution images (e.g., \(4096\times 4096\) and \(8192\times 8192\)) with higher visual fidelity and more creative regional details.
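As an illustration of the first prior, the sketch below shows one way to obtain a low-frequency global structure prior from a low-resolution image. It is a minimal, hypothetical Python example assuming a single-level Haar wavelet decomposition (the references below include Haar wavelet work, but the exact low-pass operator used in C-Upscale is an assumption here, as is the helper name low_frequency_prior).

```python
# Minimal sketch (not the authors' released code): extract a low-frequency
# global structure prior from a low-resolution image. The single-level Haar
# wavelet decomposition is an assumed choice of low-pass operator.
import numpy as np
import pywt  # PyWavelets


def low_frequency_prior(image: np.ndarray) -> np.ndarray:
    """Keep the Haar approximation band per channel and zero the detail
    bands, reconstructing an image that preserves only global layout."""
    channels = []
    for c in range(image.shape[-1]):
        # dwt2 returns the approximation band cA and three detail bands.
        cA, (cH, cV, cD) = pywt.dwt2(image[..., c], "haar")
        # Passing None for the detail bands reconstructs them as zeros.
        channels.append(pywt.idwt2((cA, (None, None, None)), "haar"))
    return np.stack(channels, axis=-1)


if __name__ == "__main__":
    lr = np.random.rand(1024, 1024, 3).astype(np.float32)  # stand-in LR image
    prior = low_frequency_prior(lr)
    print(prior.shape)  # (1024, 1024, 3)
```

Conditioning high-resolution denoising on such a low-pass image constrains global layout while leaving high-frequency content free to be re-synthesized, which is where the regional attention and semantic priors come into play.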

Data Availability

The synthetic low-resolution test images for upscaling are generated with open-source pre-trained diffusion models, including SD1.5 (Rombach et al., 2022), SDXL (Podell et al., 2024), DreamShaper XL (DreamShaper XL, 2024), and Pixart-\(\alpha \) (Chen et al., 2024). The test prompts are randomly sampled from the LAION-5B (Schuhmann et al., 2022) and MS-COCO (Lin et al., 2014) datasets. The real-world test images are the ground-truth images corresponding to 1K sampled LAION-5B prompts.

References

  • Bar-Tal, O., Yariv, L., Lipman, Y., & Dekel, T. (2023) MultiDiffusion: Fusing diffusion paths for controlled image generation. In International conference on machine learning (pp. 1737–1752).

  • Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al. (2023). Improving image generation with better captions. Computer Science, 2, 8.

  • Chen, J., Pan, Y., Yao, T., & Mei, T. (2023) ControlStyle: Text-driven stylized image generation using diffusion priors. In Proceedings of the 31st ACM international conference on multimedia (pp. 7540–7548).

  • Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wang, Z., Kwok, J., Luo, P., Lu, H., & Li, Z. (2024) Pixart-\(\alpha \): Fast training of diffusion transformer for photorealistic text-to-image synthesis. In International conference on learning representations.

  • Chen, Y., Chen, J., Pan, Y., Li, Y., Yao, T., Chen, Z., & Mei, T. (2024) Improving text-guided object inpainting with semantic pre-inpainting. In European conference on computer vision (pp. 110–126). Springer.

  • Choi, J., Kim, S., Jeong, Y., Gwon, Y., & Yoon, S. (2021) ILVR: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 14347–14356).

  • Dai, X., Hou, J., Ma, C.Y., Tsai, S., Wang, J., Wang, R., Zhang, P., Vandenhende, S., Wang, X., & Dubey, A., et al. (2023) Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807

  • Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794.

  • Dong, C., Loy, C. C., He, K., & Tang, X. (2015). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 295–307.

  • Dong, J., Bai, H., Tang, J., & Pan, J. (2023) Deep unpaired blind image super-resolution using self-supervised learning and exemplar distillation. International Journal of Computer Vision, 1–14.

  • DreamShaper XL (2024). https://civitai.com/models/112902?modelVersionId=351306

  • Du, R., Chang, D., Hospedales, T., Song, Y.Z., & Ma, Z. (2023) DemoFusion: Democratising high-resolution image generation with no $$$. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6159–6168).

  • Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., & Boesel, F., et al. (2024) Scaling rectified flow transformers for high-resolution image synthesis. In International conference on machine learning.

  • Esser, P., Rombach, R., & Ommer, B. (2021) Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12873–12883).

  • Fei, B., Lyu, Z., Pan, L., Zhang, J., Yang, W., Luo, T., Zhang, B., & Dai, B. (2023) Generative diffusion prior for unified image restoration and enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9935–9946).

  • Graps, A. (1995). An introduction to wavelets. IEEE Computational Science and Engineering, 2, 50–61.

  • Guo, L., He, Y., Chen, H., Xia, M., Cun, X., Wang, Y., Huang, S., Zhang, Y., Wang, X., & Chen, Q., et al. (2024) Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. In European conference on computer vision (pp. 39–55). Springer.

  • Haji-Ali, M., Balakrishnan, G., & Ordonez, V. (2024) ElasticDiffusion: Training-free arbitrary size image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6603–6612).

  • He, Y., Yang, S., Chen, H., Cun, X., Xia, M., Zhang, Y., Wang, X., He, R., Chen, Q., & Shan, Y. (2023) ScaleCrafter: Tuning-free higher-resolution visual generation with diffusion models. In International conference on learning representations.

  • Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in neural information processing systems (vol. 30, pp. 6629–6640).

  • Ho, J., Jain, A., & Abbeel, P. (2020) Denoising diffusion probabilistic models. In Advances in neural information processing systems (vol. 33, pp. 6840–6851).

  • Huang, L., Fang, R., Zhang, A., Song, G., Liu, S., Liu, Y., & Li, H. (2024) FouriScale: A frequency perspective on training-free high-resolution image synthesis. In European conference on computer vision (pp. 196–212).

  • Ke, J., Wang, Q., Wang, Y., Milanfar, P., & Yang, F. (2021) MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 5148–5157).

  • Kim, Y., Hwang, G., Zhang, J., & Park, E. (2024) DiffuseHigh: Training-free progressive high-resolution image synthesis through structure guidance. In Proceedings of the AAAI conference on artificial intelligence.

  • Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., & Levy, O. (2024) Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In Advances in neural information processing systems (vol. 36, pp. 36652–36663).

  • Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., & Stoica, I. (2023) Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th symposium on operating systems principles (pp. 611–626).

  • Li, D., Kamko, A., Akhgari, E., Sabet, A., Xu, L., & Doshi, S. (2024) Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245.

  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014) Microsoft COCO: Common objects in context. In European conference on computer vision (pp. 740–755).

  • Lin, Z., Lin, M., Meng, Z., & Ji, R. (2024) AccDiffusion: An accurate method for higher-resolution image generation. In European conference on computer vision (pp. 38–53).

  • Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y.J. (2024) LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/

  • Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., & Ermon, S. (2022) SDEdit: Guided image synthesis and editing with stochastic differential equations. In International conference on learning representations.

  • Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., & Rombach, R. (2024) SDXL: Improving latent diffusion models for high-resolution image synthesis. In International conference on learning representations.

  • Qian, Y., Cai, Q., Pan, Y., Li, Y., Yao, T., Sun, Q., & Mei, T. (2024) Boosting diffusion models with moving average sampling in frequency domain. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8911–8920).

  • Quan, W., Chen, J., Liu, Y., Yan, D. M., & Wonka, P. (2024). Deep learning-based image and video inpainting: A survey. International Journal of Computer Vision, 132, 2367–2400.

  • Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J., et al. (2021) Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763).

  • Ren, J., Li, W., Chen, H., Pei, R., Shao, B., Guo, Y., Peng, L., Song, F., & Zhu, L. (2024) UltraPixel: Advancing ultra high-resolution image synthesis to new peaks. In Advances in neural information processing systems.

  • Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P., Wang, X., & Xiao, X. (2024) Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. In Advances in neural information processing systems.

  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022) High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684–10695).

  • Ronneberger, O., Fischer, P., & Brox, T. (2015) U-Net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention (pp. 234–241).

  • Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., & Salimans, T., et al. (2022) Photorealistic text-to-image diffusion models with deep language understanding. In Advances in neural information processing systems (vol. 35, pp. 36479–36494).

  • Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., & Wortsman, M., et al. (2022) LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in neural information processing systems (pp. 25278–25294).

  • Si, C., Huang, Z., Jiang, Y., & Liu, Z. (2023) FreeU: Free lunch in diffusion U-Net. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4733–4743).

  • Song, J., Meng, C., & Ermon, S. (2021) Denoising diffusion implicit models. In International conference on learning representations.

  • Stankovic, R. S., & Falkowski, B. J. (2003). The Haar wavelet transform: Its status and achievements. Computers & Electrical Engineering, 29, 25–44.

  • Wan, S., Li, Y., Chen, J., Pan, Y., Yao, T., Cao, Y., & Mei, T. (2024) Improving virtual try-on with garment-focused diffusion models. In European conference on computer vision (pp. 184–199). Springer.

  • Wang, J., Chan, K.C., & Loy, C.C. (2023) Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI conference on artificial intelligence (vol. 37, pp. 2555–2563).

  • Wang, J., Yue, Z., Zhou, S., Chan, K. C., & Loy, C. C. (2024). Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 132(12), 5929–5949.

  • Wang, X., Xie, L., Dong, C., & Shan, Y. (2021) Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1905–1914).

  • Wang, Y., Yang, W., Chen, X., Wang, Y., Guo, L., Chau, L.P., Liu, Z., Qiao, Y., Kot, A.C., & Wen, B. (2024) SinSR: Diffusion-based image super-resolution in a single step. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 25796–25805).

  • Xin, J., Wang, N., Jiang, X., Li, J., & Gao, X. (2023). Advanced binary neural network for single image super resolution. International Journal of Computer Vision, 131, 1808–1824.

  • Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., & Cui, B. (2024) Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal LLMs. In International conference on machine learning (pp. 56704–56721).

  • Yang, S., Wu, T., Shi, S., Lao, S., Gong, Y., Cao, M., Wang, J., & Yang, Y. (2022) MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1191–1200).

  • Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., & Shao, L. (2021) Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14821–14831).

  • Zhang, K., Liang, J., Van Gool, L., & Timofte, R. (2021) Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4791–4800).

  • Zhang, K., Sun, M., Sun, J., Zhang, K., Sun, Z., & Tan, T. (2024). Open-vocabulary text-driven human image generation. International Journal of Computer Vision, 132(10), 4379–4397.

  • Zhang, R., Isola, P., Efros, A.A., Shechtman, E., & Wang, O. (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 586–595).

  • Zheng, Q., Guo, Y., Deng, J., Han, J., Li, Y., Xu, S., & Xu, H. (2023) Any-size-diffusion: Toward efficient text-driven synthesis for any-size HD images. In Proceedings of the AAAI conference on artificial intelligence (vol. 38, pp. 7571–7578).

  • Zhu, R., Pan, Y., Li, Y., Yao, T., Sun, Z., Mei, T., & Chen, C.W. (2024) SD-DiT: Unleashing the power of self-supervised discrimination in diffusion transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8435–8445).

Acknowledgements

This work was supported in part by the Beijing Municipal Science and Technology Project No. Z241100001324002 and Beijing Nova Program No. 20240484681.

Author information

Corresponding author

Correspondence to Ting Yao.

Additional information

Communicated by Stephen Lin.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qian, Y., Cai, Q., Pan, Y. et al. Creatively Upscaling Images with Global-Regional Priors. Int J Comput Vis 133, 5197–5215 (2025). https://doi.org/10.1007/s11263-025-02424-4
