Abstract
Contemporary diffusion models show remarkable capability in text-to-image generation, yet they remain limited to restricted resolutions (e.g., \(1024\times 1024\)). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe for tuning-free image upscaling that pivots on global-regional priors derived from the given global prompt and regional prompts estimated via a Multimodal LLM. Technically, the low-frequency component of the low-resolution image serves as a global structure prior that encourages global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen the cross-attention between the global prompt and each region during regional denoising, yielding a regional attention prior that alleviates the object-repetition issue. The estimated regional prompts, which contain rich descriptive details, further act as a regional semantic prior that fuels the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that C-Upscale generates ultra-high-resolution images (e.g., \(4096\times 4096\) and \(8192\times 8192\)) with higher visual fidelity and more creative regional details.
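To make the global-regional priors more concrete, below is a minimal, illustrative sketch of two of the core ideas; it is not the paper's implementation. It extracts a low-frequency global structure prior from the upsampled low-resolution image (here with a simple Gaussian low-pass; the paper's exact frequency decomposition may differ), enumerates overlapping regions for regional denoising, and screens cross-attention so each region attends only to text tokens deemed relevant to it. All function names, window/stride values, and the masking scheme are our assumptions for illustration.

```python
# Illustrative sketch only (hypothetical names): low-frequency global structure
# prior, overlapping regions for regional denoising, and a toy "screened"
# cross-attention that suppresses text tokens irrelevant to a region.
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def global_structure_prior(lr_image, scale, sigma=3.0):
    """Upsample the low-resolution image and keep only its low-frequency
    component, which acts as a prior for global semantic structure."""
    hr = zoom(lr_image, (scale, scale, 1), order=3)  # spline upsampling, HxWx3
    # Low frequencies carry the global layout; high frequencies are left
    # for the diffusion model to synthesize creatively.
    return gaussian_filter(hr, sigma=(sigma, sigma, 0))

def overlapping_regions(height, width, window=1024, stride=512):
    """Yield (top, left) anchors of overlapping windows, as in regional
    denoising pipelines (MultiDiffusion-style)."""
    for top in range(0, max(height - window, 0) + 1, stride):
        for left in range(0, max(width - window, 0) + 1, stride):
            yield top, left

def screened_cross_attention(q, k, v, keep_mask):
    """Toy cross-attention where text tokens flagged as irrelevant to the
    current region (keep_mask == False) are masked out, mimicking the role
    of a regional attention prior. Assumes at least one token is kept."""
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_queries, num_tokens)
    scores[:, ~keep_mask] = -np.inf           # screen out irrelevant tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    lr = np.random.rand(256, 256, 3).astype(np.float32)  # stand-in LR image
    prior = global_structure_prior(lr, scale=4)           # 1024 x 1024 prior
    regions = list(overlapping_regions(1024, 1024, window=512, stride=256))
    q = np.random.randn(16, 64)                # 16 image-patch queries
    k = np.random.randn(77, 64)                # 77 text-token keys
    v = np.random.randn(77, 64)
    keep = np.zeros(77, dtype=bool)
    keep[:20] = True                           # tokens relevant to this region
    out = screened_cross_attention(q, k, v, keep)
    print(prior.shape, len(regions), out.shape)
```

In an actual pipeline these pieces would operate in the diffusion model's latent space and inside its cross-attention layers; the NumPy version above only illustrates the data flow.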
Data Availability
The synthetic low-resolution test images for upscaling are generated with open-source pre-trained diffusion models, including SD1.5 (Rombach et al., 2022), SDXL (Podell et al., 2024), DreamShaper XL (Dreamshaper xl, 2024), and Pixart-\(\alpha \) (Chen et al., 2024). The test prompts are randomly sampled from the LAION-5B (Schuhmann et al., 2022) and MS-COCO (Lin et al., 2014) datasets. The real-world test images are the ground-truth images corresponding to 1K prompts sampled from LAION-5B.
References
Bar-Tal, O., Yariv, L., Lipman, Y., & Dekel, T. (2023) Multidiffusion: Fusing diffusion paths for controlled image generation. In International conference on machine learning (pp. 1737–1752).
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al. (2023). Improving image generation with better captions. Computer Science, 2, 8.
Chen, J., Pan, Y., Yao, T., & Mei, T. (2023) ControlStyle: Text-driven stylized image generation using diffusion priors. In Proceedings of the 31st ACM international conference on multimedia (pp. 7540–7548).
Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wang, Z., Kwok, J., Luo, P., Lu, H., & Li, Z. (2024) Pixart-\(\alpha \): Fast training of diffusion transformer for photorealistic text-to-image synthesis. In International conference on learning representations.
Chen, Y., Chen, J., Pan, Y., Li, Y., Yao, T., Chen, Z., & Mei, T. (2024) Improving text-guided object inpainting with semantic pre-inpainting. In European conference on computer vision (pp. 110–126). Springer.
Choi, J., Kim, S., Jeong, Y., Gwon, Y., & Yoon, S. (2021) ILVR: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 14347–14356).
Dai, X., Hou, J., Ma, C.Y., Tsai, S., Wang, J., Wang, R., Zhang, P., Vandenhende, S., Wang, X., & Dubey, A., et al. (2023) Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807
Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794.
Dong, C., Loy, C. C., He, K., & Tang, X. (2015). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 295–307.
Dong, J., Bai, H., Tang, J., & Pan, J. (2023) Deep unpaired blind image super-resolution using self-supervised learning and exemplar distillation. International Journal of Computer Vision, 1–14.
Dreamshaper xl (2024). https://civitai.com/models/112902?modelVersionId=351306
Du, R., Chang, D., Hospedales, T., Song, Y.Z., & Ma, Z. (2023) DemoFusion: Democratising high-resolution image generation with no $$$. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6159–6168).
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., & Boesel, F., et al. (2024) Scaling rectified flow transformers for high-resolution image synthesis. In International conference on machine learning.
Esser, P., Rombach, R., & Ommer, B. (2021) Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12873–12883).
Fei, B., Lyu, Z., Pan, L., Zhang, J., Yang, W., Luo, T., Zhang, B., & Dai, B. (2023) Generative diffusion prior for unified image restoration and enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9935–9946).
Graps, A. (1995). An introduction to wavelets. IEEE Computational Science and Engineering, 2, 50–61.
Guo, L., He, Y., Chen, H., Xia, M., Cun, X., Wang, Y., Huang, S., Zhang, Y., Wang, X., & Chen, Q., et al. (2024) Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. In European conference on computer vision (pp. 39–55). Springer.
Haji-Ali, M., Balakrishnan, G., & Ordonez, V. (2024) ElasticDiffusion: Training-free arbitrary size image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6603–6612).
He, Y., Yang, S., Chen, H., Cun, X., Xia, M., Zhang, Y., Wang, X., He, R., Chen, Q., & Shan, Y. (2023) ScaleCrafter: Tuning-free higher-resolution visual generation with diffusion models. In International conference on learning representations.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in neural information processing systems (vol. 30, pp. 6629–6640).
Ho, J., Jain, A., & Abbeel, P. (2020) Denoising diffusion probabilistic models. In Advances in neural information processing systems (vol. 33, pp. 6840–6851).
Huang, L., Fang, R., Zhang, A., Song, G., Liu, S., Liu, Y., & Li, H. (2024) FouriScale: A frequency perspective on training-free high-resolution image synthesis. In European conference on computer vision (pp. 196–212).
Ke, J., Wang, Q., Wang, Y., Milanfar, P., & Yang, F. (2021) MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 5148–5157).
Kim, Y., Hwang, G., Zhang, J., & Park, E. (2024) DiffuseHigh: Training-free progressive high-resolution image synthesis through structure guidance. In Proceedings of the AAAI conference on artificial intelligence.
Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., & Levy, O. (2024) Pick-a-pic: An open dataset of user preferences for text-to-image generation. In Advances in neural information processing systems (vol. 36, pp. 36652–36663).
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., & Stoica, I. (2023) Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th symposium on operating systems principles (pp. 611–626).
Li, D., Kamko, A., Akhgari, E., Sabet, A., Xu, L., & Doshi, S. (2024) Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245.
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014) Microsoft COCO: Common objects in context. In European conference on computer vision (pp. 740–755).
Lin, Z., Lin, M., Meng, Z., & Ji, R. (2024) AccDiffusion: An accurate method for higher-resolution image generation. In European conference on computer vision (pp. 38–53).
Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y.J. (2024) LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/
Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., & Ermon, S. (2022) SDEdit: Guided image synthesis and editing with stochastic differential equations. In International conference on learning representations.
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., & Rombach, R. (2024) SDXL: Improving latent diffusion models for high-resolution image synthesis. In International conference on learning representations.
Qian, Y., Cai, Q., Pan, Y., Li, Y., Yao, T., Sun, Q., & Mei, T. (2024) Boosting diffusion models with moving average sampling in frequency domain. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8911–8920).
Quan, W., Chen, J., Liu, Y., Yan, D. M., & Wonka, P. (2024). Deep learning-based image and video inpainting: A survey. International Journal of Computer Vision, 132, 2367–2400.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J., et al. (2021) Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763).
Ren, J., Li, W., Chen, H., Pei, R., Shao, B., Guo, Y., Peng, L., Song, F., & Zhu, L. (2024) UltraPixel: Advancing ultra high-resolution image synthesis to new peaks. In Advances in neural information processing systems.
Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P., Wang, X., & Xiao, X. (2024) Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. In Advances in neural information processing systems.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022) High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684–10695).
Ronneberger, O., Fischer, P., & Brox, T. (2015) U-Net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention (pp. 234–241).
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., & Salimans, T., et al. (2022) Photorealistic text-to-image diffusion models with deep language understanding. In Advances in neural information processing systems (vol. 35, pp. 36479–36494).
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., & Wortsman, M., et al. (2022) LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in neural information processing systems (pp. 25278–25294).
Si, C., Huang, Z., Jiang, Y., & Liu, Z. (2023) FreeU: Free lunch in diffusion U-Net. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4733–4743).
Song, J., Meng, C., & Ermon, S. (2021) Denoising diffusion implicit models. In International conference on learning representations.
Stankovic, R. S., & Falkowski, B. J. (2003). The Haar wavelet transform: Its status and achievements. Computers & Electrical Engineering, 29, 25–44.
Wan, S., Li, Y., Chen, J., Pan, Y., Yao, T., Cao, Y., & Mei, T. (2024) Improving virtual try-on with garment-focused diffusion models. In European conference on computer vision (pp. 184–199). Springer.
Wang, J., Chan, K.C., & Loy, C.C. (2023) Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI conference on artificial intelligence (vol. 37, pp. 2555–2563).
Wang, J., Yue, Z., Zhou, S., Chan, K. C., & Loy, C. C. (2024). Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 132(12), 5929–5949.
Wang, X., Xie, L., Dong, C., & Shan, Y. (2021) Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1905–1914).
Wang, Y., Yang, W., Chen, X., Wang, Y., Guo, L., Chau, L.P., Liu, Z., Qiao, Y., Kot, A.C., & Wen, B. (2024) SinSR: Diffusion-based image super-resolution in a single step. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 25796–25805).
Xin, J., Wang, N., Jiang, X., Li, J., & Gao, X. (2023). Advanced binary neural network for single image super resolution. International Journal of Computer Vision, 131, 1808–1824.
Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., & Cui, B. (2024) Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal LLMs. In International conference on machine learning (pp. 56704–56721).
Yang, S., Wu, T., Shi, S., Lao, S., Gong, Y., Cao, M., Wang, J., & Yang, Y. (2022) MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1191–1200).
Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., & Shao, L. (2021) Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14821–14831).
Zhang, K., Liang, J., Van Gool, L., & Timofte, R. (2021) Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4791–4800).
Zhang, K., Sun, M., Sun, J., Zhang, K., Sun, Z., & Tan, T. (2024). Open-vocabulary text-driven human image generation. International Journal of Computer Vision, 132(10), 4379–4397.
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., & Wang, O. (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 586–595).
Zheng, Q., Guo, Y., Deng, J., Han, J., Li, Y., Xu, S., & Xu, H. (2023) Any-size-diffusion: Toward efficient text-driven synthesis for any-size HD images. In Proceedings of the AAAI conference on artificial intelligence (vol. 38, pp. 7571–7578).
Zhu, R., Pan, Y., Li, Y., Yao, T., Sun, Z., Mei, T., & Chen, C.W. (2024) SD-DiT: Unleashing the power of self-supervised discrimination in diffusion transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8435–8445).
Acknowledgements
This work was supported in part by the Beijing Municipal Science and Technology Project No. Z241100001324002 and the Beijing Nova Program No. 20240484681.
Additional information
Communicated by Stephen Lin.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qian, Y., Cai, Q., Pan, Y. et al. Creatively Upscaling Images with Global-Regional Priors. Int J Comput Vis 133, 5197–5215 (2025). https://doi.org/10.1007/s11263-025-02424-4