
Paragraph-to-Image Generation with Information-Enriched Diffusion Model

Published in: International Journal of Computer Vision

Abstract

Text-to-image models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment. However, given a long paragraph (up to 512 words), these generative models still struggle to achieve strong alignment and cannot generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model for the paragraph-to-image generation task, termed ParaDiffusion, which delves into transferring the extensive semantic comprehension capabilities of large language models to the task of image generation. At its core is the use of a large language model (e.g., Llama V2) to encode long-form text, followed by fine-tuning with LoRA to align the text-image feature spaces for the generation task. To facilitate the training of long-text semantic alignment, we also curated a high-quality paragraph-image pair dataset, namely ParaImage. This dataset comprises a small amount of high-quality, meticulously annotated data and a large-scale synthetic subset whose long text descriptions are generated by a vision-language model. Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models (SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to \(45\%\) human voting rate improvements for text faithfulness. Code and data can be found at: https://github.com/weijiawu/ParaDiffusion.
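To make the text-encoding recipe described above concrete, the sketch below shows one way to wrap a frozen Llama-2 encoder with trainable LoRA adapters and produce paragraph embeddings that could condition a diffusion model. It is a minimal illustration of the idea, not the paper's released training code; the checkpoint name, LoRA hyperparameters, and target modules are assumptions.

```python
# Minimal sketch: a frozen LLM as the paragraph encoder, with LoRA adapters as the
# only trainable parameters (hyperparameters and module names are illustrative).
import torch
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # assumed (gated) checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

# Attach low-rank adapters to the attention projections; the base weights stay frozen.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
encoder = get_peft_model(encoder, lora_cfg)

paragraph = "A long scene description of up to 512 words ..."
tokens = tokenizer(paragraph, return_tensors="pt", truncation=True, max_length=512)

# The last hidden states would serve as the conditioning sequence for the image generator.
with torch.no_grad():
    text_states = encoder(**tokens).last_hidden_state
print(text_states.shape)  # (1, seq_len, hidden_dim)
```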


References

  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10684–10695.

  • Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.

  • Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. (2022). Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, 36479.

  • Xue, Z., Song, G., Guo, Q., Liu, B., Zong, Z., Liu, Y. & Luo, P. (2023). Raphael: Text-to-image generation via large mixture of diffusion paths. arXiv preprint arXiv:2305.18295.

  • Dai, X., Hou, J., Ma, C.-Y., Tsai, S., Wang, J., Wang, R., Zhang, P. Vandenhende, S., Wang, X., Dubey, A. et al. (2023). Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807.

  • Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M. et al. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. et al. (2021). Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. PMLR, pp 8748–8763.

  • Deepfloyd. (2023). Deepfloyd. https://www.deepfloyd.ai/.

  • Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H. et al. (2023). Pixart-\(\alpha \): Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426.

  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485–5551.


  • Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., Xu, J., Xu, B., Li, J., Dong, Y., Ding, M. & Tang, J. (2023). Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.

  • Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S. et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

  • Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.


  • Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K. et al. (2022). Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789.

  • Gu, Y., Wang, X., Wu, J. Z., Shi, Y., Chen, Y., Fan, Z., Xiao, W., Zhao, R., Chang, S., Wu, W. et al. (2023). Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. arXiv preprint arXiv:2305.18292.

  • He, Y., Liu, L., Liu, J., Wu, W., Zhou, H. & Zhuang, B. (2023). Ptqd: Accurate post-training quantization for diffusion models. arXiv preprint arXiv:2305.10657.

  • Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.-H., Murphy, K., Freeman, W. T., Rubinstein, M. et al. (2023). Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704.

  • Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I. & Irani, M. (2023). Imagic: Text-based real image editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6007–6017.

  • Zhang, L., Rao, A. & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 3836–3847.

  • Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M. & Aberman, K. (2023). Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22500–22510.

  • Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y. & Cohen-Or, D. (2022). Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626.

  • Wu, W., Zhao, Y., Shou, M. Z., Zhou, H. & Shen, C. (2023). Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models. arXiv preprint arXiv:2303.11681.

  • Wu, W., Zhao, Y., Chen, H., Gu, Y., Zhao, R., He, Y., Zhou, H., Shou, M. Z. & Shen, C. (2024). Datasetdm: Synthesizing data with perception annotations using diffusion models. Advances in Neural Information Processing Systems, 36.

  • Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., & Norouzi, M. (2022). Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4713–4726.


  • Feng, Z., Zhang, Z., Yu, X., Fang, Y., Li, L., Chen, X., Lu, Y., Liu, J., Yin, W., Feng, S. et al. (2023). Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10135–10145.

  • Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B. et al. (2022). Ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324.

  • Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V. & Zettlemoyer, L. (2019). Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

  • Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F. et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

  • Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730.

  • Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D. & Hajishirzi, H. (2022). Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.

  • Lester, B., Al-Rfou, R. & Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

  • Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. & Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

  • Gani, H., Bhat, S. F., Naseer, M., Khan, S. & Wonka, P. (2023). Llm blueprint: Enabling text-to-image generation with complex and detailed prompts. arXiv preprint arXiv:2310.10640.

  • Lian, L., Li, B., Yala, A. & Darrell, T. (2023). Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655.

  • Feng, W., Zhu, W., Fu, T.-j., Jampani, V., Akula, A., He, X., Basu, S., Wang, X. E. & Wang, W. Y. (2024). Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36.

  • Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C. & Lee, Y. J. (2023). Gligen: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22511.

  • Ronneberger, O., Fischer, P. & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, pp 234–241.

  • Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J. & Rombach, R. (2023). Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.

  • Zhu, D., Chen, J., Shen, X., Li, X. & Elhoseiny, M. (2023). Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.

  • Liu, H., Li, C., Wu, Q. & Lee, Y. J. (2023). Visual instruction tuning. arXiv preprint arXiv:2304.08485.

  • Laion. (2022). LAION-Aesthetics. https://laion.ai/blog/laion-aesthetics/.

  • Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y. et al. (2023). Segment anything. arXiv preprint arXiv:2304.02643.

  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, pp 740–755.

  • Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S. & Jia, J. (2023). Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692.

  • Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I. & Chen, M. (2021). Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741.

  • Kang, M., Zhu, J.-Y., Zhang, R., Park, J., Shechtman, E., Paris, S. & Park, T. (2023). Scaling up gans for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10124.

  • Yang, B., Gu, S., Zhang, B., Zhang, T., Chen, X., Sun, X., Chen, D. & Wen, F. (2023). Paint by example: Exemplar-based image editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 18381–18391.

  • Liu, H., Li, C., Li, Y. & Lee, Y. J. (2024). Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 26296.

  • Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35, 5775–5787.


  • Song, Y., Dhariwal, P., Chen, M. & Sutskever, I. (2023). Consistency models. arXiv preprint arXiv:2303.01469.


Acknowledgements

This research is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (Award No: MOE-T2EP20124-0012).

Author information

Corresponding author

Correspondence to Mike Zheng Shou.

Additional information

Communicated by Jianfei Cai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix


1.1 More Visualizations

Figures 18 and 15 provide more visualizations of ParaDiffusion for the human-centric and scenery-centric domains, respectively. The visualizations reveal that ParaDiffusion can generate intricate and realistic composite images of individuals as well as picturesque scenes. Moreover, it is noteworthy that images generated through paragraph-image generation exhibit a compelling narrative quality, enriched with profound semantics. The generated images consistently feature intricate object details and demonstrate effective multi-object control.

Fig. 15 More visualizations of scenery-centric generation from ParaDiffusion

Fig. 16 Failure cases of ParaDiffusion. There remain some areas where ParaDiffusion can be further improved

Fig. 17 User study on the 1,600 prompts of PartiPrompts. We only selected open-source models for comparison, as the results of closed-source models (Xue et al., 2023; Radford et al., 2021) were unavailable or API calls failed

1.2 Limitations

Despite ParaDiffusion achieving excellent performance in long-text alignment and visual appeal, there are still some areas for improvement, such as inference speed. ParaDiffusion has not been optimized for speed, and adopting effective strategies such as ODE solvers (Lu et al., 2022) or consistency models (Song et al., 2023) could further improve inference speed. In addition, while ParaDiffusion is capable of producing images of high realism, undesirable instances persist, as depicted in Fig. 16. Two strategies are effective in addressing these challenges. First, at the data level, augmenting the dataset with additional high-quality images enhances diversity and contributes to further model refinement. Second, at the algorithmic level, incorporating additional constraints, such as geometric and semantic constraints, gives synthesized images greater logical and semantic coherence.
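As a concrete illustration of the first optimization direction, the sketch below swaps a DPM-Solver scheduler into an off-the-shelf diffusers pipeline to reduce the number of sampling steps. It uses a public SD XL checkpoint as a stand-in, since this is not ParaDiffusion's released inference code; the model name and step count are assumptions.

```python
# A sketch of accelerating sampling with DPM-Solver (Lu et al., 2022) via diffusers.
# SD XL is used here only as a stand-in pipeline; ParaDiffusion's own weights differ.
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Replace the default scheduler; DPM-Solver typically yields usable samples in ~20 steps.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a quiet harbor at dawn, long-exposure photograph",
    num_inference_steps=20,
).images[0]
image.save("sample.png")
```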

Fig. 18 More visualizations of human-centric generation from ParaDiffusion

Fig. 19 Visualization comparison on ViLG-300. Our ParaDiffusion exhibits competitive performance in visual appeal

Fig. 20 Visualization comparison on ParaPrompts-400. Our ParaDiffusion demonstrates significant advantages in long-text alignment

Fig. 21 Risk of conflict between visual appeal and text faithfulness. Two insights: (1) merely extending the token count of current models (SD XL, DeepFloyd-IF) does not yield satisfactory performance; (2) as the number of input tokens increases, all models experience a certain degree of decline in visual appeal

1.3 Experiments on 1,600 Prompts of PartiPrompts

We also provide the related experimental results on PartiPrompts-1600, as shown in Fig. 17. It can be observed that our model also achieved outstanding performance in Text Faithfulness, with a \(27.3\%\) human voting rate, significantly outperforming previous models such as SD XL and DeepFloyd-IF. Additionally, our model demonstrated a competitive advantage in visual appeal, surpassing SD XL and DeepFloyd-IF and approaching the performance of the contemporaneous work PIXART-\(\alpha \). A notable observation on PartiPrompts-1600 is the high proportion of ‘Tie’ votes in the human evaluations, especially for Text Faithfulness, with a voting rate of up to \(34\%\). This is attributed to the presence of numerous simple or abstract prompts in PartiPrompts-1600, which makes it challenging to provide precise votes, for example for prompts such as ‘happiness’ or ‘emotion’.

1.4 Visualization Comparison on ViLG-300 and ParaPrompts-400

To offer a more intuitive comparison, we provide visualizations comparing our model with prior works on the ViLG-300 and ParaPrompts-400 datasets, as depicted in Figs. 19 and 20. From the perspective of visual appeal, the synthesized images produced by our ParaDiffusion align well with human aesthetics. They exhibit qualities reminiscent of photographic images in terms of lighting, contrast, scenes, and photographic composition. Concerning the alignment of long-form text, our ParaDiffusion demonstrates outstanding advantages, as illustrated in Fig. 20. Previous works often struggle to precisely align each object and attribute in lengthy textual descriptions, as seen in the second row of Fig. 20: existing models frequently miss generating elements such as ‘towers’ and ‘houses’, and their relative spatial relationships are flawed. In contrast, our model excels in accurately aligning the textual description with the content of the image.

1.5 More Details for ParaImage-Small

As stated in the main text, we selected 3,000 exquisite images from a pool of 650,000 images curated by LAION-Aesthetics (Laion, 2022), adhering to common photographic principles. The detailed Aesthetic Image Selection Rule is outlined as follows:

  • The selected images will be used to annotate long-form descriptions (128-512 words, 4-10 sentences). Please assess whether the chosen images contain sufficient information (number and attributes of objects, image style) to support such lengthy textual descriptions.

  • The images should not include trademarks, any text added in post-production, and should be free of any mosaic effects.

  • Spatial Relationships between Multiple Objects: For images with multiple objects, there should be sufficient spatial hierarchy or positional relationships between these objects. For example, in a landscape photograph, the spatial distribution of mountains, lakes, and trees should create an interesting composition. There should be clear left-right relationships between multiple people.

  • Interaction between Multiple Objects: For images with multiple objects, choose scenes that showcase the interaction between the objects. This can include dialogue between characters, interactions between animals, or other interesting associations between objects.

  • Attribute of Single Object: All key details of the main subject should be clearly visible, and the subject’s attribute information should include at least three distinct aspects, including color, shape, and size. For example, in wildlife photography, the feather color, morphology, and size of an animal should be clearly visible.

  • Colors, Shapes, and Sizes of Objects: Various objects in the image should showcase diversity in colors, shapes, and sizes. This contributes to creating visually engaging scenes.

  • Clarity of the Story: The selected images should clearly convey a story or emotion. Annotators should pay attention to whether the image presents a clear and engaging narrative. For example, a couple walking on the street, a family portrait, or animals engaging in a conflict.

  • Variety of Object Categories: A diverse set of object categories enriches the content of the image. Ensure that the image encompasses various object categories to showcase diversity. For instance, on a city street, include people, cyclists, buses, and unique architectural structures simultaneously.

Following the aforementioned rules, we instructed the annotators to rate the images on a scale of 1–5, with 5 being the highest score. Subsequently, we selected images with a score of 5 as the data source for ParaImage-Small, resulting in approximately 3k images.
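For illustration, the snippet below sketches this final filtering step: keeping only the images that annotators scored 5 under the rules above. The file format and field names are assumptions rather than part of the released ParaImage tooling.

```python
# Hypothetical sketch of the ParaImage-Small selection step: retain only images
# rated 5/5 by annotators (file layout and field names are assumed for illustration).
import json

def load_ratings(path):
    """Each line is a JSON record such as {"image_id": "000123", "score": 5}."""
    with open(path) as f:
        return [json.loads(line) for line in f]

ratings = load_ratings("paraimage_small_ratings.jsonl")
selected = [r["image_id"] for r in ratings if r["score"] == 5]
print(f"kept {len(selected)} of {len(ratings)} rated images")  # roughly 3k retained
```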

1.6 Risk of Conflict between Visual Appeal and Text Faithfulness

We also explored the potential of existing architectures (e.g., SD XL, DeepFloyd-IF) for long-text alignment in image generation, as shown in Fig. 21. Firstly, all methods that use CLIP as the text encoder, such as SD XL, face limitations in supporting paragraph-image tasks because CLIP supports at most 77 input tokens. Secondly, we investigated the performance of methods using T5-XXL as the text encoder, e.g., DeepFloyd-IF (Deepfloyd, 2023) and PIXART-\(\alpha \) (Chen et al., 2023). We directly raised the token limit of these methods to accommodate longer inputs, enabling image generation settings that involve the alignment of long-form text. With an increase in the number of tokens, the visual appeal of DeepFloyd-IF declines significantly, becoming more cartoonish around 512 tokens. Furthermore, its semantic alignment is unsatisfactory, with many described objects, such as the table, missing from the generated images. Similarly, PIXART-\(\alpha \) fails to achieve satisfactory semantic alignment even with the maximum token limit increased, and its visual appeal also declines to a certain degree. In contrast, our ParaDiffusion exhibits more stable behavior, achieving good semantic alignment with 256 tokens and showing minimal decline in visual appeal as the token count increases.
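The CLIP bottleneck mentioned above is easy to verify directly: the sketch below tokenizes the same paragraph with a CLIP tokenizer and with a Llama-2 tokenizer, and only the latter preserves the full input. Checkpoint names are assumptions (the Llama-2 tokenizer is gated on Hugging Face).

```python
# Sketch: CLIP's 77-token context truncates a long paragraph, while an LLM tokenizer
# keeps it intact (checkpoints are illustrative; Llama-2 access requires approval).
from transformers import AutoTokenizer, CLIPTokenizer

paragraph = "A long scene description with many objects and attributes. " * 60

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

clip_ids = clip_tok(paragraph, truncation=True).input_ids
llama_ids = llama_tok(paragraph).input_ids

print(len(clip_ids))   # capped at 77; everything beyond is silently dropped
print(len(llama_ids))  # the full paragraph is available to an LLM-based encoder
```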

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, W., Li, Z., He, Y. et al. Paragraph-to-Image Generation with Information-Enriched Diffusion Model. Int J Comput Vis 133, 5413–5434 (2025). https://doi.org/10.1007/s11263-025-02435-1


  • DOI: https://doi.org/10.1007/s11263-025-02435-1

Keywords