Abstract
Text-to-image models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment. However, given a long paragraph (up to 512 words), these generative models still struggle to achieve strong alignment and cannot generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model for the paragraph-to-image generation task, termed ParaDiffusion, which delves into transferring the extensive semantic comprehension capabilities of large language models to image generation. At its core, a large language model (e.g., Llama V2) encodes the long-form text and is then fine-tuned with LoRA to align the text-image feature spaces for the generation task. To facilitate the training of long-text semantic alignment, we also curate a high-quality paragraph-image pair dataset, namely ParaImage. It comprises a small set of high-quality, meticulously annotated pairs and a large-scale synthetic set whose long text descriptions are generated with a vision-language model. Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models (SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to \(45\%\) improvement in human voting rate for text faithfulness. Code and data can be found at: https://github.com/weijiawu/ParaDiffusion.
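As a rough illustration of this encoder design, the sketch below shows how one might wrap a Llama-2 checkpoint with LoRA adapters and expose its hidden states as conditioning for a diffusion model. It assumes the Hugging Face transformers and peft libraries; the checkpoint name, LoRA rank, and target modules are illustrative choices, not necessarily ParaDiffusion's exact configuration.

```python
# Sketch: encode a long paragraph with a LoRA-adapted LLM and expose its hidden
# states as the conditioning sequence for a diffusion model. Checkpoint name and
# LoRA hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 ships without a pad token

text_encoder = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)

# Attach LoRA adapters; during text-image alignment only these low-rank
# matrices would receive gradients, while the base LLM stays frozen.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
text_encoder = get_peft_model(text_encoder, lora_cfg)

paragraph = "A long, multi-sentence description of a complex scene (up to 512 words) ..."
tokens = tokenizer(paragraph, max_length=512, padding="max_length",
                   truncation=True, return_tensors="pt")

with torch.no_grad():  # encoding only; training would back-propagate into the adapters
    hidden = text_encoder(**tokens).last_hidden_state  # (1, 512, hidden_dim)

# `hidden` would then be fed to the diffusion model's cross-attention layers in
# place of a CLIP/T5 text embedding.
```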
References
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10684–10695.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. (2022). Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35, 36479–36494.
Xue, Z., Song, G., Guo, Q., Liu, B., Zong, Z., Liu, Y. & Luo, P. (2023). Raphael: Text-to-image generation via large mixture of diffusion paths. arXiv preprint arXiv:2305.18295.
Dai, X., Hou, J., Ma, C.-Y., Tsai, S., Wang, J., Wang, R., Zhang, P., Vandenhende, S., Wang, X., Dubey, A. et al. (2023). Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807.
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M. et al. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. et al. (2021). Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. PMLR, pp 8748–8763.
Deepfloyd. (2023). Deepfloyd. https://www.deepfloyd.ai/.
Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H. et al. (2023). Pixart-\(\alpha \): Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485–5551.
Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., Xu, J., Xu, B., Li, J., Dong, Y., Ding, M. & Tang, J. (2023). Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S. et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K. et al. (2022). Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789.
Gu, Y., Wang, X., Wu, J. Z., Shi, Y., Chen, Y., Fan, Z., Xiao, W., Zhao, R., Chang, S., Wu, W. et al. (2023). Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. arXiv preprint arXiv:2305.18292.
He, Y., Liu, L., Liu, J., Wu, W., Zhou, H. & Zhuang, B. (2023). Ptqd: Accurate post-training quantization for diffusion models. arXiv preprint arXiv:2305.10657.
Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.-H., Murphy, K., Freeman, W. T., Rubinstein, M. et al. (2023). Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704.
Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I. & Irani, M. (2023). Imagic: Text-based real image editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6007–6017.
Zhang, L., Rao, A. & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 3836–3847.
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M. & Aberman, K. (2023). Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22500–22510.
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y. & Cohen-Or, D. (2022). Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626.
Wu, W., Zhao, Y., Shou, M. Z., Zhou, H. & Shen, C. (2023). Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models. arXiv preprint arXiv:2303.11681.
Wu, W., Zhao, Y., Chen, H., Gu, Y., Zhao, R., He, Y., Zhou, H., Shou, M. Z. & Shen, C. (2024). Datasetdm: Synthesizing data with perception annotations using diffusion models. Advances in Neural Information Processing Systems, 36.
Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., & Norouzi, M. (2022). Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4713–4726.
Feng, Z., Zhang, Z., Yu, X., Fang, Y., Li, L., Chen, X., Lu, Y., Liu, J., Yin, W., Feng, S. et al. (2023). Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10135–10145.
Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B. et al. (2022). Ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V. & Zettlemoyer, L. (2019). Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F. et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D. & Hajishirzi, H. (2022). Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
Lester, B., Al-Rfou, R. & Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. & Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Gani, H., Bhat, S. F., Naseer, M., Khan, S. & Wonka, P. (2023). Llm blueprint: Enabling text-to-image generation with complex and detailed prompts. arXiv preprint arXiv:2310.10640.
Lian, L., Li, B., Yala, A. & Darrell, T. (2023). Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655.
Feng, W., Zhu, W., Fu, T.-j., Jampani, V., Akula, A., He, X., Basu, S., Wang, X. E. & Wang, W. Y. (2024). Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36.
Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C. & Lee, Y. J. (2023). Gligen: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22511.
Ronneberger, O., Fischer, P. & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, pp 234–241.
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J. & Rombach, R. (2023). Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
Zhu, D., Chen, J., Shen, X., Li, X. & Elhoseiny, M. (2023). Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
Liu, H., Li, C., Wu, Q. & Lee, Y. J. (2023). Visual instruction tuning. arXiv preprint arXiv:2304.08485.
Laion. (2022). LAION-Aesthetics. Blog. https://laion.ai/blog/laion-aesthetics/.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y. et al. (2023). Segment anything. arXiv preprint arXiv:2304.02643.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, pp 740–755.
Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S. & Jia, J. (2023). Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692.
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I. & Chen, M. (2021). Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741.
Kang, M., Zhu, J.-Y., Zhang, R., Park, J., Shechtman, E., Paris, S. & Park, T. (2023). Scaling up gans for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10124.
Yang, B., Gu, S., Zhang, B., Zhang, T., Chen, X., Sun, X., Chen, D. & Wen, F. (2023). Paint by example: Exemplar-based image editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 18381–18391.
Liu, H., Li, C., Li, Y. & Lee, Y. J. (2024). Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 26296.
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35, 5775–5787.
Song, Y., Dhariwal, P., Chen, M. & Sutskever, I. (2023). Consistency models. In: International Conference on Machine Learning. PMLR.
Acknowledgements
This research is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (Award No: MOE-T2EP20124-0012).
Additional information
Communicated by Jianfei Cai.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
1.1 More Visualizations
Figures 18 and 15 provide more visualizations of ParaDiffusion for the human-centric and scenery-centric domains, respectively. The visualizations show that ParaDiffusion can generate intricate and realistic composite images of individuals as well as picturesque scenes. Moreover, it is noteworthy that images generated through paragraph-to-image generation exhibit a compelling narrative quality, enriched with profound semantics. The generated images consistently feature intricate object details and demonstrate effective multi-object control.
1.2 Limitations
Despite ParaDiffusion's strong performance in long-text alignment and visual appeal, there is still room for improvement, for instance in inference speed. ParaDiffusion has not been optimized for speed; adopting effective strategies such as fast ODE solvers (Lu et al., 2022) or consistency models (Song et al., 2023) could further reduce inference time. In addition, while ParaDiffusion is capable of producing highly realistic images, undesirable instances persist, as depicted in Fig. 16. Two strategies are effective in addressing these challenges. First, at the data level, augmenting the dataset with additional high-quality images increases diversity and supports further model refinement. Second, at the algorithmic level, incorporating additional constraints, such as geometric and semantic constraints, makes synthesized images more logically and semantically coherent.
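For concreteness, the following sketch shows the kind of sampler swap suggested above, using the diffusers library: replacing a pipeline's default scheduler with the multistep DPM-Solver (Lu et al., 2022) so that roughly 20 denoising steps suffice. The checkpoint identifier is a generic Stable Diffusion stand-in; the ParaDiffusion weights themselves are not assumed to be available through this API.

```python
# Sketch: swap a pipeline's default sampler for the multistep DPM-Solver so that
# ~20 denoising steps suffice. The checkpoint id is a generic stand-in, not ParaDiffusion.
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# DPM-Solver++ (Lu et al., 2022) in place of the default scheduler.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a sunlit mountain lake at dawn, photographic style",
             num_inference_steps=20).images[0]
image.save("sample.png")
```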
Risk of Conflict between Visual Appeal and Text Faithfulness. Two insights emerge: (1) merely extending the token count of current models (SD XL, DeepFloyd-IF) does not yield satisfactory performance; (2) as the number of input tokens increases, all models suffer a certain degree of decline in visual appeal.
1.3 Experiments on 1,600 Prompts of PartiPrompts
We also provide the corresponding results on PartiPrompts-1600, as shown in Fig. 17. Our model again achieves outstanding performance in Text Faithfulness, with a \(27.3\%\) human voting rate, significantly outperforming previous models such as SD XL and DeepFloyd-IF. It also demonstrates a competitive advantage in visual appeal, surpassing SD XL and DeepFloyd-IF and approaching the contemporaneous PIXART-\(\alpha \). A notable observation on PartiPrompts-1600 is the high proportion of 'Tie' votes in the human evaluation, especially for Text Faithfulness, with a voting rate of up to \(34\%\). This is attributed to the many simple or abstract prompts in PartiPrompts-1600, such as 'happiness' or 'emotion', which make it difficult to cast a decisive vote.
1.4 Visualization Comparison on ViLG-300 and ParaPrompts-400
To offer a more intuitive comparison, we provide visualizations comparing our model with prior works on the ViLG-300 and ParaPrompts-400 datasets, as depicted in Figs. 19 and 20. In terms of visual appeal, the images synthesized by ParaDiffusion align well with human aesthetics, exhibiting qualities reminiscent of photographs in lighting, contrast, scenes, and photographic composition. Concerning the alignment of long-form text, ParaDiffusion shows clear advantages, as illustrated in Fig. 20. Previous works often struggle to precisely align every object and attribute in lengthy textual descriptions; as seen in the second row of Fig. 20, existing models frequently fail to generate elements such as 'towers' and 'houses', and the relative spatial relationships they do produce are flawed. In contrast, our model accurately aligns the textual description with the content of the image.
1.5 More Details for ParaImage-Small
As stated in the main text, we selected 3,000 exquisite images from a pool of 650,000 images curated by LAION-Aesthetics (Laion, 2022), adhering to common photographic principles. The detailed aesthetic image selection rules are as follows:
- The selected images will be used to annotate long-form descriptions (128-512 words, 4-10 sentences). Please assess whether the chosen image contains sufficient information (number and attributes of objects, image style) to support such a lengthy textual description.
- The images should not include trademarks or any text added in post-production, and should be free of any mosaic effects.
- Spatial Relationships between Multiple Objects: For images with multiple objects, there should be sufficient spatial hierarchy or positional relationships between these objects. For example, in a landscape photograph, the spatial distribution of mountains, lakes, and trees should create an interesting composition, and there should be clear left-right relationships between multiple people.
- Interaction between Multiple Objects: For images with multiple objects, choose scenes that showcase the interaction between the objects. This can include dialogue between characters, interactions between animals, or other interesting associations between objects.
- Attributes of a Single Object: All key details of the main subject should be clearly visible, and the subject's attribute information should cover at least three distinct aspects, including color, shape, and size. For example, in wildlife photography, the feather color, morphology, and size of an animal should be clearly visible.
- Colors, Shapes, and Sizes of Objects: The various objects in the image should showcase diversity in colors, shapes, and sizes. This contributes to creating visually engaging scenes.
- Clarity of the Story: The selected images should clearly convey a story or emotion. Annotators should pay attention to whether the image presents a clear and engaging narrative, for example a couple walking on the street, a family portrait, or animals engaging in a conflict.
- Variety of Object Categories: A diverse set of object categories enriches the content of the image. Ensure that the image encompasses various object categories to showcase diversity, for instance a city street that simultaneously includes people, cyclists, buses, and unique architectural structures.
Following the aforementioned rules, we instructed the annotators to rate the images on a scale of 1–5, with 5 being the highest score. Subsequently, we selected images with a score of 5 as the data source for ParaImage-Small, resulting in approximately 3k images.
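A minimal, hypothetical sketch of this final filtering step is given below; the annotation file layout and field names are assumptions made purely for illustration.

```python
# Hypothetical sketch of the selection step: keep only images whose annotator
# rating (1-5, judged against the criteria above) reaches 5. The JSON layout and
# field names are assumptions for illustration.
import json

def select_top_rated(annotation_file: str, min_score: int = 5) -> list[str]:
    """Return image IDs whose aesthetic rating is at least `min_score`."""
    with open(annotation_file, "r", encoding="utf-8") as f:
        records = json.load(f)  # e.g. [{"image_id": "laion_000123", "score": 4}, ...]
    return [r["image_id"] for r in records if r["score"] >= min_score]

selected = select_top_rated("paraimage_annotations.json")
print(f"{len(selected)} images kept for ParaImage-Small")  # roughly 3k expected
```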
1.6 Risk of Conflict between Visual Appeal and Text Faithfulness
We also explored the potential of existing architectures (e.g., SD XL, DeepFloyd-IF) for long-text alignment in text-to-image generation, as shown in Fig. 21. First, all methods that use CLIP as the text encoder, such as SD XL, cannot support paragraph-image tasks because CLIP accepts at most 77 input tokens. Second, we investigated methods that use T5 XXL as the text encoder, e.g., DeepFloyd-IF (Deepfloyd, 2023) and PIXART-\(\alpha \) (Chen et al., 2023). We directly raised the token limit of these methods to accommodate longer inputs, enabling image generation settings that involve aligning long-form text. As the number of tokens increases, the visual appeal of DeepFloyd-IF declines significantly, becoming noticeably cartoonish at around 512 tokens; its semantic alignment is also unsatisfactory, with many described objects missing, such as the table. Similarly, PIXART-\(\alpha \) fails to achieve satisfactory semantic alignment even with the maximum token limit increased, and its visual appeal likewise degrades to a certain extent. In contrast, our ParaDiffusion is more stable, achieving good semantic alignment with 256 tokens and showing minimal decline in visual appeal as the token count increases.
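The token-limit contrast described above can be made concrete with the short sketch below, which compares a CLIP tokenizer (capped at 77 tokens, as in one of SD XL's text encoders) with a T5 tokenizer/encoder asked to handle up to 512 tokens (as in the T5 XXL family used by DeepFloyd-IF). The checkpoint names are the publicly released encoders for these model families; the exact experimental configuration of Fig. 21 is not reproduced here.

```python
# Sketch: CLIP tokenizers cap prompts at 77 tokens, whereas a T5 encoder can be
# asked to tokenize up to 512 tokens. Checkpoints shown are the public encoders
# for these model families; loading the T5 XXL encoder requires substantial memory.
from transformers import CLIPTokenizer, T5EncoderModel, T5Tokenizer

paragraph = " ".join(["a detailed clause of the scene description"] * 80)  # long prompt

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_ids = clip_tok(paragraph, truncation=True, return_tensors="pt").input_ids
print(clip_ids.shape)  # torch.Size([1, 77]); everything past 77 tokens is discarded

t5_tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
inputs = t5_tok(paragraph, max_length=512, truncation=True, return_tensors="pt")
states = t5_enc(**inputs).last_hidden_state  # (1, <=512, d_model) conditioning sequence
```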
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, W., Li, Z., He, Y. et al. Paragraph-to-Image Generation with Information-Enriched Diffusion Model. Int J Comput Vis 133, 5413–5434 (2025). https://doi.org/10.1007/s11263-025-02435-1