CN116168108A - Method and device for generating image from text, storage medium and electronic device

Info

Publication number: CN116168108A
Application number: CN202310266304.1A
Authority: CN
Inventors: 马建
Original assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Current assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date: 2023-03-17
Filing date: 2023-03-17
Publication date: 2023-05-26

Abstract

The present disclosure relates to the technical field of image data processing or generation, and in particular to a method and device for generating an image from text, a computer-readable storage medium, and an electronic device. The method includes: encoding natural language text that describes an image to obtain text encoding data; fusing the text encoding data with first Gaussian noise of a first preset time step to obtain intermediate text data; sequentially processing the intermediate text data corresponding to each time step by using encoded-image generation models of different time steps to obtain image encoding data; and decoding the image encoding data to obtain a target image, wherein the set of time steps of the plurality of encoded-image generation models completely covers the first preset time step. The technical solution of the embodiments of the present disclosure improves the accuracy of images generated from text and how well they match the semantic information of the text.

Description

Method and device for generating image from text, storage medium and electronic device

Technical field

The present disclosure relates to the technical field of image data processing or generation, and in particular to a method and device for generating an image from text, a computer-readable storage medium, and an electronic device.

Background

Text-based image generation has broad application prospects in many scenarios, including personalized wallpaper creation for mobile phone theme stores, sourcing creative image material for slideshows, content creation in virtual spaces, and multimodal dialogue interaction systems.

However, methods for generating images from text in the related art have low accuracy, and the generated images match the semantic information of the text poorly.

It should be noted that the information disclosed in the Background section above is only intended to aid understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.

Summary of the invention

The purpose of the present disclosure is to provide a method for generating an image from text, a device for generating an image from text, a computer-readable medium, and an electronic device, thereby improving, at least to some extent, the accuracy of images generated from text and how well they match the semantic information of the text.

According to a first aspect of the present disclosure, a method for generating an image from text is provided, including: encoding natural language text describing an image to obtain text encoding data; fusing the text encoding data with first Gaussian noise of a first preset time step to obtain intermediate text data; sequentially processing the intermediate text data corresponding to each time step by using encoded-image generation models of different time steps to obtain image encoding data; and decoding the image encoding data to obtain a target image; wherein the set of time steps of the plurality of encoded-image generation models completely covers the first preset time step.

According to a second aspect of the present disclosure, a device for generating an image from text is provided, including: an encoding module, configured to encode natural language text describing an image to obtain text encoding data; a fusion module, configured to fuse the text encoding data with first Gaussian noise of a first preset time step to obtain intermediate text data; a processing module, configured to sequentially process the intermediate text data corresponding to each time step by using encoded-image generation models of different time steps to obtain image encoding data; and a decoding module, configured to decode the image encoding data to obtain a target image; wherein the set of time steps of the plurality of encoded-image generation models completely covers the first preset time step.

According to a third aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, and the computer program implements the above method when executed by a processor.

According to a fourth aspect of the present disclosure, an electronic device is provided, characterized in that it includes: one or more processors; and a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the above method.

In the method for generating an image from text provided by an embodiment of the present disclosure, natural language text describing an image is encoded to obtain text encoding data; the text encoding data is fused with first Gaussian noise of a first preset time step to obtain intermediate text data; the intermediate text data corresponding to each time step is processed in turn by encoded-image generation models of different time steps to obtain image encoding data; and the image encoding data is decoded to obtain a target image. Compared with the prior art, processing the intermediate text data with a separate encoded-image generation model for each range of time steps increases model capacity and improves the quality of the generated pictures.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its principles. Obviously, the drawings described below show only some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:

FIG. 1 shows a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure can be applied;

FIG. 2 schematically shows a flowchart of a method for generating an image from text in an exemplary embodiment of the present disclosure;

FIG. 3 schematically shows a structural diagram of acquiring natural language text in an exemplary embodiment of the present disclosure;

FIG. 4 schematically shows a flowchart of acquiring image encoding data in an exemplary embodiment of the present disclosure;

FIG. 5 schematically shows a flowchart of acquiring an encoded-image generation model in an exemplary embodiment of the present disclosure;

FIG. 6 schematically shows a data flow diagram of training an encoded-image generation model in an exemplary embodiment of the present disclosure;

FIG. 7 schematically shows a flowchart of acquiring a target image in an exemplary embodiment of the present disclosure;

FIG. 8 schematically shows a flowchart of acquiring an intermediate image in an exemplary embodiment of the present disclosure;

FIG. 9 schematically shows a data flow diagram of training an image decoding model in an exemplary embodiment of the present disclosure;

FIG. 10 schematically shows another flowchart of acquiring an intermediate image in an exemplary embodiment of the present disclosure;

FIG. 11 schematically shows another flowchart of acquiring a target image in an exemplary embodiment of the present disclosure;

FIG. 12 schematically shows yet another flowchart of acquiring a target image in an exemplary embodiment of the present disclosure;

FIG. 13 schematically shows a data flow diagram of a method for generating an image from text in an exemplary embodiment of the present disclosure;

FIG. 14 schematically shows a data flow diagram of generating an image from text on a web client in an exemplary embodiment of the present disclosure;

FIG. 15 schematically shows the effect of generating preview images on a web client in an exemplary embodiment of the present disclosure;

FIG. 16 schematically shows the composition of a device for generating an image from text in an exemplary embodiment of the present disclosure;

FIG. 17 shows a schematic diagram of an electronic device to which embodiments of the present disclosure can be applied.

Detailed description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, so repeated descriptions of them are omitted. Some of the block diagrams shown in the drawings are functional entities that do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

In the related art, text-based image generation has broad application prospects in many scenarios, including personalized wallpaper creation for mobile phone theme stores, sourcing creative image material for PPT slides, content creation in virtual spaces, and multimodal dialogue interaction systems. The main challenge of such open-domain text-to-image generation is to produce, under the guidance of the text, high-quality images that carry the semantic information of the text. Current technical solutions are mainly based on GAN models, autoregressive models, and diffusion models.

A GAN assumes that a variable Z follows some common distribution (such as a normal or uniform distribution), trains a model X = g(Z), and learns a discriminator so that its overall judgments of real pictures and generated pictures tend to agree. It does not explicitly model the probability density function; instead, the generator and the discriminator are trained against each other. Its main drawbacks are unstable training and poor diversity.

An autoregressive model factorizes the joint distribution with the chain rule and can be represented as a Bayesian network. In an autoregressive generative model, the current node depends on the previously generated results. Such models were first applied to text generation in the NLP field, speech generation, and so on. An image is itself discrete: it consists of a finite number of pixels, and each pixel takes discrete, finite values, so it can also be described by a discrete distribution. The problem with autoregressive models is that generation is too slow and the decoding space is too large, which makes it difficult to generate high-resolution images.

In view of the above shortcomings, the present disclosure provides a method for generating an image from text. FIG. 1 shows a schematic diagram of a system architecture that can implement the method. The system architecture 100 may include a terminal 110 and a server 120. The terminal 110 may be a terminal device such as a smartphone, a tablet computer, a desktop computer, or a notebook computer. The server 120 refers generally to a back-end system that provides the services related to generating images from text in this exemplary embodiment, and may be a single server or a cluster of multiple servers. The terminal 110 and the server 120 may be connected through a wired or wireless communication link for data exchange.

In one implementation, the method for generating an image from text may be executed by the terminal 110. For example, the user obtains natural language text with the terminal 110, and the terminal 110 generates a target image based on the natural language text and outputs the target image.

In one implementation, the method for generating an image from text may be executed by the server 120. For example, after the user obtains natural language text with the terminal 110, the terminal 110 uploads the natural language text to the server 120, and the server 120 generates a target image based on the natural language text and returns the target image to the terminal 110.

As can be seen from the above, the method for generating an image from text in this exemplary embodiment may be executed by either the terminal 110 or the server 120, which is not limited in the present disclosure.

The method for generating an image from text in this exemplary embodiment is described below with reference to FIG. 2. FIG. 2 shows an exemplary flow of the method, which may include:

Step S210, encoding natural language text describing an image to obtain text encoding data;

Step S220, fusing the text encoding data with first Gaussian noise of a first preset time step to obtain intermediate text data;

Step S230, sequentially processing the intermediate text data corresponding to each time step by using encoded-image generation models of different time steps to obtain image encoding data;

Step S240, decoding the image encoding data to obtain a target image;

wherein the set of time steps of the plurality of encoded-image generation models completely covers the first preset time step.

With the above method, processing the intermediate text data with a separate encoded-image generation model for each range of time steps increases model capacity and improves the quality of the generated pictures.
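The flow of steps S210 to S240 can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `generate_image`, `encode_text`, `fuse`, `stage_models`, and `decode_image` are hypothetical, the components are passed in by the caller rather than taken from the disclosure, and the data is modeled as a plain list of floats.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def generate_image(
    text: str,
    encode_text: Callable[[str], Vector],
    fuse: Callable[[Vector, Vector, int], Vector],
    stage_models: Sequence[Callable[[Vector], Vector]],
    decode_image: Callable[[Vector], Vector],
    total_steps: int = 1000,
    seed: int = 0,
) -> Vector:
    """Run steps S210 to S240 with caller-supplied (hypothetical) components."""
    rng = random.Random(seed)

    # S210: encode the natural language text.
    text_code = encode_text(text)

    # S220: fuse the text encoding with Gaussian noise for the preset time step.
    noise = [rng.gauss(0.0, 1.0) for _ in text_code]
    data = fuse(text_code, noise, total_steps)

    # S230: apply the stage models in the order given (stage N first, stage 1 last).
    for model in stage_models:
        data = model(data)

    # S240: decode the image encoding data into the target image.
    return decode_image(data)
```

Passing the components in as callables keeps the sketch independent of any particular encoder or decoder, which the disclosure leaves open.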

Each step in FIG. 2 is described in detail below.

Referring to FIG. 2, in step S210, natural language text describing an image is encoded to obtain text encoding data.

In an example embodiment of the present disclosure, the method for generating an image from text may be applied on a web client. Before the natural language text describing the image is encoded to obtain the text encoding data, the natural language text describing the image may first be acquired.

In an example embodiment, the natural language text describing the image may include a picture subject, detail words, and modifiers. Detail words can be combined freely, and modifiers can specify one style or several styles; the basic principle is simply that the text follows normal Chinese grammar. Referring to FIG. 3, an example is "a very detailed digital painting depicting a person standing in front of a portal, in a mysterious forest of many fantasy trees". The demo client also has some text descriptions built in. The function area offers a style selection, including cartoon, anime, Chinese style, and so on, which can also be customized according to user needs, and a selection of personalized tags, which include some detail words and modifiers. Finally, the adjustable settings include the picture resolution, the number of pictures to generate, the level of picture detail, and the number of generation steps. The specific details of the natural language text describing the image can also be customized according to user needs and are not specifically limited in this example embodiment.

In this example embodiment, the natural language text describing the image may be acquired through user input and selection. Specifically, referring to FIG. 3, the user may enter text and then select the style, personalized tags, and so on of the image to be generated, where the style and the personalized tags may be optional.
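As an illustration of how the user's input and selections might be combined into one description, the sketch below joins a required subject with optional detail words, a style, and tags. The `PromptRequest` type, the `build_prompt` function, and the comma-joined output format are assumptions made for this example; the disclosure does not specify how the pieces are assembled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PromptRequest:
    """Hypothetical container for what the user types and selects."""
    subject: str                                      # picture subject (required)
    detail_words: List[str] = field(default_factory=list)
    style: Optional[str] = None                       # optional style selection
    tags: List[str] = field(default_factory=list)     # optional personalized tags


def build_prompt(req: PromptRequest) -> str:
    """Join the subject with any detail words, style, and tags, in that order."""
    subject = req.subject.strip()
    if not subject:
        raise ValueError("the picture subject is required")
    parts = [subject]
    parts.extend(w.strip() for w in req.detail_words if w.strip())
    if req.style and req.style.strip():
        parts.append(f"{req.style.strip()} style")
    parts.extend(t.strip() for t in req.tags if t.strip())
    return ", ".join(parts)
```

Because the style and the tags default to empty, a request with only a subject is still valid, which matches the statement that both are optional.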

In an example embodiment of the present disclosure, after the natural language text describing the image is obtained, it may be encoded to obtain text encoding data. Specifically, the XLM-Roberta-Large-Vit-B-16Plus encoder may be used to encode the natural language text, or another encoder may be used, which is not specifically limited in this example embodiment.

In step S220, the text encoding data is fused with first Gaussian noise of a first preset time step to obtain intermediate text data.

In an example embodiment of the present disclosure, after the text encoding data is obtained, it may be fused with the first Gaussian noise. The fusion process may involve a preset time step, where the first preset time step may be 1000 steps or 2000 steps, or may be customized according to user needs, which is not specifically limited in this example embodiment. The fusion yields the intermediate text data.
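The disclosure does not say how the fusion is computed. One possible reading, sketched below purely as an assumption, is to build a short token sequence holding the text encoding, a sinusoidal encoding of the time step, and a Gaussian noise vector, which a Transformer-style model could then consume. The function names `timestep_embedding` and `fuse` and the three-token layout are hypothetical.

```python
import math
import random
from typing import List


def timestep_embedding(t: int, dim: int) -> List[float]:
    """Sinusoidal encoding of an integer time step; dim must be even."""
    if dim % 2:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def fuse(text_code: List[float], t: int, rng: random.Random) -> List[List[float]]:
    """Return three tokens: [text encoding, time-step encoding, Gaussian noise]."""
    dim = len(text_code)
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [list(text_code), timestep_embedding(t, dim), noise]
```

For example, `fuse([0.1] * 8, 1000, random.Random(0))` returns three vectors of length 8 for a first preset time step of 1000.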

In step S230, the intermediate text data corresponding to each time step is processed in turn by encoded-image generation models of different time steps to obtain image encoding data.

In this example embodiment, referring to FIG. 4, the above step may include steps S410 to S430.

In step S410, the intermediate text data of the Nth stage is processed by the encoded-image generation model corresponding to the Nth stage to obtain the image encoding data of the Nth stage.

In this example embodiment, referring to FIG. 5, the method may further include acquiring the encoded-image generation model, which may specifically include steps S510 to S530.

In step S510, a first initial model and first training data are acquired, where the first training data includes a plurality of pieces of reference text encoding data and the ground-truth image encoding data corresponding to the reference text encoding data.

In an example embodiment of the present disclosure, a first initial model and training data may first be acquired, where the training data may include a plurality of pieces of reference text encoding data and the ground-truth image encoding data corresponding to the reference text encoding data.

The reference text encoding data is obtained by encoding reference natural language text. The reference natural language text may come from the Chinese portion of the open-source LAION dataset, the Wukong dataset, LAION data translated into Chinese, collected and crawled data, and so on, and may also be customized according to user needs, which is not specifically limited in this example embodiment.

Before the reference natural language text is encoded, it may also be cleaned. Specifically, descriptions that contain no physical-entity words are filtered out, and the text content is cleaned, including removing special punctuation and converting traditional Chinese characters to simplified ones, which is not specifically limited in this example embodiment.

The ground-truth image encoding data may be obtained by encoding ground-truth images. Before the ground-truth images are encoded, images with low resolution or low aesthetic scores may be removed to improve the quality of the training data.
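The two cleaning passes described above, one on captions and one on images, can be sketched as a single filter. The `Sample` type, the function names, the punctuation whitelist, and the thresholds `min_side=256` and `min_score=5.0` are assumptions chosen for illustration; the disclosure gives no concrete values. Traditional-to-simplified conversion is left out because it needs a conversion table or an external library.

```python
import re
from typing import Iterable, List, NamedTuple, Set


class Sample(NamedTuple):
    """Hypothetical training record: a caption plus image metadata."""
    caption: str
    width: int
    height: int
    aesthetic_score: float


# Keep word characters (including CJK), whitespace, and basic punctuation.
_SPECIAL = re.compile(r"[^\w\s，。,.]")


def clean_caption(caption: str) -> str:
    """Replace special symbols with spaces, then collapse repeated whitespace."""
    text = _SPECIAL.sub(" ", caption)
    return re.sub(r"\s+", " ", text).strip()


def filter_samples(
    samples: Iterable[Sample],
    entity_words: Set[str],
    min_side: int = 256,
    min_score: float = 5.0,
) -> List[Sample]:
    """Drop captions without entity words and images that are small or score low."""
    kept = []
    for s in samples:
        caption = clean_caption(s.caption)
        if not any(word in caption for word in entity_words):
            continue  # no physical-entity word in the description
        if min(s.width, s.height) < min_side or s.aesthetic_score < min_score:
            continue  # low resolution or low aesthetic score
        kept.append(s._replace(caption=caption))
    return kept
```

The entity-word check here is a plain substring match against a caller-supplied vocabulary; a real pipeline would more likely use a tagger or a named-entity model.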

In step S520, the reference text encoding data is input into the first initial model to obtain reference image encoding data.

In an example embodiment of the present disclosure, after the reference text encoding data is obtained, it may be input into the first initial model. The first initial model may consist of the Transformer Encoder part, and random Gaussian noise, the reference text encoding data, and an encoding of the time step may be used as conditions to control the generation of the reference image encoding data.

In step S530, the first initial model is trained based on the reference image encoding data and the ground-truth image encoding data to obtain the encoded-image generation model.

In an example embodiment of the present disclosure, after the reference image encoding data and the ground-truth image encoding data are obtained, the first initial model may be trained based on them to obtain the encoded-image generation model.

Specifically, referring to FIG. 6, the reference natural language text is input and encoded to obtain the reference text encoding data; the reference text encoding data is input into the first initial model to obtain the reference image encoding data; the ground-truth image is encoded to obtain the ground-truth image encoding data; and the first initial model is updated using the ground-truth image encoding data and the reference image encoding data to obtain the encoded-image generation model.
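The update in FIG. 6 can be illustrated with a deliberately tiny stand-in. In the sketch below a linear map replaces the Transformer Encoder, the loss is assumed to be mean squared error between the predicted and ground-truth image encodings, and the update is plain gradient descent. None of these choices is stated in the disclosure, and the noise and time-step conditioning are omitted to keep the example short.

```python
from typing import List

Matrix = List[List[float]]


def predict(weights: Matrix, text_code: List[float]) -> List[float]:
    """Stand-in model: a linear map from text encoding to image encoding."""
    return [sum(w * x for w, x in zip(row, text_code)) for row in weights]


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def train_step(
    weights: Matrix,
    text_code: List[float],
    true_image_code: List[float],
    lr: float = 0.1,
) -> float:
    """One gradient-descent update in place; returns the loss before the update."""
    pred = predict(weights, text_code)
    loss = mse(pred, true_image_code)
    scale = 2.0 / len(true_image_code)  # derivative of the mean of squared errors
    for j, row in enumerate(weights):
        err = pred[j] - true_image_code[j]
        for i, x in enumerate(text_code):
            row[i] -= lr * scale * err * x
    return loss
```

Calling `train_step` repeatedly on the same pair drives the predicted image encoding toward the ground-truth one, which is the role the ground-truth image encoding data plays in the update.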

In this example embodiment, the first preset time step may be divided into N stages, in which case the method may use N encoded-image generation models arranged in series. Each encoded-image generation model processes the data of one stage, which makes each model more targeted and improves the quality of the resulting image.

In this example embodiment, the text encoding data is fused with the first Gaussian noise of the first preset time step to obtain the intermediate text data, and the intermediate text data of the Nth stage is then processed by the encoded-image generation model corresponding to the Nth stage to obtain the image encoding data of the Nth stage.

In step S420, the image encoding data of the Nth stage is processed by the encoded-image generation model corresponding to the (N-1)th stage to obtain the image encoding data of the (N-1)th stage.

In step S430, the image encoding data of the first stage is output.

In this example embodiment, the image encoding data of the Nth stage is processed by the encoded-image generation model corresponding to the (N-1)th stage to obtain the image encoding data of the (N-1)th stage; that is, the input of the (N-1)th-stage encoded-image generation model is the output of the Nth-stage encoded-image generation model. This continues stage by stage down to the first stage, and the image encoding data of the first stage is output. It should be noted that N is a positive integer greater than 1.
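The chaining of stages N, N-1, ..., 1 described in steps S410 to S430 amounts to feeding each stage's output into the next lower stage. A minimal sketch follows, with the hypothetical name `run_cascade` and with each stage model treated as an opaque callable.

```python
from typing import Callable, Dict, List

Vector = List[float]
StageModel = Callable[[Vector], Vector]


def run_cascade(stage_models: Dict[int, StageModel], intermediate_text_data: Vector) -> Vector:
    """Apply stage N first and stage 1 last; return the first-stage output."""
    n = len(stage_models)
    if n < 2 or sorted(stage_models) != list(range(1, n + 1)):
        raise ValueError("stages must be numbered 1..N with N > 1")
    data = intermediate_text_data
    for stage in range(n, 0, -1):
        # The output of stage k+1 becomes the input of stage k.
        data = stage_models[stage](data)
    return data
```

Keying the models by stage number makes the processing order explicit and lets the function reject a cascade with a missing stage.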

It should be noted that the set of time steps of the plurality of encoded-image generation models completely covers the first preset time step. For example, if the first preset time step is 1000 steps, the time steps of the encoded-image generation models together span at least 1000 steps. The N stages may have the same or different numbers of steps, which can be customized according to user needs and is not specifically limited in this example embodiment.
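The coverage condition can be checked mechanically. The sketch below assumes that each stage owns an inclusive range of step indices numbered from 1; `split_steps` produces one near-equal split, and `covers` tests whether a given set of ranges, equal or not, covers every step. Both function names and the 1-based indexing are assumptions for this example.

```python
from typing import List, Tuple

Stage = Tuple[int, int]  # inclusive (first_step, last_step)


def split_steps(total_steps: int, n_stages: int) -> List[Stage]:
    """Split steps 1..total_steps into n_stages contiguous, near-equal ranges."""
    if not 1 <= n_stages <= total_steps:
        raise ValueError("need 1 <= n_stages <= total_steps")
    base, extra = divmod(total_steps, n_stages)
    stages, start = [], 1
    for k in range(n_stages):
        size = base + (1 if k < extra else 0)  # the first `extra` stages get one more step
        stages.append((start, start + size - 1))
        start += size
    return stages


def covers(stages: List[Stage], total_steps: int) -> bool:
    """True if every step in 1..total_steps falls inside at least one stage."""
    covered = set()
    for first, last in stages:
        covered.update(range(first, last + 1))
    return set(range(1, total_steps + 1)) <= covered
```

With 1000 steps and 4 stages, `split_steps(1000, 4)` gives ranges of 250 steps each, while a configuration such as `[(1, 400), (600, 1000)]` fails the `covers` check because steps 401 to 599 belong to no model.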

在步骤S240中，对所述图像编码数据进行解码得到目标图像。In step S240, the encoded image data is decoded to obtain a target image.

在本公开的一种示例实施方式中，参照图7所示，上述步骤可以包括步骤S710至步骤S720。In an example implementation of the present disclosure, as shown in FIG. 7 , the above steps may include step S710 to step S720.

在步骤S710中，利用图像解码模型对所述图像编码数据进行解码得到中间图像。In step S710, the encoded image data is decoded using an image decoding model to obtain an intermediate image.

在本示例实施方式中，参照图8所示，上述步骤可以包括步骤S820至步骤S820。In this example implementation, as shown in FIG. 8 , the above steps may include step S820 to step S820.

在步骤S810中，将所述图像编码数据与第二预设时间步长的第二高斯噪声进行融合得到待解码图像。In step S810, the coded image data is fused with the second Gaussian noise of the second preset time step to obtain an image to be decoded.

在本公开的一种示例实施方式中，在得到上述图像编码数据之后，可以将其与第二高斯噪声融合，融合过程可以包括预设时间步长，其中第二预设时间步长可以包括1000步，也可以是2000步，还可以根据用户需求进行自定义，在本示例实施方式中不做具体限定。融合得到上述待解码图像。In an example implementation of the present disclosure, after obtaining the above image coding data, it can be fused with the second Gaussian noise, and the fusion process can include a preset time step, wherein the second preset time step can include 1000 steps, may also be 2000 steps, and may also be customized according to user requirements, which is not specifically limited in this example implementation manner. Fusion to obtain the above image to be decoded.

在步骤S820中，利用不同时间步长的图像解码模型依次对所述时间步长对应的所述待解码图像进行解码得到中间图像。In step S820, the images to be decoded corresponding to the time steps are sequentially decoded using image decoding models of different time steps to obtain an intermediate image.

在本示例实施方式中，参照图9所示，就可以首先获取第二初始模型和第二训练数据，其中，第二训练数据包括多个参考待解码图像以及参考待解码图像对应的真值中间图像。然后将参考待解码图像输入至第二初始模型得到参考中间图像，最后基于参考中间图像和真值中间图像训练第二初始模型更新上述第二初始模型得到图像解码模型。In this example embodiment, as shown in FIG. 9 , the second initial model and the second training data can be obtained first, wherein the second training data includes a plurality of reference images to be decoded and the true intermediate values corresponding to the reference images to be decoded image. Then input the reference image to be decoded into the second initial model to obtain a reference intermediate image, and finally train the second initial model based on the reference intermediate image and the true intermediate image to update the second initial model to obtain an image decoding model.

For the specific training process, reference may be made to the training of the coded image generation model described above, which is not repeated in this example embodiment.
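The training pattern described above (feed reference inputs to an initial model, compare its outputs with the ground truth, update the model) can be illustrated with a deliberately tiny stand-in. The sketch below assumes a one-parameter model, a mean squared error loss, and plain gradient descent; the source specifies none of these, and `train_scalar_model` is a hypothetical name.

```python
def train_scalar_model(pairs, weight=0.0, lr=0.05, epochs=200):
    """Fit output = weight * input by gradient descent on mean squared error.

    pairs holds (reference_input, ground_truth_output) tuples. The scalar
    weight stands in for the parameters of the "initial model".
    """
    for _ in range(epochs):
        # Gradient of the mean of (weight * x - y) ** 2 with respect to weight.
        grad = sum(2.0 * (weight * x - y) * x for x, y in pairs) / len(pairs)
        weight -= lr * grad
    return weight


# Toy data whose ground truth is exactly twice the input.
trained_weight = train_scalar_model([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

A real image decoding model would replace the scalar weight with a neural network and the toy pairs with reference images to be decoded and their ground-truth intermediate images.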

In this example embodiment, the input of the image decoding model may include random third Gaussian noise, with the image encoding data and the training time step as conditions. Further, the intermediate image may be generated using a conditional-generation strategy for diffusion models in which the text encoding data is randomly set to zero for a certain proportion of the time.
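The random zeroing of the text encoding data can be sketched as below. This is an illustration only: the drop probability, the list representation of the encoding, and the helper name `maybe_drop_condition` are assumptions, since the source gives only "a certain proportion of the time".

```python
import random


def maybe_drop_condition(text_encoding, drop_prob, rng):
    """Return an all-zero encoding with probability drop_prob, else a copy of the input."""
    if rng.random() < drop_prob:
        return [0.0] * len(text_encoding)
    return list(text_encoding)
```

During training, a helper like this would be applied to each sample before it is passed to the model, so that the model sees both conditioned and unconditioned inputs.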

In this example embodiment, the second preset time step includes M stages. As shown in FIG. 10, using image decoding models of different time steps to sequentially decode the image to be decoded corresponding to each time step to obtain the intermediate image may include steps S1010 to S1030.

In step S1010, the image to be decoded at the Mth stage is decoded using the image decoding model corresponding to the Mth stage to obtain the intermediate image of the Mth stage.

In this example embodiment, the second preset time step may include M stages, in which case M image decoding models arranged in series may be provided. Each image decoding model processes the data of one stage, which makes each model more targeted and improves the quality of the resulting image.

In this example embodiment, the image to be decoded at the Mth stage is processed using the image decoding model corresponding to the Mth stage to obtain the intermediate image of the Mth stage.

In step S1020, the intermediate image of the Mth stage is decoded using the image decoding model corresponding to the (M-1)th stage to obtain the intermediate image of the (M-1)th stage.

In step S1030, the intermediate image of the first stage is output.

In this example embodiment, the intermediate image of the Mth stage is processed using the image decoding model corresponding to the (M-1)th stage to obtain the intermediate image of the (M-1)th stage; that is, the input of the (M-1)th-stage image decoding model is the output of the Mth-stage image decoding model. The intermediate image of the first stage may then be output. It should be noted that M is a positive integer greater than 1.
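The stage ordering in steps S1010 to S1030 can be sketched as a simple chain in which stage M runs first and each later stage consumes the previous output. The dictionary of callables and the name `run_stages` are hypothetical; real stage models would be trained networks rather than the toy functions used here.

```python
def run_stages(stage_models, first_input):
    """Apply stage models in the order M, M-1, ..., 1 and return the stage-1 output.

    stage_models maps a stage index (1..M) to a callable for that stage.
    """
    m = len(stage_models)
    if m < 2:
        raise ValueError("M must be a positive integer greater than 1")
    if sorted(stage_models) != list(range(1, m + 1)):
        raise ValueError("stages must be numbered 1..M")
    data = first_input
    for stage in range(m, 0, -1):
        # The output of stage k becomes the input of stage k-1.
        data = stage_models[stage](data)
    return data


# Toy stage models that only record the order in which they were applied.
toy_models = {k: (lambda trace, k=k: trace + [k]) for k in (1, 2, 3)}
order = run_stages(toy_models, [])  # order == [3, 2, 1]
```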

It should be noted that the set of time steps of the multiple image decoding models completely covers the second preset time step. For example, if the second preset time step is 1000 steps, the time steps of the image decoding models, taken together, are greater than or equal to 1000 steps. The step size of each of the M stages may be the same or different and may be customized according to user requirements; it is not specifically limited in this example embodiment.
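The coverage condition can be checked mechanically. The sketch below assumes that each stage owns a contiguous range of steps and that stage M owns the highest-numbered steps; both are assumptions for illustration, and `split_steps` and `covers` are hypothetical helper names.

```python
def split_steps(total_steps, stage_lengths):
    """Assign contiguous step ranges to stages 1..M, given a length for each stage."""
    if sum(stage_lengths) < total_steps:
        raise ValueError("stage lengths must add up to at least total_steps")
    ranges = {}
    start = 1
    for stage, length in enumerate(stage_lengths, start=1):
        end = min(start + length - 1, total_steps)  # clip any excess at the last step
        ranges[stage] = range(start, end + 1)
        start = end + 1
    return ranges


def covers(ranges, total_steps):
    """Return True if the union of the stage ranges contains every step 1..total_steps."""
    covered = set()
    for step_range in ranges.values():
        covered.update(step_range)
    return covered >= set(range(1, total_steps + 1))


stage_ranges = split_steps(1000, [250, 250, 500])  # unequal stage lengths are allowed
```

With these lengths, stage 1 owns steps 1 to 250, stage 2 owns 251 to 500, and stage 3 owns 501 to 1000, so the 1000-step preset is fully covered.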

In step S720, super-resolution processing is performed on the intermediate image at least once to obtain the target image.

In an example embodiment of the present disclosure, as shown in FIG. 11, the above step may include steps S1110 and S1120.

In step S1110, the intermediate image is fused with third Gaussian noise of a third preset time step to obtain an intermediate super-resolution image.

In an example embodiment of the present disclosure, after the intermediate image is obtained, it may be fused with the third Gaussian noise. The fusion uses a preset time step: the third preset time step may be 1000 steps or 2000 steps, or may be customized according to user requirements, and is not specifically limited in this example embodiment. The fusion yields the intermediate super-resolution image.

In step S1120, image super-resolution models of different time steps are used to sequentially perform super-resolution on the intermediate super-resolution image corresponding to each time step to obtain the target image.

In this example embodiment, a third initial model and third training data may first be obtained, where the third training data includes a plurality of reference intermediate images and the ground-truth super-resolution images corresponding to the reference intermediate images. The reference intermediate images are then input into the third initial model to obtain reference super-resolution images. Finally, the third initial model is trained and updated based on the reference super-resolution images and the ground-truth super-resolution images to obtain the image super-resolution model.

In an example embodiment of the present disclosure, as shown in FIG. 12, the third preset time step includes P stages. Using image super-resolution models of different time steps to sequentially perform super-resolution on the intermediate super-resolution image corresponding to each time step to obtain the target image may include steps S1210 to S1230.

In step S1210, super-resolution is performed on the intermediate image of the Pth stage using the image super-resolution model corresponding to the Pth stage to obtain the target image of the Pth stage.

In this example embodiment, the third preset time step may include P stages, in which case P image super-resolution models arranged in series may be provided. Each image super-resolution model processes the data of one stage, which makes each model more targeted and improves the quality of the resulting image.

In this example embodiment, the intermediate image is fused with the third Gaussian noise of the third preset time step to obtain the intermediate super-resolution image, and the image super-resolution model corresponding to the Pth stage is then used to process the intermediate image of the Pth stage to obtain the target image of the Pth stage.

In step S1220, super-resolution is performed on the target image of the Pth stage using the image super-resolution model corresponding to the (P-1)th stage to obtain the target image of the (P-1)th stage.

In step S1230, the target image of the first stage is output.

In this example embodiment, the target image of the Pth stage is processed using the image super-resolution model corresponding to the (P-1)th stage to obtain the target image of the (P-1)th stage; that is, the input of the (P-1)th-stage image super-resolution model is the output of the Pth-stage image super-resolution model. The target image of the first stage may then be output. It should be noted that P is a positive integer greater than 1.

It should be noted that the set of time steps of the multiple image super-resolution models completely covers the third preset time step. For example, if the third preset time step is 1000 steps, the time steps of the image super-resolution models, taken together, are greater than or equal to 1000 steps. The step size of each of the P stages may be the same or different and may be customized according to user requirements; it is not specifically limited in this example embodiment.

In this example embodiment, there may be one or more image super-resolution models. In an example embodiment, after the target image is obtained using the image super-resolution model, a SwinIR super-resolution model may further be used to update the target image.

The method for generating an image from text is described in detail below with reference to FIG. 13. First, the natural language text may be input into the text encoding module to obtain text encoding data, and the text encoding data is passed through the coded image generation model to obtain image encoding data. The image encoding data is then input into the image decoding model to obtain an intermediate image, and the image super-resolution model is used to perform super-resolution on the intermediate image to obtain the target image. The SwinIR super-resolution model may additionally be used to perform super-resolution on the target image so as to update the target image.
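The flow of FIG. 13 can be summarized as a chain of five stages, the last of which is optional. In the sketch below every stage is passed in as a callable and the stand-in stages only tag the data, so the example shows the order of operations and nothing about the models themselves. The function and parameter names are hypothetical.

```python
def text_to_image(text, encode_text, generate_image_code, decode_image,
                  super_resolve, refine=None):
    """Chain the stages: text -> text code -> image code -> intermediate -> target."""
    text_code = encode_text(text)
    image_code = generate_image_code(text_code)
    intermediate = decode_image(image_code)
    target = super_resolve(intermediate)
    if refine is not None:
        # Optional extra super-resolution pass (SwinIR in the described embodiment).
        target = refine(target)
    return target


def tag(name):
    """Build a stand-in stage that appends its own name to the data."""
    return lambda data: data + [name]


result = text_to_image(
    "a cat", lambda text: [text], tag("image_code"), tag("intermediate"),
    tag("super_resolved"), refine=tag("refined"))
```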

In this example embodiment, the resolution of the intermediate image may be 64*64, the resolution of the target image output by the image super-resolution model may be 256*256, and the resolution of the target image after it is updated by the SwinIR super-resolution model may be 1024*1024. The specific resolution values may be customized according to user requirements and are not specifically limited in this example embodiment.
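The example resolutions imply a 4x increase in side length at each super-resolution pass (64 to 256, then 256 to 1024). The factor of 4 is derived from those numbers rather than stated in the source, and `resolution_chain` is a hypothetical helper that just carries out the arithmetic.

```python
def resolution_chain(start_side, factors):
    """Return the side length after each successive upscaling factor."""
    sides = [start_side]
    for factor in factors:
        sides.append(sides[-1] * factor)
    return sides


sides = resolution_chain(64, [4, 4])  # [64, 256, 1024]
```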

In an example embodiment of the present disclosure, the method may be applied on a web page. Specifically, as shown in FIG. 14, the user's natural language text is sent to a web server; that is, the user sends a request to the web server. The web server sends the natural language text to the text-to-image model, which may generate the target image using the above method for generating an image from text and return it to the user via the web server.

As shown in FIG. 15, the user may preview images on the web page. After the user clicks the generate-image control, the generated images are displayed on the image display interface. The number of generated images may be customized according to user requirements and is not specifically limited in this example embodiment.

In summary, in this exemplary embodiment, coded image generation models of different time steps are used to sequentially process the intermediate text data corresponding to each time step to obtain the image encoding data, which increases model capacity and improves the quality of the generated image. Likewise, image decoding models of different time steps sequentially process the image encoding data corresponding to each time step, and image super-resolution models of different time steps perform super-resolution on the intermediate image corresponding to each time step, further improving the quality of the generated image. The image super-resolution model, the coded image generation model, and the image decoding model all use diffusion models, so the text-to-image generation has a strong ability to fit complex distributions and large data sets, while the generated images are of high quality, diverse, and readily editable.

It should be noted that the above figures are only schematic illustrations of the processes included in the method according to the exemplary embodiments of the present disclosure and are not intended to be limiting. It is readily understood that the processes shown in the figures do not indicate or limit their chronological order, and that these processes may be executed, for example, synchronously or asynchronously in multiple modules.

Further, referring to FIG. 16, this example embodiment also provides an apparatus 1600 for generating an image from text, including an encoding module 1610, a fusion module 1620, a processing module 1630, and a decoding module 1640. Specifically:

The encoding module 1610 may be configured to encode natural language text describing an image to obtain text encoding data.

The fusion module 1620 may be configured to fuse the text encoding data with first Gaussian noise of a first preset time step to obtain intermediate text data.

The processing module 1630 may be configured to use coded image generation models of different time steps to sequentially process the intermediate text data corresponding to each time step to obtain image encoding data.

In an example embodiment, the processing module 1630 may be configured to: acquire a first initial model and first training data, where the first training data includes a plurality of reference text encoding data and the ground-truth image encoding data corresponding to the reference text encoding data; input the reference text encoding data into the first initial model to obtain reference image encoding data; and train the first initial model based on the reference image encoding data and the ground-truth image encoding data to obtain the coded image generation model.

In an example embodiment, the first preset time step includes N stages, and the processing module 1630 may be configured to: process the intermediate text data of the Nth stage using the coded image generation model corresponding to the Nth stage to obtain the image encoding data of the Nth stage; process the image encoding data of the Nth stage using the coded image generation model corresponding to the (N-1)th stage to obtain the image encoding data of the (N-1)th stage; and output the image encoding data of the first stage, where N is a positive integer greater than 1.

The decoding module 1640 may be configured to decode the image encoding data to obtain a target image.

In an example embodiment, the decoding module 1640 may be configured to: decode the image encoding data using an image decoding model to obtain an intermediate image; and perform super-resolution processing on the intermediate image at least once to obtain the target image.

In an example embodiment, the decoding module 1640 may be configured to: fuse the image encoding data with second Gaussian noise of a second preset time step to obtain an image to be decoded; and use image decoding models of different time steps to sequentially decode the image to be decoded corresponding to each time step to obtain an intermediate image, where the set of time steps of the multiple image decoding models completely covers the second preset time step.

In this example embodiment, the decoding module 1640 may be configured to: decode the image to be decoded at the Mth stage using the image decoding model corresponding to the Mth stage to obtain the intermediate image of the Mth stage; decode the intermediate image of the Mth stage using the image decoding model corresponding to the (M-1)th stage to obtain the intermediate image of the (M-1)th stage; and output the intermediate image of the first stage, where M is a positive integer greater than 1.

In an example embodiment, the decoding module 1640 may be configured to: fuse the intermediate image with third Gaussian noise of a third preset time step to obtain an intermediate super-resolution image; and use image super-resolution models of different time steps to sequentially perform super-resolution on the intermediate super-resolution image corresponding to each time step to obtain the target image, where the set of time steps of the multiple image super-resolution models completely covers the third preset time step.

In this example embodiment, the third preset time step includes P stages, and the decoding module 1640 may be configured to: perform super-resolution on the intermediate image of the Pth stage using the image super-resolution model corresponding to the Pth stage to obtain the target image of the Pth stage; perform super-resolution on the target image of the Pth stage using the image super-resolution model corresponding to the (P-1)th stage to obtain the target image of the (P-1)th stage; and output the target image of the first stage, where P is a positive integer greater than 1.

The specific details of each module in the above apparatus have been described in detail in the method embodiments. For details not disclosed here, refer to the method embodiments; they are not repeated.
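The four-module structure of apparatus 1600 can be mirrored in code as a container that wires the modules together in order. This is a structural sketch only: the class name, the field types, and the toy callables are assumptions, and real modules would wrap trained models.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TextToImageApparatus:
    """Hypothetical counterpart of modules 1610, 1620, 1630 and 1640."""
    encoding_module: Callable[[str], Any]
    fusion_module: Callable[[Any], Any]
    processing_module: Callable[[Any], Any]
    decoding_module: Callable[[Any], Any]

    def generate(self, text: str) -> Any:
        text_code = self.encoding_module(text)              # text encoding data
        intermediate_text = self.fusion_module(text_code)   # fused with noise
        image_code = self.processing_module(intermediate_text)
        return self.decoding_module(image_code)             # target image


# Toy modules that make the data flow easy to follow by hand.
apparatus = TextToImageApparatus(
    encoding_module=lambda text: len(text),
    fusion_module=lambda code: code + 1,
    processing_module=lambda data: data * 2,
    decoding_module=lambda code: f"image<{code}>",
)
```

With these toy modules, `apparatus.generate("abc")` passes 3, then 4, then 8 through the chain and returns `"image<8>"`.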

Exemplary embodiments of the present disclosure also provide an electronic device for performing the above method for generating an image from text. The electronic device may be the terminal 110 or the server 120 described above. In general, the electronic device may include a processor and a memory, where the memory stores instructions executable by the processor, and the processor is configured to perform the above method for generating an image from text by executing the executable instructions.

The structure of the electronic device is described below by way of example, using the mobile terminal 1700 in FIG. 17. Those skilled in the art will appreciate that, apart from components intended specifically for mobile use, the structure in FIG. 17 can also be applied to stationary devices.

As shown in FIG. 17, the mobile terminal 1700 may specifically include a processor 1701, a memory 1702, a bus 1703, a mobile communication module 1704, an antenna 1, a wireless communication module 1705, an antenna 2, a display screen 1706, a camera module 1707, an audio module 1708, a power module 1709, and a sensor module 1710.

The processor 1701 may include one or more processing units. For example, the processor 1701 may include an AP (Application Processor), a modem processor, a GPU (Graphics Processing Unit), an ISP (Image Signal Processor), a controller, an encoder, a decoder, a DSP (Digital Signal Processor), a baseband processor, and/or an NPU (Neural-Network Processing Unit). The method for generating an image from text in this exemplary embodiment may be performed by the AP, the GPU, or the DSP; when the method involves neural-network-related processing, it may be performed by the NPU.

An encoder may encode (that is, compress) an image or video; for example, the target image may be encoded into a specific format to reduce the data size for storage or transmission. A decoder may decode (that is, decompress) the encoded data of an image or video to restore the image or video data; for example, the encoded data of the target image may be read and decoded by the decoder to restore the data of the target image, on which the processing related to generating an image from text is then performed. The mobile terminal 1700 may support one or more encoders and decoders, so that it can process images or videos in multiple encoding formats, for example image formats such as JPEG (Joint Photographic Experts Group), PNG (Portable Network Graphics), and BMP (Bitmap), and video formats such as MPEG (Moving Picture Experts Group) 1, MPEG2, H.263, H.264, and HEVC (High Efficiency Video Coding).

The processor 1701 may be connected to the memory 1702 or other components through the bus 1703.

The memory 1702 may be used to store computer-executable program code, which includes instructions. The processor 1701 executes the various functional applications and data processing of the mobile terminal 1700 by running the instructions stored in the memory 1702. The memory 1702 may also store application data, such as image and video files.

The communication functions of the mobile terminal 1700 may be implemented by the mobile communication module 1704, the antenna 1, the wireless communication module 1705, the antenna 2, the modem processor, the baseband processor, and the like. The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. The mobile communication module 1704 may provide mobile communication solutions such as 2G, 3G, 4G, and 5G for the mobile terminal 1700. The wireless communication module 1705 may provide wireless communication solutions such as wireless local area network, Bluetooth, and near field communication for the mobile terminal 1700.

The display screen 1706 is used to implement display functions, such as displaying user interfaces, images, and videos. The camera module 1707 is used to implement shooting functions, such as capturing images and videos. The audio module 1708 is used to implement audio functions, such as playing audio and capturing speech. The power module 1709 is used to implement power management functions, such as charging the battery, supplying power to the device, and monitoring the battery state. The sensor module 1710 may include a depth sensor 17101, a pressure sensor 17102, a gyroscope sensor 17103, an air pressure sensor 17104, and the like, to implement the corresponding sensing and detection functions.

Those skilled in the art will understand that various aspects of the present disclosure may be implemented as a system, a method, or a program product. Accordingly, various aspects of the present disclosure may be embodied as an entirely hardware implementation, an entirely software implementation (including firmware, microcode, and the like), or an implementation combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".

Exemplary embodiments of the present disclosure also provide a computer-readable storage medium storing a program product capable of implementing the method described above in this specification. In some possible implementations, various aspects of the present disclosure may also be implemented in the form of a program product that includes program code. When the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section of this specification.

It should be noted that the computer-readable medium described in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable signal medium, in contrast, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the two. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.

In addition, program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computing device (for example, through the Internet using an Internet service provider).

Other embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure indicated by the claims.

It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A method for generating an image from text, comprising:

encoding natural language text describing an image to obtain text encoding data;

fusing the text encoding data with first Gaussian noise of a first preset time step to obtain intermediate text data;

sequentially processing the intermediate text data corresponding to the time steps by using coded image generation models with different time steps to obtain encoded image data;

decoding the encoded image data to obtain a target image;

wherein the set of time steps of the plurality of coded image generation models completely covers the first preset time step.

2. The method according to claim 1, wherein the method further comprises:

acquiring a first initial model and first training data, wherein the first training data includes a plurality of reference text encoding data and ground-truth image encoding data corresponding to the reference text encoding data;

inputting the reference text encoding data into the first initial model to obtain reference image encoding data;

training the first initial model based on the reference image encoding data and the ground-truth image encoding data to obtain the coded image generation model.

3. The method according to claim 1, wherein the first preset time step comprises N stages, and said sequentially processing the intermediate text data corresponding to the time steps by using coded image generation models with different time steps to obtain encoded image data comprises:

processing the intermediate text data of the Nth stage by using the coded image generation model corresponding to the Nth stage to obtain the encoded image data of the Nth stage;

processing the encoded image data of the Nth stage by using the coded image generation model corresponding to the (N-1)th stage to obtain the encoded image data of the (N-1)th stage;

outputting the encoded image data of the first stage;

wherein N is a positive integer greater than 1.

4. The method according to claim 1, wherein said decoding the encoded image data to obtain a target image comprises:

decoding the encoded image data by using an image decoding model to obtain an intermediate image;

performing at least one super-resolution process on the intermediate image to obtain the target image.

5. The method according to claim 4, wherein said decoding the encoded image data by using an image decoding model to obtain an intermediate image comprises:

fusing the encoded image data with second Gaussian noise of a second preset time step to obtain an image to be decoded;

sequentially decoding the image to be decoded corresponding to the time steps by using image decoding models with different time steps to obtain the intermediate image;

wherein the set of time steps of the plurality of image decoding models completely covers the second preset time step.

6. The method according to claim 5, wherein the second preset time step comprises M stages, and said sequentially decoding the image to be decoded corresponding to the time steps by using image decoding models with different time steps to obtain the intermediate image comprises:

decoding the image to be decoded of the Mth stage by using the image decoding model corresponding to the Mth stage to obtain the intermediate image of the Mth stage;

decoding the intermediate image of the Mth stage by using the image decoding model corresponding to the (M-1)th stage to obtain the intermediate image of the (M-1)th stage;

outputting the intermediate image of the first stage;

wherein M is a positive integer greater than 1.

7. The method according to claim 4, wherein said performing at least one super-resolution process on the intermediate image to obtain the target image comprises:

fusing the intermediate image with third Gaussian noise of a third preset time step to obtain an intermediate super-resolution image;

sequentially performing super-resolution on the intermediate super-resolution image corresponding to the time steps by using image super-resolution models with different time steps to obtain the target image;

wherein the set of time steps of the plurality of image super-resolution models completely covers the third preset time step.

8. The method according to claim 7, wherein the third preset time step comprises P stages, and said sequentially performing super-resolution on the intermediate super-resolution image corresponding to the time steps by using image super-resolution models with different time steps to obtain the target image comprises:

performing super-resolution on the intermediate image of the Pth stage by using the image super-resolution model corresponding to the Pth stage to obtain the target image of the Pth stage;

performing super-resolution on the intermediate image of the Pth stage by using the image super-resolution model corresponding to the (P-1)th stage to obtain the target image of the (P-1)th stage;

outputting the target image of the first stage;

wherein P is a positive integer greater than 1.

9. A device for generating an image from text, comprising:

an encoding module, configured to encode natural language text describing an image to obtain text encoding data;

a fusion module, configured to fuse the text encoding data with first Gaussian noise of a first preset time step to obtain intermediate text data;

a processing module, configured to sequentially process the intermediate text data corresponding to the time steps by using coded image generation models with different time steps to obtain encoded image data;

a decoding module, configured to decode the encoded image data to obtain a target image;

10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for generating an image from text according to any one of claims 1 to 8.

11. An electronic device, comprising:

one or more processors; and

a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for generating an image from text according to any one of claims 1 to 8.
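
By way of illustration only, and without limiting the claims, the stage-wise processing recited in claims 1 and 3 may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the disclosure: the names `Stage`, `covers_all_steps`, `run_stages` and `make_toy_model` are hypothetical, each stage is assumed to own an inclusive range of integer time steps, and the toy model merely halves its input in place of a trained coded image generation model. The sketch shows two things: checking that the stages' time steps together cover the preset time step, and processing the data from the highest time step down to the first while switching to the model of whichever stage owns the current step.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Hypothetical model signature: (current data, time step) -> data for the next lower step.
StageModel = Callable[[Vector, int], Vector]


@dataclass(frozen=True)
class Stage:
    """One stage: an inclusive range of time steps and the model assigned to it."""
    t_low: int
    t_high: int
    model: StageModel


def covers_all_steps(stages: Sequence[Stage], total_steps: int) -> bool:
    """Return True if the stage ranges together include every step 1..total_steps."""
    covered = set()
    for stage in stages:
        covered.update(range(stage.t_low, stage.t_high + 1))
    return covered >= set(range(1, total_steps + 1))


def run_stages(data: Vector, stages: Sequence[Stage], total_steps: int) -> Vector:
    """Process data from step total_steps down to step 1, using at each step
    the model of the first stage whose range contains that step."""
    if not covers_all_steps(stages, total_steps):
        raise ValueError("stage time steps do not cover the preset time step")
    for t in range(total_steps, 0, -1):
        stage = next(s for s in stages if s.t_low <= t <= s.t_high)
        data = stage.model(data, t)
    return data


def make_toy_model(name: str, log: List[Tuple[int, str]]) -> StageModel:
    """Build a stand-in model (an assumption): it halves every value and
    records which stage handled which time step."""
    def model(data: Vector, t: int) -> Vector:
        log.append((t, name))
        return [x * 0.5 for x in data]
    return model


# Usage: two stages (N = 2) covering a preset time step of 4.
log: List[Tuple[int, str]] = []
stages = [
    Stage(1, 2, make_toy_model("stage-1", log)),
    Stage(3, 4, make_toy_model("stage-2", log)),
]
encoded_image = run_stages([16.0, -8.0], stages, total_steps=4)
print(encoded_image)
print(log)
```

In this sketch the stage-2 model handles time steps 4 and 3 and the stage-1 model handles time steps 2 and 1, mirroring the order "Nth stage, then (N-1)th stage, then first stage" of claim 3. The same loop could in principle be reused for the image decoding models of claim 6 and the image super-resolution models of claim 8, again only as an assumption about how such stages might be organized.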