CN117808923A

CN117808923A - Image generation method, system, electronic device and readable storage medium

Info

Publication number: CN117808923A
Application number: CN202410224976.0A
Authority: CN
Inventors: 范宝余; 李晓川; 赵雅倩; 李仁刚; 郭振华
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: IEIT Systems Co Ltd
Priority date: 2024-02-29
Filing date: 2024-02-29
Publication date: 2024-04-02
Anticipated expiration: 2044-02-29
Also published as: CN117808923B

Abstract

The invention discloses an image generation method, system, electronic device and readable storage medium, which relates to the field of image content generation. In order to solve the problem that the solution of generating images from pure text cannot meet the emotional needs in specific task scenarios, the image generation method includes: Obtain semantic guidance text and emotion guidance text; retrieve multiple reference image samples based on the semantic guidance text and emotion guidance text; extract features of multiple reference image samples, and combine at least two features among all features to obtain multiple image combinations Semantic features; obtain the text semantic features corresponding to the semantic guidance text, and generate associated images based on the image with the highest similarity to the text semantic features by combining the semantic features. The invention can improve image generation accuracy, make the generated associated images highly associated with guidance text and emotional text, and satisfy the semantic text requirements of the task scene while also meeting the emotional needs of the task scene.

Description

An image generation method, system, electronic device and readable storage medium

技术领域Technical field

本发明涉及图像内容生成领域，特别涉及一种图像生成方法、系统、电子设备及可读存储介质。The present invention relates to the field of image content generation, and in particular to an image generation method, system, electronic device and readable storage medium.

背景技术Background technique

图像内容生成是一种根据给定模态的指导输入（如文本、3D（Three Dimensions，三维）、点云或其他形式的信息）生成图像内容的技术。随着技术的迭代更新，AI（Artificial Intelligent，人工智能）图像内容生成逐渐成为互联网内容的重要来源。作为最主要的图像AIGC（Artificial Intelligent Generated Content，人工智能生成内容）手段，根据纯文本生成图像是最常见的任务形式，但是在某些任务场景下，如文本插图任务场景下，不仅需要生成的图像与任务场景的语义文本相同，还需要生成的图像满足该任务场景下的指定情感，但是现有的纯文本生成图像的方案并不能满足某些任务场景下的情感需求。Image content generation is a technique for generating image content based on guidance input in a given modality, such as text, 3D (Three Dimensions, three-dimensional), point clouds, or other forms of information. With the iterative update of technology, AI (Artificial Intelligent, artificial intelligence) image content generation has gradually become an important source of Internet content. As the most important image AIGC (Artificial Intelligent Generated Content) method, generating images based on plain text is the most common task form. However, in some task scenarios, such as text illustration task scenarios, not only the generated The image is the same as the semantic text of the task scene, and the generated image also needs to meet the specified emotion in the task scene. However, the existing pure text-based image generation solution cannot meet the emotional needs of some task scenes.

因此，如何提供一种解决上述技术问题的方案是本领域技术人员目前需要解决的问题。Therefore, how to provide a solution to the above technical problems is a problem that those skilled in the art currently need to solve.

发明内容Contents of the invention

本发明的目的是提供一种图像生成方法、系统、电子设备及可读存储介质，能够提高图像生成精度，使生成的关联图像与指导文本和情绪文本高度关联，在满足任务场景的语义文本要求的同时，满足该任务场景下的情感需求。The purpose of the present invention is to provide an image generation method, system, electronic device and readable storage medium that can improve the accuracy of image generation and make the generated associated images highly relevant to the guidance text and emotional text, while meeting the semantic text requirements of the task scene. while meeting the emotional needs of the task scenario.

为解决上述技术问题，本发明提供了一种图像生成方法，包括：In order to solve the above technical problems, the present invention provides an image generation method, comprising:

获取语义指导文本和情绪指导文本；Get semantic guidance text and emotion guidance text;

基于所述语义指导文本和所述情绪指导文本检索得到多个参考图像样本；Retrieve multiple reference image samples based on the semantic guidance text and the emotion guidance text;

提取多个所述参考图像样本的特征，对所有所述特征中的至少两个所述特征进行组合得到多个图像组合语义特征；Extract features of multiple reference image samples, and combine at least two of all the features to obtain multiple image combination semantic features;

获取所述语义指导文本对应的文本语义特征，基于与所述文本语义特征的相似度最高的图像组合语义特征生成关联图像。Text semantic features corresponding to the semantic guidance text are obtained, and semantic features are combined based on the image with the highest similarity to the text semantic features to generate a related image.

其中，基于所述语义指导文本和所述情绪指导文本检索得到多个参考图像样本的过程包括：Wherein, the process of retrieving multiple reference image samples based on the semantic guidance text and the emotion guidance text includes:

基于所述语义指导文本和所述情绪指导文本进行网页检索；Perform web page retrieval based on the semantic guidance text and the emotion guidance text;

根据检索到的前n条网页构建关联内容集合，所述关联内容集合包括每条所述网页对应的关联内容，所述关联内容包括所述网页的标题文本和内容文本，n为正整数；Construct an associated content set based on the first n retrieved web pages, the associated content set includes associated content corresponding to each of the web pages, the associated content includes the title text and content text of the web page, n is a positive integer;

在所述关联内容集合中选择与所述语义指导文本和所述情绪指导文本的综合关联性最强的最优关联内容；Select the optimal related content from the related content set that has the strongest comprehensive correlation with the semantic guidance text and the emotional guidance text;

基于所述最优关联内容检索得到多个参考图像样本。A plurality of reference image samples are retrieved based on the optimal associated content.

其中，基于所述语义指导文本和所述情绪指导文本进行网页检索的过程包括：Wherein, the process of web page retrieval based on the semantic guidance text and the emotion guidance text includes:

对所述语义指导文本和所述情绪指导文本进行拼接，得到检索文本；Splice the semantic guidance text and the emotion guidance text to obtain the retrieval text;

将所述检索文本输入搜索引擎接口，以便对所述检索文本进行网页检索。The retrieval text is input into the search engine interface to perform web page retrieval on the retrieval text.

其中，根据检索到的前n条网页构建关联内容集合的过程包括：Among them, the process of constructing the associated content collection based on the first n retrieved web pages includes:

提取检索到的前n条网页的标题文本和内容文本；Extract the title text and content text of the first n web pages retrieved;

在本地存储空间以字典形式存储每条所述网页对应的标题文本和内容文本，得到关联内容集合；所述字典的键为所述标题文本，所述字典的值为所述内容文本。The title text and content text corresponding to each web page are stored in the local storage space in the form of a dictionary to obtain an associated content collection; the key of the dictionary is the title text, and the value of the dictionary is the content text.

其中，在所述关联内容集合中选择与所述语义指导文本和所述情绪指导文本的综合关联性最强的最优关联内容的过程包括：The process of selecting the optimal associated content with the strongest comprehensive association with the semantic guidance text and the emotion guidance text from the associated content set includes:

针对所述关联内容集合中的每条所述内容文本，基于所述内容文本与所述语义指导文本对应的语义关联得分以及所述内容文本与所述情绪指导文本对应的情绪关联得分，得到所述内容文本的综合得分；For each of the content texts in the associated content set, based on the semantic association score corresponding to the content text and the semantic guidance text and the emotion association score corresponding to the content text and the emotion guidance text, a comprehensive score of the content text is obtained;

将包括所述综合得分最高的所述内容文本的关联内容确定为与所述语义指导文本和所述情绪指导文本的综合关联性最强的最优关联内容。The related content including the content text with the highest comprehensive score is determined as the optimal related content with the strongest comprehensive correlation with the semantic guidance text and the emotion guidance text.

其中，基于所述内容文本与所述语义指导文本对应的语义关联得分以及所述内容文本与所述情绪指导文本对应的情绪关联得分，得到所述内容文本的综合得分的过程包括：Wherein, based on the semantic association score corresponding to the content text and the semantic guidance text and the emotion association score corresponding to the content text and the emotion guidance text, the process of obtaining the comprehensive score of the content text includes:

确定所述内容文本中与所述语义指导文本匹配的语义相关文本，以及所述语义相关文本中与所述情绪指导文本匹配的情绪相关文本；Determining semantically related texts in the content text that match the semantic guidance text, and emotion-related texts in the semantically related texts that match the emotion guidance text;

基于所述语义相关文本确定所述内容文本的语义关联得分；Determining a semantic relevance score of the content text based on the semantically related text;

基于所述情绪相关文本确定所述内容文本的情绪关联得分；determining an emotion relevance score for the content text based on the emotion-related text;

利用所述语义关联得分和所述情绪关联得分确定所述内容文本的综合得分。A comprehensive score of the content text is determined using the semantic relevance score and the emotional relevance score.

其中，基于所述语义相关文本确定所述内容文本的语义关联得分的过程包括：Wherein, the process of determining the semantic relevance score of the content text based on the semantically relevant text includes:

将所述语义相关文本的字符长度占所述内容文本的字符长度的比值确定为所述内容文本的语义关联得分；Determine the ratio of the character length of the semantically related text to the character length of the content text as the semantic association score of the content text;

基于所述情绪相关文本确定所述内容文本的情绪关联得分的过程包括：The process of determining the emotion relevance score of the content text based on the emotion-related text includes:

将所述情绪相关文本的字符长度占所述语义相关文本的字符长度的比值确定为所述内容文本的情绪关联得分。The ratio of the character length of the emotion-related text to the character length of the semantic-related text is determined as the emotion correlation score of the content text.

其中，利用所述语义关联得分和所述情绪关联得分确定所述内容文本的综合得分的过程包括：Wherein, the process of using the semantic association score and the emotional association score to determine the comprehensive score of the content text includes:

将所述语义关联得分和所述情绪关联得分的乘积作为所述内容文本的综合得分。The product of the semantic association score and the emotional association score is used as the comprehensive score of the content text.

其中，基于所述最优关联内容检索得到多个参考图像样本的过程包括：Wherein, the process of retrieving multiple reference image samples based on the optimal associated content includes:

基于所述最优关联内容进行图像检索，得到多个候选图像样本；Perform image retrieval based on the optimal associated content to obtain multiple candidate image samples;

利用所述情绪指导文本和所述语义指导文本在多个所述候选图像样本中筛选出多个参考图像样本。Using the emotion guidance text and the semantic guidance text, a plurality of reference image samples are screened out from a plurality of candidate image samples.

其中，利用所述情绪指导文本和所述语义指导文本在多个所述候选图像样本中筛选出多个参考图像样本的过程包括：Wherein, the process of using the emotional guidance text and the semantic guidance text to select multiple reference image samples from multiple candidate image samples includes:

提取每一所述候选图像样本的图像摘要文本；Extract the image summary text of each candidate image sample;

对输入文本和每一所述图像摘要文本进行图像元素互斥性计算，得到每一所述图像摘要文本的视觉得分，所述输入文本包括所述语义指导文本和所述情绪指导文本；Perform image element mutual exclusivity calculation on the input text and each image summary text to obtain a visual score of each image summary text, where the input text includes the semantic guidance text and the emotion guidance text;

将所述视觉得分超过预设值的候选图像样本确定为参考图像样本。Candidate image samples whose visual scores exceed a preset value are determined as reference image samples.

其中，对输入文本和每一所述图像摘要文本进行图像元素互斥性计算，得到每一所述图像摘要文本的视觉得分的过程包括：Wherein, the process of calculating the mutual exclusivity of image elements between the input text and each image summary text, and obtaining the visual score of each image summary text includes:

提取每一所述图像摘要文本的第一实体元素和第一实体关系以及输入文本的第二实体元素和第二实体关系；Extract the first entity element and the first entity relationship of each image summary text and the second entity element and the second entity relationship of the input text;

将所述第一实体元素中不存在与所述第二实体元素不同的实体元素且所述第一实体关系中不包括与所述第二实体关系不同的实体关系的图像摘要文本确定为候选摘要文本；The image summary text in which the first entity element does not have an entity element different from the second entity element and the first entity relationship does not include an entity relationship different from the second entity relationship is determined as a candidate summary. text;

计算每一所述候选摘要文本与所述输入文本的一致性描述得分，将所述一致性描述得分作为所述候选摘要文本的视觉得分。The consistency description score of each candidate summary text and the input text is calculated, and the consistency description score is used as the visual score of the candidate summary text.

其中，提取多个所述参考图像样本的特征，对所有所述特征中的至少两个所述特征进行组合得到多个图像组合语义特征的过程包括：The process of extracting features of multiple reference image samples and combining at least two of all features to obtain multiple image combination semantic features includes:

提取多个所述参考图像样本的特征；Extract features of a plurality of reference image samples;

对所有所述特征进行聚类，得到多个一级语义特征；Clustering all the features to obtain multiple first-level semantic features;

根据所述一级语义特征的数量构造注意力掩码矩阵，Construct an attention mask matrix according to the number of first-level semantic features,

利用所述一级语义特征和所述注意力掩码矩阵得到多个图像组合语义特征。The first-level semantic features and the attention mask matrix are used to obtain multiple image combination semantic features.

其中，利用所述一级语义特征和所述注意力掩码矩阵得到多个图像组合语义特征的过程包括：Wherein, the process of obtaining multiple image combined semantic features using the first-level semantic features and the attention mask matrix includes:

利用第一关系式得到多个图像组合语义特征，所述第一关系式为The first relational expression is used to obtain the semantic features of multiple image combinations, and the first relational expression is

； ;

其中，transformer为基于注意力机制的模型，g为所述一级语义特征，softmax为概率归一化函数，W_q为查询参数权重，W_k为链参数权重，W_v为值参数权重，Mask[:,k]为所述注意力掩码矩阵的第k列的选择参数，size（g）为所述一级语义特征的维度大小，为转置符号。Among them, transformer is a model based on the attention mechanism, g is the first-level semantic feature, softmax is the probability normalization function, W _q is the query parameter weight, W _k is the chain parameter weight, W _v is the value parameter weight, Mask [:,k] is the selection parameter of the k-th column of the attention mask matrix, size (g) is the dimension size of the first-level semantic feature, is the transposition symbol.

其中，对所有所述特征进行聚类，得到多个一级语义特征的过程包括：Among them, the process of clustering all the features to obtain multiple first-level semantic features includes:

对所有所述特征进行聚类，得到多个一级语义特征及每个所述一级语义特征下的二级语义特征；Cluster all the features to obtain multiple first-level semantic features and second-level semantic features under each of the first-level semantic features;

所述图像生成方法还包括：The image generation method also includes:

构建语义特征分布森林，所述语义特征分布森林包括多个树特征，每一所述树特征的树干特征为所述一级语义特征，每一所述树干特征的树枝特征为所述一级语义特征下的二级语义特征；Construct a semantic feature distribution forest. The semantic feature distribution forest includes multiple tree features. The trunk feature of each tree feature is the first-level semantic feature. The branch feature of each tree trunk feature is the first-level semantic feature. Second-level semantic features under features;

基于与所述文本语义特征的相似度最高的图像组合语义特征生成关联图像的过程包括：The process of generating an associated image based on the image combination semantic feature having the highest similarity to the text semantic feature comprises:

基于与所述文本语义特征的相似度最高的图像组合语义特征对应的所述注意力掩码矩阵的选择参数确定最优树特征；Determine optimal tree features based on the selection parameters of the attention mask matrix corresponding to the image combination semantic feature with the highest similarity to the text semantic feature;

利用所述最优树特征得到图像筛选特征；Obtaining image screening features using the optimal tree features;

基于所述图像筛选特征和所述文本语义特征生成关联图像。A related image is generated based on the image filtering features and the text semantic features.

其中，基于所述图像筛选特征和所述文本语义特征生成关联图像的过程包括：Wherein, the process of generating associated images based on the image filtering features and the text semantic features includes:

利用所述图像筛选特征得到条件噪声初始图像；Using the image screening features to obtain a conditional noise initial image;

基于所述条件噪声初始图像和所述文本语义特征生成关联图像。A correlation image is generated based on the conditional noise initial image and the text semantic features.

其中，对所有所述特征进行聚类的过程包括：Among them, the process of clustering all the features includes:

计算任意两个所述特征间的欧式距离；Calculate the Euclidean distance between any two of the features;

针对每一所述特征，确定所述欧式距离小于第一预设距离的数量，当所述数量不小于预设数量，将所述特征划分至密集特征子集，当所述数量小于所述预设数量，将所述特征划分至非密集特征子集；For each feature, determine the number by which the Euclidean distance is less than the first preset distance. When the number is not less than the preset number, divide the features into dense feature subsets. When the number is less than the preset distance, Assuming a quantity, divide the features into non-dense feature subsets;

确定一个子类，将所述密集特征子集中的任一个特征加入到所述子类并从所述密集特征子集中剔除；Determine a subcategory, add any feature in the dense feature subset to the subcategory and remove it from the dense feature subset;

计算所述子类中的所有特征与所述密集特征子集中的所有特征之间的最小欧式距离，判断所述密集特征子集中是否存在第一待剔除特征，若是，将所述第一待剔除特征加入到所述子类并从所述密集特征子集中剔除，重复本步骤，直至所述密集特征子集中不存在所述第一待剔除特征，所述第一待剔除特征为所述密集特征子集中与所述子类中的特征之间的最小欧式距离小于第二预设距离的特征；Calculate the minimum Euclidean distance between all features in the subcategory and all features in the dense feature subset, determine whether there is a first feature to be eliminated in the dense feature subset, and if so, remove the first feature to be eliminated Features are added to the subcategory and eliminated from the dense feature subset. Repeat this step until the first feature to be eliminated does not exist in the dense feature subset, and the first feature to be eliminated is the dense feature. Features whose minimum Euclidean distance between the subset and features in the subcategory is less than the second preset distance;

计算所述子类中的所有特征与所述非密集特征子集中的所有特征之间的最小欧式距离，确定所述非密集特征子集中是否存在第二待剔除特征，若是，将所述第二待剔除特征加入到所述子类并从所述非密集特征子集中剔除，重复本步骤，直至所述非密集特征子集中不存在所述第二待剔除特征，所述第二待剔除特征为所述非密集特征子集中与所述子类中的特征之间的最小欧式距离小于所述第二预设距离的特征；Calculate the minimum Euclidean distance between all features in the subcategory and all features in the non-dense feature subset, determine whether there is a second feature to be eliminated in the non-dense feature subset, and if so, remove the second feature The features to be eliminated are added to the subcategory and eliminated from the non-dense feature subset. This step is repeated until the second feature to be eliminated does not exist in the non-dense feature subset. The second feature to be eliminated is Features whose minimum Euclidean distance between the non-dense feature subset and the features in the subcategory is less than the second preset distance;

将所述子类加入到预设聚类集合中。The subclass is added to a preset cluster set.

其中，得到多个一级语义特征的过程包括：Among them, the process of obtaining multiple first-level semantic features includes:

按照第二关系式对所述预设聚类集合中的所有所述子类计算其包括的所有所述特征的加权和，基于所述加权和得到所述一级语义特征；Calculate the weighted sum of all the features included in the preset cluster set for all the subcategories in the preset cluster set according to the second relational expression, and obtain the first-level semantic feature based on the weighted sum;

所述第二关系式为；The second relational expression is ;

其中，t为第b个子类中的特征的个数，f_b为所述第b个子类的加权和，f_t为对所述第b个子类遍历过程中所述第b个子类中的当前特征，f_p为所述遍历过程中的每一个所述特征，为所述第一预设距离或所述第二预设距离，dis（f_t,f_p）为f_t和f_p间的欧式距离，为所述第b个子类中满足/>的特征的数量。Among them, t is the number of features in the b-th sub-category, f _b is the weighted sum of the b-th sub-category, f _t is the current feature in the b-th sub-category during the traversal of the b-th sub-category. Features, f _p is each feature in the traversal process, is the first preset distance or the second preset distance, dis( _ft , _fp ) is the Euclidean distance between _ft and _fp , Satisfies/> for the b-th subcategory the number of features.

为解决上述技术问题，本发明还提供了一种图像生成系统，包括：In order to solve the above technical problems, the present invention also provides an image generation system, including:

获取模块，用于获取语义指导文本和情绪指导文本；The acquisition module is used to obtain semantic guidance text and emotional guidance text;

检索模块，用于基于所述语义指导文本和所述情绪指导文本检索得到多个参考图像样本；A retrieval module configured to retrieve multiple reference image samples based on the semantic guidance text and the emotion guidance text;

提取模块，用于提取多个所述参考图像样本的特征，对所有所述特征中的至少两个所述特征进行组合得到多个图像组合语义特征；An extraction module, configured to extract features of a plurality of reference image samples, and combine at least two of all the features to obtain multiple image combination semantic features;

生成模块，用于获取所述语义指导文本对应的文本语义特征，基于与所述文本语义特征的相似度最高的图像组合语义特征生成关联图像。A generation module is used to obtain text semantic features corresponding to the semantic guidance text, and generate associated images based on the image with the highest similarity to the text semantic features and the semantic features.

为解决上述技术问题，本发明还提供了一种电子设备，包括：In order to solve the above technical problems, the present invention also provides an electronic device, including:

存储器，用于存储计算机程序；Memory, used to store computer programs;

处理器，用于执行所述计算机程序时实现如上文任意一项所述的图像生成方法的步骤。A processor is used to implement the steps of the image generation method as described in any one of the above when executing the computer program.

为解决上述技术问题，本发明还提供了一种计算机可读存储介质，所述计算机可读存储介质上存储有计算机程序，所述计算机程序被处理器执行时实现如上文任意一项所述的图像生成方法的步骤。In order to solve the above technical problems, the present invention also provides a computer-readable storage medium. A computer program is stored on the computer-readable storage medium. When the computer program is executed by a processor, any one of the above mentioned tasks can be implemented. Steps of the image generation method.

本发明提供了一种图像生成方法，基于语义指导文本和情绪指导文本在网页中进行检索，获取多个与情绪指导文本和语义指导文本对应的参考图像样本，便于后续根据多个参考图像生成关联图像，提高图像生成精度，使生成的关联图像与指导文本和情绪文本高度关联，在满足任务场景的语义文本要求的同时，满足该任务场景下的情感需求。本发明还提供了一种图像生成系统、电子设备及计算机可读存储介质，具有和上述图像生成系统相同的有益效果。The present invention provides an image generation method that searches web pages based on semantic guidance text and emotion guidance text, and obtains multiple reference image samples corresponding to the emotion guidance text and semantic guidance text to facilitate subsequent generation of associations based on multiple reference images. Images, improve the accuracy of image generation, so that the generated associated images are highly related to the guidance text and emotional text, which not only meet the semantic text requirements of the task scene, but also meet the emotional needs of the task scene. The present invention also provides an image generation system, electronic equipment and a computer-readable storage medium, which have the same beneficial effects as the above image generation system.

附图说明Description of drawings

为了更清楚地说明本发明实施例，下面将对实施例中所需要使用的附图做简单的介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the embodiments of the present invention, the following briefly introduces the drawings required for use in the embodiments. Obviously, the drawings described below are only some embodiments of the present invention. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.

图1为本发明所提供的一种图像生成方法的步骤流程图；FIG1 is a flowchart of the steps of an image generation method provided by the present invention;

图2为本发明所提供的一种关联内容匹配筛选示意图；Figure 2 is a schematic diagram of associated content matching and screening provided by the present invention;

图3为本发明所提供的一种语义关联强化示意图；Figure 3 is a schematic diagram of semantic association enhancement provided by the present invention;

图4为本发明实施例所提供的一种语义特征分布森林结构示意图；Figure 4 is a schematic diagram of a semantic feature distribution forest structure provided by an embodiment of the present invention;

图5为本发明实施例所提供的一种注意力掩码矩阵示意图；Figure 5 is a schematic diagram of an attention mask matrix provided by an embodiment of the present invention;

图6为本发明所提供的一种关联图像生成示意图；FIG6 is a schematic diagram of generating an associated image provided by the present invention;

图7为本发明所提供的一种图像生成系统的结构示意图；Figure 7 is a schematic structural diagram of an image generation system provided by the present invention;

图8为本发明所提供的一种电子设备的结构示意图；Figure 8 is a schematic structural diagram of an electronic device provided by the present invention;

图9为本发明所提供的一种计算机可读存储介质的结构示意图。FIG. 9 is a schematic diagram of the structure of a computer-readable storage medium provided by the present invention.

具体实施方式Detailed ways

本发明的核心是提供一种图像生成方法、系统、电子设备及可读存储介质，能够提高图像生成精度，使生成的关联图像与指导文本和情绪文本高度关联，在满足任务场景的语义文本要求的同时，满足该任务场景下的情感需求。The core of the present invention is to provide an image generation method, system, electronic device and readable storage medium, which can improve the accuracy of image generation, make the generated associated images highly relevant to the guidance text and emotional text, and meet the semantic text requirements of the task scene. while meeting the emotional needs of the task scenario.

为使本发明实施例的目的、技术方案和优点更加清楚，下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。In order to make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments These are some embodiments of the present invention, rather than all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of the present invention.

第一方面，请参照图1，图1为本发明所提供的一种图像生成方法的步骤流程图，该图像生成方法包括：In the first aspect, please refer to Figure 1, which is a step flow chart of an image generation method provided by the present invention. The image generation method includes:

S101：获取语义指导文本和情绪指导文本；S101: Obtain semantic guidance text and emotion guidance text;

本实施例中，考虑到在图像内容生成领域，图像内容生成旨在根据文本语义生成准确的图像，图像的内容精确符合文本中的语义指导，在某些任务场景下，图像生成任务不仅要求生成的图像的内容和语义指导相符，还需要配合相关的情绪。示例性地，在文本插图任务场景下，需要生成指定情感的图像，或在心理治疗任务场景下，需要尽量生成正面情绪相关的图像，在销售任务场景下，需要尽量生成使人产生兴奋色彩、令人高兴的图像。为了使生成的图像既符合语义指导，又能够配合相关的情绪，本实施例中，获取用户通过交互装置输入的语义指导文本和情绪指导文本。示例性地，用户的输入文本包括雪地上的小狗以及悲伤，获取到的语义指导文本为雪地上的小狗，获取到的情绪指导文本为悲伤。In this embodiment, considering that in the field of image content generation, image content generation aims to generate accurate images according to text semantics, and the content of the image precisely conforms to the semantic guidance in the text, in certain task scenarios, the image generation task not only requires the content of the generated image to be consistent with the semantic guidance, but also needs to be matched with relevant emotions. For example, in the text illustration task scenario, it is necessary to generate images of specified emotions, or in the psychotherapy task scenario, it is necessary to generate images related to positive emotions as much as possible, and in the sales task scenario, it is necessary to generate images that make people excited and happy as much as possible. In order to make the generated image conform to the semantic guidance and be able to match the relevant emotions, in this embodiment, the semantic guidance text and the emotional guidance text input by the user through the interactive device are obtained. For example, the user's input text includes a puppy on the snow and sadness, the semantic guidance text obtained is a puppy on the snow, and the emotional guidance text obtained is sadness.

S102：基于语义指导文本和情绪指导文本检索得到多个参考图像样本；S102: Obtain multiple reference image samples based on semantic guidance text and emotion guidance text retrieval;

本实施例中，将语义指导文本和情绪指导文本输入搜索引擎进行检索，得到多条关联内容，从多条关联内容中确定与语义指导文本和情绪指导文本关联程度最高的最优关联内容，本实施例中获取最优关联内容的目的是便于后续筛选参考图像样本。In this embodiment, the semantic guidance text and the emotion guidance text are input into the search engine for retrieval, and multiple pieces of associated content are obtained. From the multiple pieces of associated content, the optimal associated content with the highest degree of correlation with the semantic guidance text and the emotion guidance text is determined. The purpose of obtaining optimal correlation content in the embodiment is to facilitate subsequent screening of reference image samples.

在确定最优关联内容后，根据最优关联内容在搜索引擎中搜索相关的图像，然后再基于情绪指导文本和语义指导文本在所有检索到的图像中进行筛选，筛选出多个与输入文本相关的图像作为参考图像样本，此处参考图像样本是一种视觉强化后的样本，基于此生成的关联图像更能满足语义和情感需求。After determining the optimal associated content, search for relevant images in the search engine based on the optimal associated content, and then filter all retrieved images based on the emotional guidance text and semantic guidance text to select multiple images related to the input text. The image is used as a reference image sample. The reference image sample here is a visually enhanced sample, and the associated image generated based on this can better meet the semantic and emotional needs.

S103：提取多个参考图像样本的特征，对所有特征中的至少两个特征进行组合得到多个图像组合语义特征；S103: extracting features of multiple reference image samples, and combining at least two features among all the features to obtain multiple image combined semantic features;

本实施例中，对多个参考图像样本的特征进行提取，并对其进行所有形式的组合，得到图像组合语义特征，每个图像组合语义特征中至少包括两个特征。In this embodiment, features of multiple reference image samples are extracted and combined in all forms to obtain image combination semantic features. Each image combination semantic feature includes at least two features.

S104：获取语义指导文本对应的文本语义特征，基于与文本语义特征的相似度最高的图像组合语义特征生成关联图像。S104: Acquire text semantic features corresponding to the semantic guidance text, and generate associated images based on image combination semantic features with the highest similarity to the text semantic features.

本实施例中，首先获取语义指导文本对应的文本语义特征，根据文本语义特征与各个图像组合语义特征的相似度确定最优图像组合语义特征，将相似度最高的图像组合语义特征确定为最优图像组合语义特征，基于最优图像组合语义特征生成关联图像。可以理解，根据相似度对参考图像样本中的某些特定元素进行参考，使生成的关联图像中的内容更像所参考的若干图像中的模样，从而提高生成关联的图像的准确性。In this embodiment, the text semantic features corresponding to the semantic guidance text are first obtained, the optimal image combination semantic features are determined based on the similarity between the text semantic features and the image combination semantic features, and the image combination semantic features with the highest similarity are determined as the optimal Image combination semantic features generate related images based on optimal image combination semantic features. It can be understood that some specific elements in the reference image sample are referenced based on the similarity, so that the content in the generated associated image is more like the appearance of several referenced images, thereby improving the accuracy of generating the associated image.

可见，本实施例中，基于语义指导文本和情绪指导文本在网页中进行检索，获取多个与情绪指导文本和语义指导文本对应的参考图像样本，便于后续根据多个参考图像生成关联图像，提高图像生成精度，使生成的关联图像与指导文本和情绪文本高度关联，在满足任务场景的语义文本要求的同时，满足该任务场景下的情感需求。It can be seen that in this embodiment, the web page is retrieved based on the semantic guidance text and the emotion guidance text, and multiple reference image samples corresponding to the emotion guidance text and the semantic guidance text are obtained, so as to facilitate subsequent generation of associated images based on the multiple reference images, and improve The accuracy of image generation makes the generated associated images highly relevant to the guidance text and emotional text, which not only meets the semantic text requirements of the task scene, but also meets the emotional needs of the task scene.

在上述实施例的基础上：Based on the above embodiments:

在一示例性实施例中，基于语义指导文本和情绪指导文本检索得到多个参考图像样本的过程包括：In an exemplary embodiment, the process of retrieving multiple reference image samples based on semantic guidance text and emotion guidance text includes:

基于语义指导文本和情绪指导文本进行网页检索；Web page retrieval based on semantic guidance text and emotion guidance text;

根据检索到的前n条网页构建关联内容集合，关联内容集合包括每条网页对应的关联内容，关联内容包括网页的标题文本和内容文本，n为正整数；Construct an associated content collection based on the first n retrieved web pages. The associated content collection includes the associated content corresponding to each web page. The associated content includes the title text and content text of the web page. n is a positive integer;

在关联内容集合中选择与语义指导文本和情绪指导文本的综合关联性最强的最优关联内容；Select the optimal related content from the related content collection that has the strongest comprehensive correlation with the semantic guidance text and the emotional guidance text;

基于最优关联内容检索得到多个参考图像样本。Multiple reference image samples are obtained based on optimal associated content retrieval.

在一示例性实施例中，基于语义指导文本和情绪指导文本进行网页检索的过程包括：In an exemplary embodiment, the process of web page retrieval based on semantic guidance text and emotion guidance text includes:

对语义指导文本和情绪指导文本进行拼接，得到检索文本；The semantic guidance text and the emotional guidance text are concatenated to obtain the retrieval text;

将检索文本输入搜索引擎接口，以便对检索文本进行网页检索。Enter the search text into the search engine interface so that the search text can be retrieved from the web.

在一示例性实施例中，根据检索到的前n条网页构建关联内容集合的过程包括：In an exemplary embodiment, the process of constructing a set of related content based on the first n retrieved web pages includes:

在本地存储空间以字典形式存储每条网页对应的标题文本和内容文本，得到关联内容集合；字典的键为标题文本，字典的值为内容文本。The title text and content text corresponding to each web page are stored in the local storage space in the form of a dictionary to obtain a collection of related content; the key of the dictionary is the title text, and the value of the dictionary is the content text.

本实施例中，首先将情绪指导文本和语义指导文本拼接在一起，得到一条合并指导文本，本实施例在此对情绪指导文本和语义指导文本的拼接先后顺序不做限定，可以将情绪指导文本拼接在语义指导文本后面，也可以将语义指导文本拼接在情绪指导文本后面。In this embodiment, the emotion guidance text and the semantic guidance text are first spliced together to obtain a combined guidance text. This embodiment does not limit the order in which the emotion guidance text and the semantic guidance text are spliced. The emotion guidance text can be Spliced behind the semantic guidance text, the semantic guidance text can also be spliced behind the emotional guidance text.

获取合并指导文本后，调用搜索引擎接口对合并指导文本进行检索，将检索到的前n条网页打开，并将前n条网页中的关联内容下载到本地，关联内容包括标题文本和内容文本，一方面，前n条网页中的关联内容与本实施例中的指导文本的关联性较强，另一方面可以减少数据处理量，在保证关联度的同时，提高数据处理效率。After obtaining the merge guidance text, call the search engine interface to retrieve the merge guidance text, open the first n web pages retrieved, and download the associated content in the first n web pages to the local. The associated content includes title text and content text. On the one hand, the related content in the first n web pages has a strong correlation with the guidance text in this embodiment. On the other hand, the amount of data processing can be reduced, and the data processing efficiency can be improved while ensuring the correlation.

将前n条网页的关联内容存储到本地存储空间后，构造关联内容集合，该关联内容集合形式为字典类型，将每条网页的标题文本和内容文本存到字典中，键为检索到网页的标题文本，值网页中的内容文本。After storing the associated content of the first n web pages into the local storage space, construct an associated content collection. The associated content collection is in the form of a dictionary. The title text and content text of each web page are stored in the dictionary, and the key is the retrieved web page. Title text, the content text in the web page.

当然，除了可以选择字典类型还可以选择其他方式，便于检索即可，本实施例在此不做具体限定。Of course, in addition to selecting the dictionary type, you can also select other methods to facilitate retrieval. This embodiment is not specifically limited here.

在一示例性实施例中，在关联内容集合中选择与语义指导文本和情绪指导文本的综合关联性最强的最优关联内容的过程包括：In an exemplary embodiment, the process of selecting the optimal related content with the strongest comprehensive correlation with the semantic guidance text and the emotion guidance text in the related content collection includes:

针对关联内容集合中的每条内容文本，基于内容文本与语义指导文本对应的语义关联得分以及内容文本与情绪指导文本对应的情绪关联得分，得到内容文本的综合得分；For each content text in the associated content collection, a comprehensive score of the content text is obtained based on the semantic association score corresponding to the content text and the semantic guidance text and the emotion association score corresponding to the content text and the emotion guidance text;

将包括综合得分最高的内容文本的关联内容确定为与语义指导文本和情绪指导文本的综合关联性最强的最优关联内容。The associated content including the content text with the highest comprehensive score is determined as the optimal associated content with the strongest comprehensive association with the semantic guidance text and the emotion guidance text.

在一示例性实施例中，基于内容文本与语义指导文本对应的语义关联得分以及内容文本与情绪指导文本对应的情绪关联得分，得到内容文本的综合得分的过程包括：In an exemplary embodiment, based on the semantic association score corresponding to the content text and the semantic guidance text and the emotion association score corresponding to the content text and the emotion guidance text, the process of obtaining the comprehensive score of the content text includes:

确定内容文本中与语义指导文本匹配的语义相关文本，以及语义相关文本中与情绪指导文本匹配的情绪相关文本；Determine the semantically relevant text in the content text that matches the semantic guidance text, and the emotion-related text in the semantically relevant text that matches the emotion guidance text;

基于语义相关文本确定内容文本的语义关联得分；determining a semantic relevance score of the content text based on semantically related text;

基于情绪相关文本确定内容文本的情绪关联得分；Determine the emotion relevance score of the content text based on the emotion-related text;

利用语义关联得分和情绪关联得分确定内容文本的综合得分。The semantic relevance score and sentiment relevance score are used to determine the comprehensive score of the content text.

在一示例性实施例中，基于语义相关文本确定内容文本的语义关联得分的过程包括：In an exemplary embodiment, the process of determining the semantic relevance score of the content text based on the semantically relevant text includes:

将语义相关文本的字符长度占内容文本的字符长度的比值确定为内容文本的语义关联得分；The ratio of the character length of the semantically related text to the character length of the content text is determined as the semantic association score of the content text;

基于情绪相关文本确定内容文本的情绪关联得分的过程包括：The process of determining the emotion relevance score of content text based on emotion-relevant text includes:

将情绪相关文本的字符长度占语义相关文本的字符长度的比值确定为内容文本的情绪关联得分。The ratio of the character length of the emotion-related text to the character length of the semantically-related text is determined as the emotion relevance score of the content text.

本实施例中，参照图2所示，遍历关联内容集合中的每条内容文本，对内容文本进行语义关联性打分，具体可以选择使用如T5、M6、ChatGPT等大模型进行语义关联性打分，得到语义关联得分，具体的，将内容文本中的每条文本与语义指导文本拼接后输入上述大模型中，判断二者是否匹配，将所有匹配的文本输出，记为语义相关文本，将语义相关文本的字符长度占关联文本内容字符长度的比值输出，记为语义关联得分，表示语义上和输入文本符合的文本占比。同时，对内容文本进行情绪关联性打分，具体可以选择使用如T5、M6、ChatGPT等大模型对语义相关文本进行情绪关联性打分，得到情绪关联得分，具体的，将语义相关文本中每条文本与情绪指导文本拼接后输入上述大模型中，判断二者是否匹配，若相配，将匹配的句子确定情绪相关文本，将所情绪相关文本的字符长度占语义相关文本的字符长度的比值输出，记为情绪关联得分，基于语义关联得分和情绪关联得分得到综合得分，根据综合得分确定最优关联内容，具体的，将综合得分最高的内容文本与该内容文本对应的标题文本拼接，得到与语义指导文本和情绪指导文本对应的最优关联内容。In this embodiment, as shown in Figure 2, each content text in the related content collection is traversed, and the content text is scored for semantic relevance. Specifically, you can choose to use large models such as T5, M6, ChatGPT, etc. for semantic relevance scoring. Obtain the semantic correlation score. Specifically, each text in the content text and the semantic guidance text are spliced and input into the above large model to determine whether they match. All matching texts are output and recorded as semantically relevant texts. The ratio of the character length of the text to the character length of the associated text content is output, recorded as the semantic association score, which represents the proportion of text that is semantically consistent with the input text. At the same time, the content text is scored for emotional relevance. Specifically, you can choose to use large models such as T5, M6, ChatGPT, etc. to score the emotional relevance of the semantically related text to obtain the emotional relevance score. Specifically, each text in the semantically related text is After splicing it with the emotion guidance text, it is input into the above large model to determine whether the two match. If they match, the matching sentence will be determined as the emotion-related text, and the ratio of the character length of the emotion-related text to the character length of the semantic-related text will be output. Record is the emotional correlation score. A comprehensive score is obtained based on the semantic correlation score and the emotional correlation score. The optimal correlation content is determined based on the comprehensive score. Specifically, the content text with the highest comprehensive score is spliced with the title text corresponding to the content text to obtain semantic guidance. Text and emotion guide the optimal associated content corresponding to the text.

在一示例性实施例中，利用语义关联得分和情绪关联得分确定内容文本的综合得分的过程包括：In an exemplary embodiment, the process of determining the comprehensive score of the content text using the semantic association score and the emotional association score includes:

将语义关联得分和情绪关联得分的乘积作为内容文本的综合得分。The product of the semantic relevance score and the emotional relevance score is used as the comprehensive score of the content text.

本实施例中，可将语义关联得分与情感关联得分相乘，得到关联性综合得分。作为另一种可选的实施例，也可将语义关联得分与情感关联得分相加，得到关联性综合得分，根据实际工程需要选择即可，本实施例在此不作具体限定。In this embodiment, the semantic association score and the sentiment association score may be multiplied to obtain a comprehensive association score. As another optional embodiment, the semantic association score and the sentiment association score may be added to obtain a comprehensive association score, which may be selected according to actual engineering needs, and is not specifically limited in this embodiment.

在一示例性实施例中，基于最优关联内容检索得到多个参考图像样本的过程包括：In an exemplary embodiment, the process of retrieving multiple reference image samples based on optimal associated content includes:

基于最优关联内容进行图像检索，得到多个候选图像样本；Perform image retrieval based on optimal associated content to obtain multiple candidate image samples;

利用情绪指导文本和语义指导文本在多个候选图像样本中筛选出多个参考图像样本。Emotional guidance text and semantic guidance text are used to screen out multiple reference image samples from multiple candidate image samples.

在一示例性实施例中，利用情绪指导文本和语义指导文本在多个候选图像样本中筛选出多个参考图像样本的过程包括：In an exemplary embodiment, the process of selecting a plurality of reference image samples from a plurality of candidate image samples using the emotion guidance text and the semantic guidance text includes:

提取每一候选图像样本的图像摘要文本；Extract the image summary text of each candidate image sample;

对输入文本和每一图像摘要文本进行图像元素互斥性计算，得到每一图像摘要文本的视觉得分，输入文本包括语义指导文本和情绪指导文本；Calculate the mutual exclusivity of image elements between the input text and each image summary text to obtain the visual score of each image summary text. The input text includes semantic guidance text and emotion guidance text;

将视觉得分超过预设值的候选图像样本确定为参考图像样本。Candidate image samples whose visual scores exceed a preset value are determined as reference image samples.

在一示例性实施例中，对输入文本和每一图像摘要文本进行图像元素互斥性计算，得到每一图像摘要文本的视觉得分的过程包括：In an exemplary embodiment, image element mutual exclusivity calculation is performed on the input text and each image summary text, and the process of obtaining the visual score of each image summary text includes:

提取每一图像摘要文本的第一实体元素和第一实体关系以及输入文本的第二实体元素和第二实体关系；Extracting a first entity element and a first entity relationship of each image summary text and a second entity element and a second entity relationship of the input text;

将第一实体元素中不存在与第二实体元素不同的实体元素且第一实体关系中不包括与第二实体关系不同的实体关系的图像摘要文本确定为候选摘要文本；Determine the image summary text in which no entity element different from the second entity element exists in the first entity elements and no entity relationship different from the second entity relationship exists in the first entity relationship as a candidate summary text;

计算每一候选摘要文本与输入文本的一致性描述得分，将一致性描述得分作为候选摘要文本的视觉得分。The consistency description score between each candidate summary text and the input text is calculated, and the consistency description score is used as the visual score of the candidate summary text.

本实施例中，参照图3所示，首先将最优关联内容使用第一预设模型进行关联摘要提取，输出关联摘要，该第一预设模型可使用ChatGPT等模型，目的是将其缩减为更精炼的语言文本，方便调用搜索引擎进行检索，基于关联摘要调用搜索引擎进行图像检索，得到多张候选图像样本，基于多张候选图像样本建立候选关联图像集合，对候选关联图像集合使用第二预设模型进行图像摘要文本提取，以获取候选关联图像集合中的每一候选图像样本的图像摘要文本，基于每一候选关联图像的图像摘要文本构建图像摘要集合，遍历图像摘要集合，基于输入文本对每条图像摘要文本进行图像元素互斥性计算，基于计算结果得到参考图像样本。In this embodiment, as shown in FIG. 3 , the optimal associated content is first extracted using a first preset model to extract the associated summary, and the associated summary is output. The first preset model can use a model such as ChatGPT, with the purpose of reducing it to More refined language text makes it easier to call the search engine for retrieval. Based on the association summary, the search engine is called for image retrieval, and multiple candidate image samples are obtained. A candidate associated image set is established based on multiple candidate image samples. The second method is used for the candidate associated image set. The preset model performs image summary text extraction to obtain the image summary text of each candidate image sample in the candidate associated image set, build an image summary set based on the image summary text of each candidate associated image, traverse the image summary set, and based on the input text The mutual exclusivity of image elements is calculated for each image summary text, and a reference image sample is obtained based on the calculation results.

图像元素互斥性计算包括提取输入文本和图像摘要文本的实体元素和实体关系，实体元素包括但不限于类别和属性，如“车”，“红色”等，实体关系如“人‘喂’狗”，将图像摘要文本中的实体元素和实体关系记为Az，将输入文本的实体元素和实体关系记为Ain，比对Az和Ain的包含关系，如果Az包含Ain以外的其他内容，则不合格，需要删除，否则保留，将保留的图像摘要文本确定为候选摘要文本，计算每条候选摘要文本与输入文本的CIDEr（Consensus-based Image Description Evaluation，基于共识的图像描述评分）得分，即一致性描述得分，也即视觉得分。并按照该一致性描述得分为所有候选摘要文本进行排序，将保留下来的候选摘要文本对应的候选图像样本收集起来，作为视觉强化样本，也即本实施例中的参考图像样本。Image element mutual exclusivity calculation includes extracting entity elements and entity relationships of input text and image summary text. Entity elements include but are not limited to categories and attributes, such as "car", "red", etc., and entity relationships such as "people 'feed' dogs ", record the entity elements and entity relationships in the image summary text as Az, record the entity elements and entity relationships in the input text as Ain, compare the inclusion relationship between Az and Ain, if Az contains other content other than Ain, it will not If it is qualified, it needs to be deleted, otherwise it will be retained. The retained image summary text will be determined as a candidate summary text. The CIDEr (Consensus-based Image Description Evaluation, consensus-based image description score) score of each candidate summary text and the input text will be calculated, that is, they are consistent. Sexual description score, also known as visual score. All candidate summary texts are sorted according to the consistency description score, and candidate image samples corresponding to the retained candidate summary texts are collected as visual enhancement samples, that is, reference image samples in this embodiment.

本实施例中，可以将视觉得分超过预设得分的候选图像样本确定为参考图像样本，也可以将按照视觉得分排序后的得分较高的前m个候选摘要文本对应的候选图像样本作为参考图像样本，m为正整数。In this embodiment, the candidate image samples whose visual scores exceed the preset score can be determined as reference image samples, or the candidate image samples corresponding to the top m candidate summary texts with higher scores sorted according to the visual scores can be used as reference images. Sample, m is a positive integer.

在一示例性实施例中，提取多个参考图像样本的特征，对所有特征中的至少两个特征进行组合得到多个图像组合语义特征的过程包括：In an exemplary embodiment, the process of extracting features of multiple reference image samples and combining at least two features among all features to obtain multiple image combination semantic features includes:

提取多个参考图像样本的特征；Extract features from multiple reference image samples;

对所有特征进行聚类，得到多个一级语义特征；All features are clustered to obtain multiple first-level semantic features;

根据一级语义特征的数量构造注意力掩码矩阵，Construct an attention mask matrix based on the number of first-level semantic features,

利用一级语义特征和注意力掩码矩阵得到多个图像组合语义特征。The first-level semantic features and attention mask matrix are used to obtain multiple image combined semantic features.

本实施例针对给定若干张可参考的参考图像样本，并实现对这些参考图像样本中的某些特定元素的参考，比如，让生成的关联图像中的实体元素更贴近所参考的若干张参考图像样本中的实体元素。基于此，本实施例首先提供了一种用于表征多张图像的特征的专用表示，即构建一种图像的语义特征分布森林结构，图像的语义特征分布森林结构由若干树特征组成，每棵树特征由两层节点组成，每个节点表示一个特征（如大小为[1，d]的向量，d为维度）。This embodiment is aimed at given several reference image samples that can be referred to, and realizes the reference to certain specific elements in these reference image samples, for example, making the entity elements in the generated associated images closer to the reference images. Solid elements in image samples. Based on this, this embodiment first provides a special representation for characterizing the features of multiple images, that is, constructing a semantic feature distribution forest structure of the image. The semantic feature distribution forest structure of the image is composed of several tree features. Each tree The tree feature consists of two levels of nodes, each node represents a feature (such as a vector of size [1, d], d is the dimension).

其次，对生成的关联图像而言，考虑到并非每一张参考图像样本都有帮助，对于有帮助的参考图像样本，也并非所有的特征都有帮助，因此，本实施例提供了一种基于注意力掩码矩阵的特征筛选机制，注意力掩码矩阵用来枚举所有可能的图像组合语义特征，以此来选择最佳的特征作为生成关联图像的参考特征。Secondly, for the generated associated images, considering that not every reference image sample is helpful, and for helpful reference image samples, not all features are helpful, this embodiment provides a feature screening mechanism based on an attention mask matrix. The attention mask matrix is used to enumerate all possible semantic features of image combinations, so as to select the best features as reference features for generating associated images.

下面分别对构建图像的语义特征分布森林结构和特征筛选机制进行说明。The following describes the semantic feature distribution forest structure and feature screening mechanism for constructing an image.

本实施例首先对多张参考图像样本进行特征提取，具体的，将多张参考图像样本输入到图像编码器中进行特征提取，得到大小为[N，j，d]的特征集，记为视觉强化样本特征集，N为参考图像样本的数量，j为每张参考图像样本提取出的特征的数量，d表示每个特征的维度。This embodiment first performs feature extraction on multiple reference image samples. Specifically, multiple reference image samples are input into the image encoder for feature extraction, and a feature set of size [N, j, d] is obtained, which is recorded as visual Enhanced sample feature set, N is the number of reference image samples, j is the number of features extracted from each reference image sample, and d represents the dimension of each feature.

使用聚类算法为视觉强化样本集中的每一个特征进行聚类，对所有特征进行聚类的过程包括：Use a clustering algorithm to cluster each feature in the visual reinforcement sample set. The process of clustering all features includes:

计算任意两个特征间的欧式距离；Calculate the Euclidean distance between any two features;

针对每一特征，确定欧式距离小于第一预设距离的数量，当数量不小于预设数量，将特征划分至密集特征子集，当数量小于预设数量，将特征划分至非密集特征子集；For each feature, determine the number by which the Euclidean distance is less than the first preset distance. When the number is not less than the preset number, the features are divided into dense feature subsets. When the number is less than the preset number, the features are divided into non-dense feature subsets. ;

确定一个子类，将密集特征子集中的任一个特征加入到子类并从密集特征子集中剔除；Determine a subclass, add any feature in the dense feature subset to the subclass and remove it from the dense feature subset;

计算子类中的所有特征与密集特征子集中的所有特征之间的最小欧式距离，判断密集特征子集中是否存在第一待剔除特征，若是，将第一待剔除特征加入到子类并从密集特征子集中剔除，重复本步骤，直至密集特征子集中不存在第一待剔除特征，第一待剔除特征为密集特征子集中与子类中的特征之间的最小欧式距离小于第二预设距离的特征；Calculate the minimum Euclidean distance between all features in the subclass and all features in the dense feature subset, and determine whether there is the first feature to be eliminated in the dense feature subset. If so, add the first feature to be eliminated to the subclass and remove it from the dense feature subset. Eliminate from the feature subset, repeat this step until there is no first feature to be eliminated in the dense feature subset, and the first feature to be eliminated is that the minimum Euclidean distance between the features in the dense feature subset and the subclass is less than the second preset distance Characteristics;

计算子类中的所有特征与非密集特征子集中的所有特征之间的最小欧式距离，确定非密集特征子集中是否存在第二待剔除特征，若是，将第二待剔除特征加入到子类并从非密集特征子集中剔除，重复本步骤，直至非密集特征子集中不存在第二待剔除特征，第二待剔除特征为非密集特征子集中与子类中的特征之间的最小欧式距离小于第二预设距离的特征；Calculate the minimum Euclidean distance between all features in the subclass and all features in the non-dense feature subset, determine whether there is a second feature to be eliminated in the non-dense feature subset, and if so, add the second feature to be eliminated to the subclass and Eliminate from the non-dense feature subset, and repeat this step until there is no second feature to be eliminated in the non-dense feature subset, and the second feature to be eliminated is that the minimum Euclidean distance between the features in the non-dense feature subset and the subclass is less than Characteristics of the second preset distance;

将子类加入到预设聚类集合中。Add subcategories to the preset cluster set.

将视觉强化样本特征集转化为[N×j，d]的大小，记L=N×j为特征的总个数，计算每两个特征之间的欧式距离，得到[L，L]的距离矩阵。The visual enhancement sample feature set is converted into the size of [N×j, d], where L=N×j is the total number of features. The Euclidean distance between every two features is calculated to obtain the distance matrix of [L, L].

获取预设数量和预设距离，预设距离包括第一预设距离和第二预设距离，第一预设距离和第二预设距离可以相同，本实施例中的第一预设距离和预设数量用于构造密集特征子集和非密集特征子集。A preset number and a preset distance are obtained, where the preset distance includes a first preset distance and a second preset distance, and the first preset distance and the second preset distance may be the same. In this embodiment, the first preset distance and the preset number are used to construct a dense feature subset and a non-dense feature subset.

针对特征集中的每一个特征，计算该特征与特征集中其他特征的欧式距离，确定计算得到的所有欧式距离小于第一预设距离的数量，判断该数量是否小于预设数量，若不小于预设数量，将该特征划分至密集特征子集中，设密集特征子集的大小为[M，d]，M≤L，并将未划分至密集特征子集中的特征划分至非密集特征子集中，设非密集特征子集的大小为[L-M，d]。For each feature in the feature set, calculate the Euclidean distance between the feature and other features in the feature set, determine the number by which all calculated Euclidean distances are less than the first preset distance, and determine whether the number is less than the preset number. If not, quantity, divide the feature into a dense feature subset, let the size of the dense feature subset be [M, d], M≤L, and divide the features that are not divided into the dense feature subset into a non-dense feature subset, let The size of the non-dense feature subset is [L-M, d].

构造一个预设聚类集合C（空集），当密集特征子集不为空时，遍历密集特征子集中的每一特征f，假设密集特征子集中包括f1，f2，f3，f4和f5。初始化一个新的子类C_b{f}，假设当前遍历到的特征为f1，则将f1从密集特征子集中删除，将f1划分至C_b中，对于当前密集特征子集中的特征即f2、f3、f4、f5，分别计算f2和f1之间的欧式距离，f3与f1之间的欧式距离，f4与f1之间的欧式距离，f5和f1之间的欧式距离，若仅有f2和f1之间的欧式距离小于第二预设距离，则将f2划分至C_b中，并将f2从当前密集特征子集中删除，然后对于当前密集特征子集中的所有特征与C_b中的所有特征计算其欧式距离，针对当前密集特征子集中的f3、f4、f5与C_b中的所有特征f1和f2，由于f3、f4、f5与f1之间的欧式距离在上一次已经计算过了，此处不再计算，然后分别计算f3与f2之间的欧式距离，f4与f2之间的欧式距离，f5与f2之间的欧式距离，判断当前密集特征子集中是否存在欧式距离小于第二预设距离的特征，假设f4与f2之间的欧式距离小于第二预设距离，则将f4划分至C_b中，并将f4从密集特征子集中删除，以此类推，重复上述过程，直至密集特征子集中再无任何一个特征可以被并入C_b。同理，遍历非密集特征子集中的所有特征，按照上述方式在非密集特征子集中选择与C_b中的特征的欧式距离小于第二预设距离的特征加入到C_b中，并将其从非密集特征子集中剔除，直至非密集特征子集中再无任何一个特征可以被并入C_b，将C_b加入到预设聚类集合中，输出C={C₁，C₂，…，C_s}，C_b为C₁至C_s中的任一个。Construct a preset clustering set C (empty set). When the dense feature subset is not empty, traverse each feature f in the dense feature subset. It is assumed that the dense feature subset includes f1, f2, f3, f4 and f5. Initialize a new subclass C _b {f}. Assume that the currently traversed feature is f1, then delete f1 from the dense feature subset, and divide f1 into C _b . For the features in the current dense feature subset, that is, f2, f3, f4, f5, respectively calculate the Euclidean distance between f2 and f1, the Euclidean distance between f3 and f1, the Euclidean distance between f4 and f1, the Euclidean distance between f5 and f1, if there are only f2 and f1 If the Euclidean distance between them is less than the second preset distance, then f2 is divided into C _b , and f2 is deleted from the current dense feature subset, and then all features in the current dense feature subset are calculated with all features in C _b The Euclidean distance is for all features f1 and f2 in f3, f4, f5 and C _b in the current dense feature subset. Since the Euclidean distance between f3, f4, f5 and f1 has been calculated last time, here No more calculations, and then calculate the Euclidean distance between f3 and f2, the Euclidean distance between f4 and f2, and the Euclidean distance between f5 and f2, and determine whether there is a Euclidean distance in the current dense feature subset that is smaller than the second preset distance. characteristics, assuming that the Euclidean distance between f4 and f2 is less than the second preset distance, then divide f4 into C _b , and delete f4 from the dense feature subset, and so on, repeat the above process until the dense feature subset There is no more feature in the set that can be incorporated into C _b . In the same way, traverse all the features in the non-dense feature subset, select the features in the non-dense feature subset whose Euclidean distance is less than the second preset distance from the features in C _b and add them to C _b in the above manner, and add them to C b Eliminate the non-dense feature subset until no feature in the non-dense feature subset can be incorporated into C _b . Add C _b to the preset clustering set and output C = {C ₁ , C ₂ ,..., C _s }, C _b is any one of C ₁ to C _s .

在一示例性实施例中，得到多个一级语义特征的过程包括：In an exemplary embodiment, the process of obtaining multiple first-level semantic features includes:

按照第二关系式对预设聚类集合中的所有子类计算其包括的所有特征的加权和，基于加权和得到一级语义特征；Calculating the weighted sum of all features included in all subclasses in the preset clustering set according to the second relational expression, and obtaining the primary semantic feature based on the weighted sum;

第二关系式为；The second relational expression is ;

其中，t为第b个子类中的特征的个数，f_b为第b个子类的加权和，f_t为对第b个子类遍历过程中第b个子类中的当前特征，f_p为遍历过程中的每一个特征，为第一预设距离或第二预设距离，dis（f_t,f_p）为f_t和f_p间的欧式距离，/>为所述第b个子类中满足/>的特征的数量。Among them, t is the number of features in the b-th sub-category, f _b is the weighted sum of the b-th sub-category, f _t is the current feature in the b-th sub-category during the traversal of the b-th sub-category, f _p is the traversal Every feature in the process, is the first preset distance or the second preset distance, dis(f _t , f _p ) is the Euclidean distance between f _t and f _p ,/> Satisfies/> for the b-th subcategory the number of features.

本实施例中，对于C中每一子类，按照第二关系式计算其包括的所有特征的加权和，其中，t表示子类C_b中特征的个数。In this embodiment, for each subcategory C, the weighted sum of all features included in it is calculated according to the second relational expression, where t represents the number of features in subcategory C _b .

在一示例性实施例中，对所有特征进行聚类，得到多个一级语义特征的过程包括：In an exemplary embodiment, the process of clustering all features to obtain multiple first-level semantic features includes:

对所有特征进行聚类，得到多个一级语义特征及每个一级语义特征下的二级语义特征；All features are clustered to obtain multiple first-level semantic features and second-level semantic features under each first-level semantic feature;

图像生成方法还包括：Image generation methods also include:

构建语义特征分布森林，语义特征分布森林包括多个树特征，每一树特征的树干特征为一级语义特征，每一树干特征的树枝特征为一级语义特征下的二级语义特征；Construct a semantic feature distribution forest. The semantic feature distribution forest includes multiple tree features. The trunk features of each tree feature are first-level semantic features, and the branch features of each trunk feature are second-level semantic features under the first-level semantic features;

基于与文本语义特征的相似度最高的图像组合语义特征生成关联图像的过程包括：The process of generating associated images based on the image with the highest similarity to the semantic features of the text combined with semantic features includes:

基于与文本语义特征的相似度最高的图像组合语义特征对应的注意力掩码矩阵的选择参数确定最优树特征；Determine the optimal tree feature based on the selection parameters of the attention mask matrix corresponding to the image combination semantic feature with the highest similarity to the text semantic feature;

利用最优树特征得到图像筛选特征；The image screening features are obtained using the optimal tree features;

基于图像筛选特征和文本语义特征生成关联图像。Generate associated images based on image screening features and text semantic features.

本实施例中，构造语义特征分布森林，其中，树特征的个数即为预设聚类集合中所有子类的个数，树特征的树干特征为一级语义特征，具体为所有f_b输出，每个树干特征的树枝特征则为其对应的子类里面的所有特征，即每一个C_b中存储的特征。示例性地，假设对多个参考图像样本的特征进行聚类后的得到的预设聚类集合C={C₁，C₂，C₃，C₄，C₅}，可以理解，以图4所示语义特征分布森林结构为例，图4中包括五棵树特征，第一棵树特征的一级语义特征为C₁中所有特征的加权和f_b1，第一棵树特征的二级语义特征为C₁中所有的特征（f_c1），第二棵树特征的一级语义特征为C₂中所有特征的加权和f_b2，第二棵树特征的二级语义特征为C₂中所有的特征（f_c2），第三棵树特征的一级语义特征为C₃中所有特征的加权和f_b3，第三棵树特征的二级语义特征为C₃中所有的特征（f_c3），第四棵树特征的一级语义特征为C₄中所有特征的加权和f_b4，第四棵树特征的二级语义特征为C₄中所有的特征（f_c4），第五棵树特征的一级语义特征为C₅中所有特征的加权和f_b5，第五棵树特征的二级语义特征为C₅中所有的特征（f_c5）。In this embodiment, a semantic feature distribution forest is constructed, in which the number of tree features is the number of all subclasses in the preset cluster set, and the trunk features of the tree features are first-level semantic features, specifically all f _b outputs , the branch features of each trunk feature are all the features in its corresponding subclass, that is, the features stored in each C _b . For example, assuming that the preset clustering set C={C ₁ , C ₂ , C ₃ , C ₄ , C ₅ } obtained after clustering the features of multiple reference image samples, it can be understood that as shown in Figure 4 The semantic feature distribution forest structure shown is an example. Figure 4 includes five tree features. The first-level semantic feature of the first tree feature is the weighted sum f _b1 of all features in C _1. The second-level semantic feature of the first tree feature is The features are all the features in C ₁ (f _c1 ). The first-level semantic features of the second tree feature are the weighted sum f _b2 of all the features in C ₂ . The second-level semantic features of the second tree feature are all the features in C ₂ features (f _c2 ), the first-level semantic feature of the third tree feature is the weighted sum of all features in C ₃ f _b3 , and the second-level semantic feature of the third tree feature is all the features in C ₃ (f _c3 ) , the first-level semantic feature of the fourth tree feature is the weighted sum f _b4 of all features in C ₄ , the second-level semantic feature of the fourth tree feature is all the features in C ₄ (f _c4 ), and the fifth tree feature The first-level semantic feature is the weighted sum of all features in C ₅ f _b5 , and the second-level semantic feature of the fifth tree feature is all the features in C ₅ (f _c5 ).

构造注意力掩码集合，根据语义特征分布森林中树特征的个数进行初始化：Construct an attention mask set and initialize it according to the number of tree features in the semantic feature distribution forest:

；其中，y为树特征的个数，结合图4所示的语义特征分布森林的结构，可以得到如图5所示的注意力掩码矩阵，注意力掩码矩阵中的每一列表示一种可行的特征组合，1表示该特征应该被选择，对于r=2，y=5时，有10种特征组合，见图5中的前10列，r=3，y=5时，有10种特征组合，见图5中的第11列至20列，r=4，y=5时，有5种特征组合，见图5中的第21列至25列，r=5，y=5时，有1种特征组合，见图5中的第26列。使用transformer等模型结构对各一级语义特征（记为g）进行特征提取，并依次使用注意力掩码集合中的各列注意力掩码按照第一关系式进行计算得到与各第一语义特征对应的多个图像组合语义特征，大小为[b，d]，第一关系式为 ; Among them, y is the number of tree features. Combined with the structure of the semantic feature distribution forest shown in Figure 4, the attention mask matrix shown in Figure 5 can be obtained. Each column in the attention mask matrix represents a Feasible feature combinations, 1 indicates that the feature should be selected. For r=2, y=5, there are 10 feature combinations. See the first 10 columns in Figure 5. When r=3, y=5, there are 10 combinations. Feature combinations, see columns 11 to 20 in Figure 5, when r=4, y=5, there are 5 feature combinations, see columns 21 to 25 in Figure 5, when r=5, y=5 , there is 1 feature combination, see column 26 in Figure 5. Use model structures such as transformer to extract features of each first-level semantic feature (denoted as g), and use each column of attention masks in the attention mask set to calculate according to the first relationship to obtain the corresponding first semantic features. The corresponding multiple image combination semantic features are of size [b, d], and the first relational expression is

； ;

其中，transformer为基于注意力机制的模型，g为所述一级语义特征，softmax为概率归一化函数，W_q为查询参数权重，W_k为链参数权重，W_v为值参数权重，Mask[:,k]为所述注意力掩码矩阵的第k列的选择参数，size（g）为一级语义特征的维度大小，为转置符号。使用文本编码器对输入指导文本进行编码，得到文本语义特征，大小为/>。Among them, transformer is a model based on the attention mechanism, g is the first-level semantic feature, softmax is the probability normalization function, _Wq is the query parameter weight, _Wk is the chain parameter weight, _Wv is the value parameter weight, Mask[:,k] is the selection parameter of the kth column of the attention mask matrix, size(g) is the dimension size of the first-level semantic feature, is the transposed symbol. Use the text encoder to encode the input guidance text to obtain the text semantic features, the size is/> .

计算文本语义特征与每一图像组合语义特征之间的三角相似度，确定三角相似度最高的图像组合语义特征，根据该图像组合语义特征在注意力掩码矩阵中的位置，得到最优组合，其中，最优组合用来描述聚类后哪些类的组合能最准确地表达文本中类似的语义，假设该图像组合语义特征在注意力掩码矩阵中的位置为第13列，则最优组合为1，2，5，也即最优组合为图4中的第一棵树特征、第二棵树特征及第五棵树特征，针对第一棵树特征随机挑选一个树枝特征，针对第二棵树特征随机挑选一个树枝特征，针对第五棵树特征随机挑选一个树枝特征，得到一个图像筛选特征，大小为[z，d]，本实施例中z=3。Calculate the triangular similarity between the text semantic features and each image combination semantic feature, determine the image combination semantic feature with the highest triangular similarity, and obtain the optimal combination based on the position of the image combination semantic feature in the attention mask matrix. Among them, the optimal combination is used to describe which combination of classes after clustering can most accurately express similar semantics in the text. Assume that the position of the semantic feature of the image combination in the attention mask matrix is the 13th column, then the optimal combination is 1, 2, 5, that is, the optimal combination is the first tree feature, the second tree feature and the fifth tree feature in Figure 4. Randomly select a branch feature for the first tree feature, and select a branch feature for the second tree feature. The tree feature randomly selects a branch feature, and randomly selects a branch feature for the fifth tree feature to obtain an image screening feature with a size of [z, d]. In this embodiment, z=3.

在一示例性实施例中，基于图像筛选特征和文本语义特征生成关联图像的过程包括：In an exemplary embodiment, the process of generating associated images based on image filtering features and text semantic features includes:

利用图像筛选特征得到条件噪声初始图像；Use image filtering features to obtain conditional noise initial images;

基于条件噪声初始图像和文本语义特征生成关联图像。Generate associated images based on conditional noise initial images and text semantic features.

本实施例中，将图像筛选特征进行复制，将维度变换为[h，w，z×d]的大小，其中前两维表示图像的高和宽，记为条件噪声初始图像，采用扩散生成模型，将条件噪声初始图像与文本语义特征共同输入其中，输出最终生成的关联图像。In this embodiment, the image filtering features are copied and the dimensions are transformed into the size of [h, w, z × d], where the first two dimensions represent the height and width of the image, which are recorded as the conditional noise initial image, and a diffusion generation model is used , input the conditional noise initial image and text semantic features together, and output the finally generated associated image.

综上，关联图像生成方案参照图6所示，包括将语义指导文本（大小为[1，l]）输入文本编码器得到文本语义特征（大小为[1，l，d]），将每一参考图像样本（大小为[N，h，w]）输入图像编码器提取该参考图像样本的特征，基于所有参考图像样本的特征构建特征集（大小为[N，j，d]），对特征集中的特征进行图像语义聚类，根据聚类结果构建语义特征分布森林，并根据语义特征分布森林中树特征的个数对注意力掩码矩阵初始化，以及根据语义特征分布森林和文本语义整体以及注意力掩码矩阵进行语义的相似度计算，得到图像筛选特征，基于图像筛选特征生成条件噪声图像，将条件噪声图像和文本语义特征输入扩散模型生成器，生成关联图像。In summary, the associated image generation scheme is shown in Figure 6, which includes inputting the semantic guidance text (size [1, l]) into the text encoder to obtain text semantic features (size [1, l, d]), and converting each A reference image sample (size [N, h, w]) is input to the image encoder to extract the features of the reference image sample, and a feature set (size [N, j, d]) is constructed based on the features of all reference image samples. The concentrated features are used for image semantic clustering, a semantic feature distribution forest is constructed based on the clustering results, and the attention mask matrix is initialized based on the number of tree features in the semantic feature distribution forest, and the semantic feature distribution forest and text semantics are integrated as a whole and The attention mask matrix performs semantic similarity calculation to obtain image screening features. Based on the image screening features, a conditional noise image is generated. The conditional noise image and text semantic features are input into the diffusion model generator to generate a related image.

第二方面，请参照图7，图7为本发明所提供的一种图像生成系统的结构示意图，包括：In the second aspect, please refer to Figure 7, which is a schematic structural diagram of an image generation system provided by the present invention, including:

获取模块11，用于获取语义指导文本和情绪指导文本；Acquisition module 11, used to obtain semantic guidance text and emotion guidance text;

检索模块12，用于基于语义指导文本和情绪指导文本检索得到多个参考图像样本；A retrieval module 12, configured to retrieve a plurality of reference image samples based on the semantic guidance text and the emotional guidance text;

提取模块13，用于提取多个参考图像样本的特征，对所有特征中的至少两个特征进行组合得到多个图像组合语义特征；The extraction module 13 is used to extract features of multiple reference image samples, and combine at least two features among all features to obtain multiple image combination semantic features;

生成模块14，用于获取语义指导文本对应的文本语义特征，基于与文本语义特征的相似度最高的图像组合语义特征生成关联图像。The generation module 14 is used to obtain text semantic features corresponding to the semantic guidance text, and generate associated images based on the image with the highest similarity to the text semantic features by combining the semantic features.

在一示例性实施例中，基于语义指导文本和情绪指导文本检索得到多个参考图像样本的过程包括：In an exemplary embodiment, the process of retrieving a plurality of reference image samples based on the semantic guidance text and the emotional guidance text includes:

对语义指导文本和情绪指导文本进行拼接，得到检索文本；Splice the semantic guidance text and the emotion guidance text to obtain the retrieval text;

在一示例性实施例中，根据检索到的前n条网页构建关联内容集合的过程包括：In an exemplary embodiment, the process of constructing a related content collection based on the first n retrieved web pages includes:

在本地存储空间以字典形式存储每条网页对应的标题文本和内容文本，得到关联内容集合；字典的键为标题文本，字典的值为内容文本。Store the title text and content text corresponding to each web page in the form of a dictionary in the local storage space to obtain an associated content collection; the key of the dictionary is the title text, and the value of the dictionary is the content text.

确定内容文本中与语义指导文本匹配的语义相关文本，以及语义相关文本中与情绪指导文本匹配的情绪相关文本；determining semantically relevant texts that match the semantically guiding texts in the content texts, and emotionally relevant texts that match the emotionally guiding texts in the semantically relevant texts;

基于语义相关文本确定内容文本的语义关联得分；Determine the semantic relevance score of the content text based on the semantically related text;

利用语义关联得分和情绪关联得分确定内容文本的综合得分。The semantic relevance score and the emotional relevance score are used to determine the overall score of the content text.

利用情绪指导文本和语义指导文本在多个候选图像样本中筛选出多个参考图像样本。Emotional guidance text and semantic guidance text are used to filter out multiple reference image samples from multiple candidate image samples.

在一示例性实施例中，利用情绪指导文本和语义指导文本在多个候选图像样本中筛选出多个参考图像样本的过程包括：In an exemplary embodiment, the process of filtering out multiple reference image samples from multiple candidate image samples using emotional guidance text and semantic guidance text includes:

对输入文本和每一图像摘要文本进行图像元素互斥性计算，得到每一图像摘要文本的视觉得分，输入文本包括语义指导文本和情绪指导文本；Calculate the mutual exclusivity of image elements for the input text and each image summary text to obtain a visual score for each image summary text, where the input text includes a semantic guidance text and an emotional guidance text;

提取每一图像摘要文本的第一实体元素和第一实体关系以及输入文本的第二实体元素和第二实体关系；Extract the first entity element and first entity relationship of each image summary text and the second entity element and second entity relationship of the input text;

将第一实体元素中不存在与第二实体元素不同的实体元素且第一实体关系中不包括与第二实体关系不同的实体关系的图像摘要文本确定为候选摘要文本；Determine the image summary text in which the first entity element does not have an entity element that is different from the second entity element and the first entity relationship does not include an entity relationship that is different from the second entity relationship as a candidate summary text;

计算每一候选摘要文本与输入文本的一致性描述得分，将一致性描述得分作为候选摘要文本的视觉得分。Calculate the consistency description score of each candidate summary text and the input text, and use the consistency description score as the visual score of the candidate summary text.

对所有特征进行聚类，得到多个一级语义特征；Cluster all features to obtain multiple first-level semantic features;

在一示例性实施例中，利用一级语义特征和注意力掩码矩阵得到多个图像组合语义特征的过程包括：In an exemplary embodiment, the process of obtaining a plurality of image combination semantic features using the primary semantic features and the attention mask matrix includes:

利用第一关系式得到多个图像组合语义特征，第一关系式为The first relational expression is used to obtain the semantic features of multiple image combinations. The first relational expression is

； ;

其中，transformer为基于注意力机制的模型，g为所述一级语义特征，softmax为概率归一化函数，W_q为查询参数权重，W_k为链参数权重，W_v为值参数权重，Mask[:,k]为所述注意力掩码矩阵的第k列的选择参数，size（g）为一级语义特征的维度大小，为转置符号。Among them, transformer is a model based on the attention mechanism, g is the first-level semantic feature, softmax is the probability normalization function, W _q is the query parameter weight, W _k is the chain parameter weight, W _v is the value parameter weight, Mask [:,k] is the selection parameter of the k-th column of the attention mask matrix, size (g) is the dimension size of the first-level semantic feature, is the transpose symbol.

图像生成系统还包括：The image generation system also includes:

构建模块，用于构建语义特征分布森林，语义特征分布森林包括多个树特征，每一树特征的树干特征为一级语义特征，每一树干特征的树枝特征为一级语义特征下的二级语义特征；Building module, used to build a semantic feature distribution forest. The semantic feature distribution forest includes multiple tree features. The trunk features of each tree feature are first-level semantic features, and the branch features of each trunk feature are second-level semantic features under the first-level semantic features. Semantic features;

基于与文本语义特征的相似度最高的图像组合语义特征生成关联图像的过程包括：The process of generating associated images based on the semantic features of the image with the highest similarity to the text semantic features includes:

利用最优树特征得到图像筛选特征；Use optimal tree features to obtain image screening features;

基于图像筛选特征和文本语义特征生成关联图像。Generate associated images based on image filtering features and text semantic features.

在一示例性实施例中，基于图像筛选特征和文本语义特征生成关联图像的过程包括：In an exemplary embodiment, the process of generating an associated image based on the image screening feature and the text semantic feature includes:

在一示例性实施例中，对所有特征进行聚类的过程包括：In an exemplary embodiment, the process of clustering all features includes:

在一示例性实施例中，得到多个一级语义特征的过程包括：In an exemplary embodiment, the process of obtaining a plurality of primary semantic features includes:

按照第二关系式对预设聚类集合中的所有子类计算其包括的所有特征的加权和，基于加权和得到一级语义特征；Calculate the weighted sum of all features included in all subcategories in the preset cluster set according to the second relational expression, and obtain the first-level semantic features based on the weighted sum;

所述第二关系式为；The second relational expression is ;

第三方面，参照图8所示，图8为本发明所提供的一种电子设备的结构示意图，该电子设备包括：In a third aspect, refer to FIG. 8 , which is a schematic structural diagram of an electronic device provided by the present invention. The electronic device includes:

存储器21，用于存储计算机程序；A memory 21, used for storing computer programs;

处理器22，用于执行计算机程序时实现如上文任意一个实施例所描述的图像生成方法的步骤。The processor 22 is configured to implement the steps of the image generation method described in any of the above embodiments when executing a computer program.

该电子设备还包括：The electronic device also includes:

输入接口23，经通信总线26与处理器22相连，用于获取外部导入的计算机程序、参数和指令，经处理器22控制保存至存储器21中。该输入接口可以与输入装置相连，接收用户手动输入的参数或指令。该输入装置可以是显示屏上覆盖的触摸层，也可以是终端外壳上设置的按键、轨迹球或触控板。The input interface 23 is connected to the processor 22 via the communication bus 26, and is used to obtain the computer programs, parameters and instructions imported from the outside, and save them in the memory 21 under the control of the processor 22. The input interface can be connected to an input device to receive parameters or instructions manually input by the user. The input device can be a touch layer covered on the display screen, or a key, trackball or touchpad set on the terminal housing.

显示单元24，经通信总线26与处理器22相连，用于显示处理器22发送的数据。该显示单元可以为液晶显示屏或者电子墨水显示屏等。The display unit 24 is connected to the processor 22 via the communication bus 26 and is used to display data sent by the processor 22. The display unit may be a liquid crystal display or an electronic ink display.

网络端口25，经通信总线26与处理器22相连，用于与外部各终端设备进行通信连接。该通信连接所采用的通信技术可以为有线通信技术或无线通信技术，如移动高清链接技术、通用串行总线、高清多媒体接口、无线保真技术、蓝牙通信技术、低功耗蓝牙通信技术、基于IEEE802.11s的通信技术等。The network port 25 is connected to the processor 22 via the communication bus 26, and is used to communicate with various external terminal devices. The communication technology used in the communication connection can be a wired communication technology or a wireless communication technology, such as mobile high-definition link technology, universal serial bus, high-definition multimedia interface, wireless fidelity technology, Bluetooth communication technology, low-power Bluetooth communication technology, communication technology based on IEEE802.11s, etc.

第四方面，请参照图9，图9为本发明所提供的一种计算机可读存储介质的结构示意图，计算机可读存储介质30上存储有计算机程序31，计算机程序31被处理器执行时实现如上文任意一个实施例所描述的图像生成方法的步骤。In the fourth aspect, please refer to Figure 9. Figure 9 is a schematic structural diagram of a computer-readable storage medium provided by the present invention. A computer program 31 is stored on the computer-readable storage medium 30. The computer program 31 is implemented when executed by a processor. The steps of the image generation method described in any of the above embodiments.

该计算机可读存储介质30可以包括：U盘、移动硬盘、只读存储器（Read-OnlyMemory，ROM）、随机存取存储器（Random Access Memory，RAM）、磁碟或者光盘等各种可以存储程序代码的介质。The computer-readable storage medium 30 may include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk, etc., which can store program codes. medium.

还需要说明的是，在本说明书中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的状况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should also be noted that in this specification, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that these entities or operations There is no such actual relationship or sequence between operations. Furthermore, the terms "comprises," "comprises," or any other variation thereof are intended to cover a non-exclusive inclusion such that a process, method, article, or apparatus that includes a list of elements includes not only those elements, but also those not expressly listed other elements, or elements inherent to the process, method, article or equipment. Without further limitation, an element defined by the statement "comprises a..." does not exclude the presence of additional identical elements in a process, method, article, or device that includes the stated element.

对所公开的实施例的上述说明，使本领域专业技术人员能够实现或使用本发明。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的，本文中所定义的一般原理可以在不脱离本发明的精神或范围的情况下，在其他实施例中实现。因此，本发明将不会被限制于本文所示的这些实施例，而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be practiced in other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An image generation method, characterized by comprising:

Get semantic guidance text and emotion guidance text;

Retrieve multiple reference image samples based on the semantic guidance text and the emotion guidance text;

Extract features of a plurality of reference image samples, and combine at least two of all the features to obtain multiple image combination semantic features;

Text semantic features corresponding to the semantic guidance text are obtained, and semantic features are combined based on the image with the highest similarity to the text semantic features to generate a related image.

2. The image generation method according to claim 1, wherein the process of retrieving multiple reference image samples based on the semantic guidance text and the emotion guidance text includes:

Perform web page retrieval based on the semantic guidance text and the emotion guidance text;

Construct an associated content set based on the first n retrieved web pages, the associated content set includes associated content corresponding to each of the web pages, the associated content includes the title text and content text of the web page, n is a positive integer;

Select the optimal related content from the related content set that has the strongest comprehensive correlation with the semantic guidance text and the emotional guidance text;

A plurality of reference image samples are retrieved based on the optimal associated content.

3. The image generation method according to claim 2, wherein the process of web page retrieval based on the semantic guidance text and the emotion guidance text includes:

Splice the semantic guidance text and the emotion guidance text to obtain the retrieval text;

The retrieval text is input into the search engine interface to perform web page retrieval on the retrieval text.

4. The image generation method according to claim 2, wherein the process of constructing the associated content collection based on the retrieved first n web pages includes:

Extract the title text and content text of the first n web pages retrieved;

The title text and content text corresponding to each web page are stored in the local storage space in the form of a dictionary to obtain an associated content collection; the key of the dictionary is the title text, and the value of the dictionary is the content text.

5. The image generation method according to claim 2, characterized by the process of selecting the optimal associated content with the strongest comprehensive correlation with the semantic guidance text and the emotion guidance text in the associated content set. include:

For each content text in the associated content set, based on the semantic association score corresponding to the content text and the semantic guidance text and the emotion association score corresponding to the content text and the emotion guidance text, the obtained content text is obtained. The comprehensive score of the content text;

The related content including the content text with the highest comprehensive score is determined as the optimal related content with the strongest comprehensive correlation with the semantic guidance text and the emotion guidance text.

6. The image generation method according to claim 5, characterized in that, based on the semantic association score corresponding to the content text and the semantic guidance text and the emotion association score corresponding to the content text and the emotion guidance text, The process of obtaining a comprehensive score for the content text includes:

Determining semantically related texts in the content text that match the semantic guidance text, and emotion-related texts in the semantically related texts that match the emotion guidance text;

determining a semantic relevance score of the content text based on the semantically relevant text;

Determining a sentiment relevance score of the content text based on the sentiment-related text;

A comprehensive score of the content text is determined using the semantic relevance score and the emotional relevance score.

7. The image generation method according to claim 6, wherein the process of determining the semantic association score of the content text based on the semantically related text includes:

Determine the ratio of the character length of the semantically related text to the character length of the content text as the semantic association score of the content text;

The process of determining the emotion correlation score of the content text based on the emotion-related text includes:

The ratio of the character length of the emotion-related text to the character length of the semantic-related text is determined as the emotion correlation score of the content text.

8. The image generation method according to claim 6, wherein the process of using the semantic association score and the emotional association score to determine the comprehensive score of the content text includes:

The product of the semantic association score and the emotion association score is used as the comprehensive score of the content text.

9. The image generation method according to claim 2, wherein the process of retrieving multiple reference image samples based on the optimal associated content includes:

Perform image retrieval based on the optimal associated content to obtain multiple candidate image samples;

A plurality of reference image samples are screened out from the plurality of candidate image samples using the emotion guidance text and the semantic guidance text.

10. The image generation method according to claim 9, characterized in that the process of using the emotional guidance text and the semantic guidance text to select multiple reference image samples from a plurality of candidate image samples includes:

Extracting image summary text for each candidate image sample;

Performing image element mutual exclusivity calculation on an input text and each of the image summary texts to obtain a visual score of each of the image summary texts, wherein the input text includes the semantic guidance text and the emotion guidance text;

Candidate image samples whose visual scores exceed a preset value are determined as reference image samples.

11. The image generation method according to claim 10, characterized in that the process of performing image element mutual exclusivity calculation on the input text and each image summary text, and obtaining the visual score of each image summary text includes: :

Extract the first entity element and the first entity relationship of each image summary text and the second entity element and the second entity relationship of the input text;

The image summary text in which the first entity element does not have an entity element different from the second entity element and the first entity relationship does not include an entity relationship different from the second entity relationship is determined as a candidate summary. text;

The consistency description score of each candidate summary text and the input text is calculated, and the consistency description score is used as the visual score of the candidate summary text.

12. The image generation method according to any one of claims 1 to 11, characterized in that features of a plurality of reference image samples are extracted, and at least two of all the features are combined to obtain a plurality of features. The process of combining semantic features of an image includes:

Extract features of a plurality of reference image samples;

Cluster all the features to obtain multiple first-level semantic features;

Construct an attention mask matrix according to the number of first-level semantic features,

Multiple image combined semantic features are obtained using the first-level semantic features and the attention mask matrix.

13. The image generation method according to claim 12, characterized in that the process of obtaining a plurality of image combination semantic features using the primary semantic features and the attention mask matrix comprises:

The first relational expression is used to obtain the semantic features of multiple image combinations, and the first relational expression is

;

Among them, transformer is a model based on the attention mechanism, g is the first-level semantic feature, softmax is the probability normalization function, W _q is the query parameter weight, W _k is the chain parameter weight, W _v is the value parameter weight, Mask [:,k] is the selection parameter of the k-th column of the attention mask matrix, size (g) is the dimension size of the first-level semantic feature, is the transposition symbol.

14. The image generation method according to claim 13, characterized in that the process of clustering all the features to obtain multiple first-level semantic features includes:

Clustering all the features to obtain a plurality of first-level semantic features and a second-level semantic feature under each first-level semantic feature;

The image generation method also includes:

Constructing a semantic feature distribution forest, wherein the semantic feature distribution forest includes a plurality of tree features, wherein a trunk feature of each tree feature is the primary semantic feature, and a branch feature of each trunk feature is a secondary semantic feature under the primary semantic feature;

The process of generating associated images based on the image with the highest similarity to the semantic features of the text combined with semantic features includes:

Determine optimal tree features based on the selection parameters of the attention mask matrix corresponding to the image combination semantic feature with the highest similarity to the text semantic feature;

Using the optimal tree features to obtain image screening features;

A related image is generated based on the image filtering features and the text semantic features.

15. The image generation method according to claim 14, characterized in that the process of generating the associated image based on the image screening feature and the text semantic feature comprises:

Using the image screening features to obtain a conditional noise initial image;

A correlation image is generated based on the conditional noise initial image and the text semantic features.

16. The image generation method according to claim 14, wherein the process of clustering all the features comprises:

Calculate the Euclidean distance between any two of the features;

For each feature, determine the number by which the Euclidean distance is less than the first preset distance. When the number is not less than the preset number, divide the features into dense feature subsets. When the number is less than the preset distance, Assuming a quantity, divide the features into non-dense feature subsets;

Determine a subclass, add any feature in the dense feature subset to the subclass and remove it from the dense feature subset;

Calculate the minimum Euclidean distance between all features in the subcategory and all features in the dense feature subset, determine whether there is a first feature to be eliminated in the dense feature subset, and if so, remove the first feature to be eliminated Features are added to the subcategory and eliminated from the dense feature subset. Repeat this step until the first feature to be eliminated does not exist in the dense feature subset, and the first feature to be eliminated is the dense feature. Features whose minimum Euclidean distance between the subset and features in the subcategory is less than the second preset distance;

Calculate the minimum Euclidean distance between all features in the subclass and all features in the non-dense feature subset, determine whether there is a second feature to be eliminated in the non-dense feature subset, and if so, add the second feature to be eliminated to the subclass and eliminate it from the non-dense feature subset, repeat this step until the second feature to be eliminated does not exist in the non-dense feature subset, and the second feature to be eliminated is a feature in the non-dense feature subset whose minimum Euclidean distance with the feature in the subclass is less than the second preset distance;

Add the subcategories to the preset cluster set.

17. The image generation method according to claim 16, characterized in that the process of obtaining multiple first-level semantic features includes:

Calculate the weighted sum of all the features included in the preset cluster set for all the subcategories in the preset cluster set according to the second relational expression, and obtain the first-level semantic feature based on the weighted sum;

The second relation is ;

Wherein, t is the number of features in the b-th subclass, _fb is the weighted sum of the b-th subclass, _ft is the current feature in the b-th subclass during the traversal of the b-th subclass, and _fp is each of the features during the traversal. is the first preset distance or the second preset distance, dis( _ft , _fp ) is the Euclidean distance between _ft and _fp , For the b-th subclass that satisfies/> The number of features.

18. An image generation system, characterized by comprising:

The acquisition module is used to obtain semantic guidance text and emotional guidance text;

A retrieval module configured to retrieve multiple reference image samples based on the semantic guidance text and the emotion guidance text;

An extraction module, configured to extract features of a plurality of reference image samples, and combine at least two of all the features to obtain multiple image combination semantic features;

A generation module is used to obtain text semantic features corresponding to the semantic guidance text, and generate associated images based on the image with the highest similarity to the text semantic features and the semantic features.

19. An electronic device, characterized in that it includes:

Memory for storing computer programs;

A processor, configured to implement the steps of the image generation method according to any one of claims 1-17 when executing the computer program.

20. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the image generation method according to any one of claims 1 to 17 are implemented.