CN115099235B - Text generation method based on entity description
- Publication number: CN115099235B (application CN202210520980.2A)
- Authority: CN (China)
- Prior art keywords: training, sample, model, data set, text generation
- Legal status: Active
Classifications
- G06F40/295: Handling natural language data; natural language analysis; recognition of textual entities; named entity recognition
- G06F40/30: Handling natural language data; semantic analysis
- G06N5/041: Computing arrangements using knowledge-based models; inference or reasoning models; abduction
Abstract
The present application proposes a text generation method based on entity description, relating to the field of natural language processing. The method includes: obtaining the to-be-processed set of a target information extraction task; running inference on the to-be-processed set with a pre-trained text generation model and decoding the inference results with beam search to obtain all possible paths, where the text generation model is produced by training a pre-trained model on the training set corresponding to the target information extraction task; screening all possible paths by probability to obtain candidate paths; and pruning the candidate paths according to the type of the target information extraction task to obtain the target information. With this scheme, the present application can complete various information extraction tasks well.
Description
Technical Field
The present application relates to the technical field of natural language processing, and in particular to a text generation method and device based on entity description.
Background Art
An information extraction task extracts specific events or entities from natural language text, so that large volumes of content can be automatically classified or mined. The extracted information includes entities, relations, events, and their categories. Subtasks include named entity recognition, relation extraction, entity linking, semantic role labeling, event extraction, and sentiment analysis.
Existing work on information extraction tasks mostly treats them as tagging or classification problems and designs a different prediction network for each specific task. Although these methods are effective, they ignore the rich label semantics and require extensive task-specific design, so no general framework can be built from them.
Summary of the Invention
The present application aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, the first objective of the present application is to propose a text generation method based on entity description, which solves the technical problem that existing methods lack generality and which completes various information extraction tasks well.
The second objective of the present application is to propose a text generation device based on entity description.
To achieve the above objectives, an embodiment of the first aspect of the present application proposes a text generation method based on entity description, including: obtaining the to-be-processed set of a target information extraction task; running inference on the to-be-processed set with a pre-trained text generation model and decoding the inference results with beam search to obtain all possible paths, where the text generation model is produced by training a pre-trained model on the training set corresponding to the target information extraction task; screening all possible paths by probability to obtain candidate paths; and pruning the candidate paths according to the type of the target information extraction task to obtain the target information.
With the text generation method based on entity description of the embodiments of the present application, a general text generation framework is built, and a different entity description is designed as the output sequence for each information extraction task. When a piece of text contains multiple tuples to be extracted, the more effective "sequence-to-path" formulation is used to avoid the ordering and independence problems among the multiple tuples of a generated sequence, so that various information extraction tasks can be completed well.
Optionally, in one embodiment of the present application, before the pre-trained text generation model is used to run inference on the to-be-processed set, the method further includes:
constructing an augmented data set, where the augmented data set includes positive samples and negative samples;
training a pre-trained model with the augmented data set to obtain the text generation model.
Optionally, in one embodiment of the present application, constructing the augmented data set includes:
obtaining the training set corresponding to the target task;
preprocessing the to-be-processed set and the training set to generate positive samples, where each piece of text in the training set corresponds to one output tuple and the number of positive samples equals the number of output tuples;
converting each output tuple into a natural language sentence as the output sequence, according to the type of the target task and the entity description format of the positive samples;
training a pre-trained model with the training set to generate a sample generation model;
running inference on the training set with the sample generation model and decoding the inference results with beam search to obtain all possible paths and prediction results, where a prediction result is a combination of possible paths;
taking the wrong parts of the prediction results as first negative samples, and randomly replacing elements of different tuples in the positive samples with them to generate second negative samples;
constructing the augmented data set from the positive samples, the first negative samples, and the second negative samples.
Optionally, in one embodiment of the present application, when the pre-trained model is trained with the augmented data set, the loss function of one input sample in the augmented data set is the average of the loss functions of its output sequences.
To achieve the above objectives, an embodiment of the second aspect of the present invention proposes a text generation device based on entity description, including an acquisition module, a path generation module, a path screening module, and a pruning module, wherein:
the acquisition module is configured to obtain the to-be-processed set of a target information extraction task;
the path generation module is configured to run inference on the to-be-processed set with a pre-trained text generation model and decode the inference results with beam search to obtain all possible paths, where the text generation model is produced by training a pre-trained model on the training set corresponding to the target information extraction task;
the path screening module is configured to screen all possible paths by probability to obtain candidate paths;
the pruning module is configured to prune the candidate paths according to the type of the target information extraction task to obtain the target information.
Optionally, in one embodiment of the present application, the device further includes a text generation model training module, specifically configured to:
construct an augmented data set, where the augmented data set includes positive samples and negative samples; and
train a pre-trained model with the augmented data set to obtain the text generation model.
Optionally, in one embodiment of the present application, constructing the augmented data set includes:
obtaining the training set corresponding to the target task;
preprocessing the to-be-processed set and the training set to generate positive samples, where each piece of text in the training set corresponds to one output tuple and the number of positive samples equals the number of output tuples;
converting each output tuple into a natural language sentence as the output sequence, according to the type of the target task and the entity description format of the positive samples;
training a pre-trained model with the training set to generate a sample generation model;
running inference on the training set with the sample generation model and decoding the inference results with beam search to obtain all possible paths and prediction results, where a prediction result is a combination of possible paths;
taking the wrong parts of the prediction results as first negative samples, and randomly replacing elements of different tuples in the positive samples with them to generate second negative samples;
constructing the augmented data set from the positive samples, the first negative samples, and the second negative samples.
Optionally, in one embodiment of the present application, when the pre-trained model is trained with the augmented data set, the loss function of one input sample in the augmented data set is the average of the loss functions of its output sequences.
Additional aspects and advantages of the present application will be set forth in part in the description below, and in part will become apparent from the description below or be learned through practice of the present application.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a text generation method based on entity description provided in Embodiment 1 of the present application;
FIG. 2 is an example diagram of the text generation method based on entity description according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a text generation device based on entity description provided in an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present application and should not be construed as limiting it.
To solve the problem that existing methods lack generality, we consider building a unified generation framework that can handle all kinds of information extraction problems. A generative model can make full use of the encoding of natural language, enrich the label semantics, and insert language labels into the target output. Moreover, such a unified generative model can adapt seamlessly to multiple tasks without introducing additional task-specific model designs.
Therefore, every kind of information extraction task can be transformed into a sequence-to-sequence text translation problem. The design of the output sequence, however, is crucial. Taking joint entity and relation extraction as an example, designing the output form as (subject, relation, object) matches the semantic order of natural language better than the form (subject, object, relation) and yields better extraction results. More generally, directly using a complete sentence as the output sequence to describe the elements to be extracted improves the results further.
Although the sequence-to-sequence generation framework has achieved good results on many data sets, it still has limitations: (1) the ordering problem: sentiment tuples are parallel and have no notion of order, yet the sequence-to-sequence method outputs them serially in a fixed order; (2) the independence problem: under the sequence-to-sequence formulation, each tuple to be extracted depends on the preceding tuples, whereas the tuples are in fact mutually independent and no such dependency exists. Therefore, for the entity-description output text, the present application adopts the "sequence-to-path" method.
In addition, to expand the data set, data augmentation with negative samples is used.
The text generation method and device based on entity description of the embodiments of the present application are described below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a text generation method based on entity description provided in Embodiment 1 of the present application.
As shown in FIG. 1, the text generation method based on entity description includes the following steps:
Step 101: obtain the to-be-processed set of a target information extraction task.
Step 102: run inference on the to-be-processed set with a pre-trained text generation model and decode the inference results with beam search to obtain all possible paths, where the text generation model is produced by training a pre-trained model on the training set corresponding to the target information extraction task.
Step 103: screen all possible paths by probability to obtain candidate paths.
Step 104: prune the candidate paths according to the type of the target information extraction task to obtain the target information.
With the text generation method based on entity description of the embodiments of the present application, a general text generation framework is built, and a different entity description is designed as the output sequence for each information extraction task. When a piece of text contains multiple tuples to be extracted, the more effective "sequence-to-path" formulation is used to avoid the ordering and independence problems among the multiple tuples of a generated sequence, so that various information extraction tasks can be completed well.
Optionally, in one embodiment of the present application, before the pre-trained text generation model is used to run inference on the to-be-processed set, the method further includes:
constructing an augmented data set, where the augmented data set includes positive samples and negative samples;
training a pre-trained model with the augmented data set to obtain the text generation model.
Optionally, in one embodiment of the present application, constructing the augmented data set includes:
obtaining the training set corresponding to the target task;
preprocessing the to-be-processed set and the training set to generate positive samples, where each piece of text in the training set corresponds to one output tuple and the number of positive samples equals the number of output tuples;
converting each output tuple into a natural language sentence as the output sequence, according to the type of the target task and the entity description format of the positive samples;
training a pre-trained model with the training set to generate a sample generation model;
running inference on the training set with the sample generation model and decoding the inference results with beam search to obtain all possible paths and prediction results, where a prediction result is a combination of possible paths;
taking the wrong parts of the prediction results as first negative samples, and randomly replacing elements of different tuples in the positive samples with them to generate second negative samples;
constructing the augmented data set from the positive samples, the first negative samples, and the second negative samples.
Optionally, in one embodiment of the present application, when the pre-trained model is trained with the augmented data set, the loss function of one input sample in the augmented data set is the average of the loss functions of its output sequences.
The text generation method based on entity description proposed by the present application applies this optimized design to multiple information extraction tasks, including entity recognition, relation extraction, entity linking, event extraction, semantic role labeling, and sentiment analysis. In the experiments, training was performed on a single Nvidia 1080Ti GPU, and the pre-trained model was the t5-base model from huggingface. The text generation method based on entity description of the present application is described below through a practical example, in conjunction with the text generation process of FIG. 2:
Step 1: Taking the extraction of sentiment triplets in sentiment analysis as an example, the conventional sequence-to-sequence tuple extraction method generates one sample per input text:
Input text → "(a1,o1,s1),(a2,o2,s2)"
The sequence-to-path method instead generates one sample per output tuple, so that each input text is paired with a single output tuple:
Input text → "(a1,o1,s1)"
Input text → "(a2,o2,s2)"
The tuples to be extracted for the different subtasks are shown in Table 1.
Table 1. Tuples to be extracted for each information extraction subtask
| Subtask | Tuple to be extracted |
|---|---|
| Named entity recognition | (<entity>, <category>) |
| Joint entity and relation extraction | (<entity1>, <relation>, <entity2>) |
| Semantic role labeling | (<argument>, <role>), for a given <predicate> |
| Event extraction | (<event trigger>, <event category>) and (<event argument>, <entity type>, <entity role>) |
| Entity linking | (<argument>), linked to a given <entity> |
| Sentiment analysis (AOPE / ASTE / TASD / UABSA / ACOS) | (a, o) / (a, o, s) / (a, c, o) / (a, s) / (a, c, o, s) |
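To make the sequence-to-path sample construction in step 1 concrete, the following is a minimal Python sketch that generates one positive sample per gold tuple (the record format and field names are illustrative assumptions, not taken from the patent):

```python
# Sketch: expand each (text, gold tuples) record into one training sample
# per tuple, as in the sequence-to-path setup of step 1.
def make_positive_samples(records):
    samples = []
    for text, tuples in records:
        for tup in tuples:  # one sample per output tuple
            samples.append({"input": text, "target": tup})
    return samples

records = [("Those rolls are big, but not good.",
            [("rolls", "big", "positive"), ("rolls", "not good", "negative")])]
print(make_positive_samples(records))
# -> two samples sharing the same input text, one per tuple
```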
Step 2: For the different subtasks, each output tuple is converted, as a positive sample, into a natural language sentence serving as the output sequence, in the following way:
(1) Named entity recognition task
The extraction target is the (<entity>, <category>) tuple in the text, where <category> is the category of the entity <entity>. The linearized natural sentences are as follows:
Positive sample: <entity> is a <category> entity.
Negative sample: <entity> is not a <category> entity.
(2) Joint entity and relation extraction task
The extraction target is the (<entity1>, <relation>, <entity2>) triple in the text, where <entity1> is the subject, <entity2> is the object, and <relation> is the relation between the two. The linearized natural sentences are as follows:
Positive sample: The relation between <entity1> and <entity2> is <relation>.
Negative sample: The relation between <entity1> and <entity2> is not <relation>.
(3) Semantic role labeling task
For a given predicate (<predicate>), the extraction target is the (<argument>, <role>) tuple in the text. The linearized natural sentences are as follows:
Positive sample: <argument> is the <role> of <predicate>.
Negative sample: <argument> is not the <role> of <predicate>.
(4) Event extraction task
A pipeline approach is adopted: the first step extracts the trigger, and the second step extracts the arguments.
① The target of trigger extraction is the (<event trigger>, <event category>) tuple. The linearized natural language sentences are as follows:
Positive sample: <event trigger> is a <event category> event.
Negative sample: <event trigger> is not a <event category> event.
② The target of argument extraction is the (<event argument>, <entity type>, <entity role>) triple. The linearized natural language sentences are as follows:
Positive sample: <event argument> is a <entity type> type, and it is the <entity role>.
Negative sample: <event argument> is not a <entity type> type, or it is not the <entity role>.
(5) Entity linking task
For a given entity <entity>, the extracted (<argument>) is linked to <entity>. The linearized natural language sentences are as follows:
Positive sample: <argument> refer to <entity>.
Negative sample: <argument> not refer to <entity>.
(6) Sentiment analysis tasks
① AOPE task
Positive sample: Fa(a) is Fo(o).
Negative sample: Fa(a) is not Fo(o).
② ASTE task
Positive sample: It is Fs(s) because Fa(a) is Fo(o).
Negative sample: It is Fs(s) not because Fa(a) is Fo(o).
③ TASD task
Positive sample: Fa(a) is a Fc(c) category and Fa(a) is Fo(o).
Negative sample: Fa(a) is not a Fc(c) category or Fa(a) is not Fo(o).
④ UABSA task
Positive sample: Fa(a) is Fs(s).
Negative sample: Fa(a) is not Fs(s).
⑤ ACOS task
Positive sample: Fc(c) is Fs(s) because Fa(a) is Fo(o).
Negative sample: Fc(c) is Fs(s) not because Fa(a) is Fo(o).
where Fa(a), Fo(o), Fs(s), and Fc(c) denote the natural language forms of the aspect term a, the opinion term o, the sentiment polarity s, and the aspect category c, respectively.
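As an illustration of the linearization in step 2, the following is a minimal Python sketch of the AOPE and ASTE templates; the concrete polarity wording used for Fs ({positive: "great", ...}) is an assumed choice, not taken from the patent:

```python
# Sketch of template linearization for the sentiment subtasks; the
# POLARITY mapping standing in for Fs is an assumption.
POLARITY = {"positive": "great", "negative": "bad", "neutral": "ok"}

def aope_positive(aspect: str, opinion: str) -> str:
    return f"{aspect} is {opinion}."

def aste_positive(aspect: str, opinion: str, sentiment: str) -> str:
    return f"It is {POLARITY[sentiment]} because {aspect} is {opinion}."

def aste_negative(aspect: str, opinion: str, sentiment: str) -> str:
    return f"It is {POLARITY[sentiment]} not because {aspect} is {opinion}."

print(aste_positive("rolls", "big", "positive"))
# -> "It is great because rolls is big."
```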
Step 3: Based on the pre-trained model T5, a text generation model Maug is trained on the training set for subsequent data augmentation.
Step 4: Use Maug to predict on the training set, construct negative samples D′2 from Maug's wrong prediction results, and then construct negative samples D′1 by randomly replacing elements of different tuples in the positive samples D. Construction of D′1: to improve the model's ability to match tuple elements, elements of the tuples are randomly replaced. Construction of D′2: to improve the model's ability to filter out wrong elements, Maug is first used to generate some wrong examples, which are added to the negative samples. The augmented data set is Daug = D ∪ D′ = D ∪ D′1 ∪ D′2. Taking sentiment triplet extraction as an example: the input text is "Those rolls are big, but not good and sashimi wasn't fresh.", and the tuples to be extracted are (rolls, big, positive), (rolls, not good, negative), (sashimi, wasn't fresh, negative); then a negative sample in D′1 could be (rolls, wasn't fresh, positive) or (sashimi, big, negative, false), and a negative sample in D′2 could be (sashimi, n't fresh, negative, false).
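A minimal sketch of the two negative-sample constructions is given below; the tuple layout (aspect, opinion, sentiment) and the bounded sampling loop are illustrative assumptions:

```python
# Sketch of step 4: D'_1 by random recombination of gold tuple elements,
# D'_2 from the augmentation model's wrong predictions.
import random

def build_d1(gold_tuples, n=2, seed=0):
    """Randomly recombine elements across gold tuples into mismatched tuples."""
    rng = random.Random(seed)
    gold, negatives = set(gold_tuples), set()
    for _ in range(100 * n):  # bounded number of attempts
        cand = tuple(rng.choice(gold_tuples)[i] for i in range(3))
        if cand not in gold:
            negatives.add(cand)
        if len(negatives) >= n:
            break
    return list(negatives)

def build_d2(predicted_tuples, gold_tuples):
    """Keep the augmentation model's wrong predictions as hard negatives."""
    gold = set(gold_tuples)
    return [t for t in predicted_tuples if t not in gold]

gold = [("rolls", "big", "positive"), ("rolls", "not good", "negative"),
        ("sashimi", "wasn't fresh", "negative")]
print(build_d1(gold))  # e.g. a mismatched tuple like (rolls, wasn't fresh, positive)
print(build_d2([("sashimi", "n't fresh", "negative")], gold))
```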
Step 5: Similarly to step 3, a T5-based model M is trained on Daug. The loss function of one input sample is the average of the loss functions of its output sequences, expressed as
L(x) = (1/m) · Σi Li(x),
where Li(x) denotes the sequence-to-sequence loss of the i-th of the m output sequences paired with the input sample x.
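The following is a sketch of this per-sample loss with the Hugging Face T5 interface; looping over the sequences one by one, instead of batching, is a simplification for clarity:

```python
# Sketch of step 5: the loss of one input sample is the mean of the
# seq2seq cross-entropy losses over its m paired output sequences.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def sample_loss(text, output_sequences):
    losses = []
    enc = tokenizer(text, return_tensors="pt")
    for target in output_sequences:
        labels = tokenizer(target, return_tensors="pt").input_ids
        # T5 computes token-level cross-entropy internally when labels are given.
        losses.append(model(**enc, labels=labels).loss)
    return torch.stack(losses).mean()  # L = (1/m) * sum_i L_i
```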
Step 6: Use the trained model M to run inference on the test set, apply beam search for decoding, and sort the paths in descending order of probability; the k paths with the highest probability are the candidate paths.
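A sketch of this decoding step with the `transformers` generate API follows; the checkpoint name and beam width k are assumptions, and in practice M's fine-tuned weights would be loaded:

```python
# Sketch of step 6: beam-search decoding that keeps the k most probable
# paths as the candidate set.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # load M in practice

def candidate_paths(text, k=8):
    enc = tokenizer(text, return_tensors="pt")
    out = model.generate(**enc, num_beams=k, num_return_sequences=k,
                         max_length=64, output_scores=True,
                         return_dict_in_generate=True)
    paths = tokenizer.batch_decode(out.sequences, skip_special_tokens=True)
    # Beams are returned sorted by score, i.e. in descending probability.
    return list(zip(paths, out.sequences_scores.tolist()))
```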
Step 7: For the different subtasks, prune the k paths to obtain the final result. During model inference, the decoded words should come not from the full vocabulary but from the original input text or a set of classification tokens. For example, in the ASTE sentiment analysis task, the search range of the sentiment polarity s is restricted to {positive, negative, neutral}. Denote the candidate word set by T, which consists of the words of the input text, T(x), and the classification tokens, T(task), that is,
T(x, task) = T(x) ∪ T(task)
In real model inference, however, the generated words do not all necessarily come from the candidate set T; differences such as singular/plural forms or capitalization may occur. Therefore, when a generated word e is not an element of T, the Levenshtein distance from e to every candidate element can be measured, and the candidate with the smallest distance replaces e. A simpler alternative is to directly restrict generation to words within T. Further, the elements to be generated can be divided into extraction-type elements, which come from T(x), and classification-type elements, which come from T(task), each with its own restricted range.
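The snapping rule can be sketched as follows; the representation of the candidate set is an assumption, and the Levenshtein function is a standard dynamic program:

```python
# Sketch of step 7: snap a generated word e that falls outside the
# candidate set T(x, task) = T(x) ∪ T(task) to its nearest candidate
# by Levenshtein (edit) distance.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def snap(word, input_words, task_labels):
    candidates = set(input_words) | set(task_labels)  # T(x) ∪ T(task)
    if word in candidates:
        return word
    return min(candidates, key=lambda c: levenshtein(word, c))

print(snap("Rolls", ["rolls", "big"], ["positive", "negative", "neutral"]))
# -> "rolls" (capitalization difference resolved by edit distance)
```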
In summary, the present application can effectively complete various information extraction tasks and has good generality. Table 2 gives the experimental results on a sentiment analysis (ACOS subtask) data set, and Table 3 gives the experimental results on a relation extraction data set, demonstrating that the method is effective.
Table 2. Sentiment analysis experimental results (taking the ACOS task as an example)
Table 3. Relation extraction experimental results
To implement the above embodiments, the present application further proposes a text generation device based on entity description.
FIG. 3 is a schematic structural diagram of a text generation device based on entity description provided in an embodiment of the present application.
As shown in FIG. 3, the text generation device based on entity description includes an acquisition module, a path generation module, a path screening module, and a pruning module, wherein:
the acquisition module is configured to obtain the to-be-processed set of a target information extraction task;
the path generation module is configured to run inference on the to-be-processed set with a pre-trained text generation model and decode the inference results with beam search to obtain all possible paths, where the text generation model is produced by training a pre-trained model on the training set corresponding to the target information extraction task;
the path screening module is configured to screen all possible paths by probability to obtain candidate paths;
the pruning module is configured to prune the candidate paths according to the type of the target information extraction task to obtain the target information.
Optionally, in one embodiment of the present application, the device further includes a text generation model training module, specifically configured to:
construct an augmented data set, where the augmented data set includes positive samples and negative samples; and
train a pre-trained model with the augmented data set to obtain the text generation model.
Optionally, in one embodiment of the present application, constructing the augmented data set includes:
obtaining the training set corresponding to the target task;
preprocessing the to-be-processed set and the training set to generate positive samples, where each piece of text in the training set corresponds to one output tuple and the number of positive samples equals the number of output tuples;
converting each output tuple into a natural language sentence as the output sequence, according to the type of the target task and the entity description format of the positive samples;
training a pre-trained model with the training set to generate a sample generation model;
running inference on the training set with the sample generation model and decoding the inference results with beam search to obtain all possible paths and prediction results, where a prediction result is a combination of possible paths;
taking the wrong parts of the prediction results as first negative samples, and randomly replacing elements of different tuples in the positive samples with them to generate second negative samples;
constructing the augmented data set from the positive samples, the first negative samples, and the second negative samples.
Optionally, in one embodiment of the present application, when the pre-trained model is trained with the augmented data set, the loss function of one input sample in the augmented data set is the average of the loss functions of its output sequences.
It should be noted that the foregoing explanation of the embodiment of the text generation method based on entity description also applies to the text generation device based on entity description of this embodiment, and is not repeated here.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification, and the features thereof.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features concerned. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, for example two or three, unless expressly and specifically defined otherwise.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing the steps of a custom logical function or process, and the scope of the preferred embodiments of the present application includes alternative implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program may be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then be stored in a computer memory.
It should be understood that the parts of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following technologies known in the art may be used: a discrete logic circuit having logic gates for implementing logical functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
A person of ordinary skill in the art will understand that all or part of the steps carried by the method of the above embodiments may be completed by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium and which, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present application have been shown and described above, it will be understood that the above embodiments are exemplary and are not to be construed as limiting the present application; a person of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present application.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210520980.2A | 2022-05-13 | 2022-05-13 | Text generation method based on entity description |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115099235A | 2022-09-23 |
| CN115099235B | 2024-11-08 |
Family ID: 83287612
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210520980.2A | Text generation method based on entity description | 2022-05-13 | 2022-05-13 |
Families Citing this family (3)
| Publication Number | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| CN116028812B * | 2022-12-30 | 2025-07-01 | 北京中科智加科技有限公司 | A method for constructing a pipeline multi-event extraction model |
| CN117764062B * | 2023-12-07 | 2025-02-14 | 北京中科闻歌科技股份有限公司 | A unified information extraction method, medium and device based on large language model |
| CN119647605B * | 2025-02-19 | 2025-06-03 | 鹏城实验室 | Text reasoning method based on online structure pruning and related equipment |
Citations (2)
| Publication Number | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| CN112632996A * | 2020-12-08 | 2021-04-09 | 浙江大学 | Entity relation triple extraction method based on comparative learning |
| CN113312919A * | 2021-06-15 | 2021-08-27 | 广东工业大学 | Method and device for generating text of knowledge graph |
Family Cites Families (1)
| Publication Number | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| CN110968761B * | 2019-11-29 | 2022-07-08 | 福州大学 | A method for adaptive extraction of web page structured data |
Also Published As
| Publication Number | Publication Date |
|---|---|
| CN115099235A | 2022-09-23 |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |