CN109472033B

CN109472033B - Method and system for extracting entity relationship in text, storage medium and electronic equipment

Info

Publication number: CN109472033B
Application number: CN201811376209.2A
Authority: CN
Inventors: 蒋运承; 瞿荣; 朱星图; 郑一东; 马文俊; 詹捷宇; 刘宇东
Original assignee: South China Normal University
Current assignee: South China Normal University
Priority date: 2018-11-19
Filing date: 2018-11-19
Publication date: 2022-12-06
Anticipated expiration: 2038-11-19
Also published as: CN109472033A

Abstract

The invention relates to a method and a system for extracting entity relationships in text, a storage medium and electronic equipment. The method comprises the following steps: acquiring an entity triple relation set, an entity and entity attribute set, and a concept set; acquiring a set of triples, each consisting of a sentence of a training text set and the two entities identified in that sentence; carrying out remote supervision and labeling to obtain a relation set in which each record comprises a sentence of the training text set, the two entities identified in the sentence, the concepts respectively corresponding to the two entities and the relationship between the two entities; inputting sentence vectors into an entity relationship extraction model and training the model; and acquiring, for each sentence of a text set to be extracted, a relation set comprising the two entities, the concepts respectively corresponding to the two entities and the relationship between the two entities. The method extracts the relationship between entities by using the semantic context information in the text, thereby solving the problem of wrong labeling in the remote supervision process.

Description

Method and system for extracting entity relationship in text, storage medium and electronic equipment

Technical Field

The invention relates to the technical field of text processing and information extraction, in particular to a method and a system for extracting entity relations in texts, a storage medium and electronic equipment.

Background

In the past, people have built large-scale knowledge bases such as Wikipedia and DBpedia from real-world knowledge. These knowledge bases are widely used in artificial intelligence and natural language processing, for example in question-answering systems and information extraction. A knowledge base contains a large number of triple facts, such as (New York, CityOf, United States), which represents the fact that "New York is a city in the United States". However, the facts contained in existing knowledge bases are limited and far from complete, and new facts are generated every day. How to label new facts to supplement the knowledge base has become a difficult problem that urgently needs to be solved. Labeling fact triples manually is time-consuming and labor-intensive, so much research has shifted its focus to automatically labeling new facts from complex and diverse internet resources. Extracting entity relationships from large amounts of text is the core task of this work. Although existing methods for extracting entity relationships from text can achieve good results with the help of a remote supervision mechanism (commonly called distant supervision), the assumption behind remote supervision causes wrong labeling. The reason is that remote supervision assumes there is only one relationship between a pair of entities, and that every sentence in which the pair appears expresses that relationship. In fact, when two entities appear in the same sentence, the sentence may not express the relationship established in the knowledge base; it may express another relationship, or the two entities may merely share a common topic. Which is the case must be determined from the semantic context of the sentence.

Disclosure of Invention

Based on this, the present invention provides a method for extracting entity relationships in a text, which extracts relationships between entities by using semantic context information in the text, thereby fundamentally solving the problem of wrong labeling in the remote supervision process.

The invention is realized by the following scheme:

a method for extracting entity relation in text comprises the following steps:

acquiring an entity triple relation set, acquiring an entity and an entity attribute set, and acquiring a concept set;

acquiring a sentence of a training text set and a triple relation set of two entities identified in the sentence;

according to the entity triple relation set, the entity and entity attribute set and the concept set, carrying out remote supervision and labeling on the sentence of the training text set and the triple relation set of the two entities identified in the sentence to obtain the sentence comprising the training text set, the two entities identified in the sentence, concepts respectively corresponding to the two entities and the relationship set of the two entities, and putting the relationship set into a labeling training set;

acquiring vector representation of words in sentences in a training text set according to the labeled training set;

obtaining a sentence vector of each sentence in the training text set according to the vector representation of the words in the sentence;

inputting a sentence vector of each sentence of a training text set into an entity relationship extraction model, and training the entity relationship extraction model according to two entities marked in the sentence, concepts respectively corresponding to the two entities marked in the sentence and the relationship between the two entities marked in the sentence;

obtaining a sentence vector of each sentence in a text set to be extracted;

and inputting the sentence vector of each sentence in the text set to be extracted into the entity relation extraction model, and acquiring, for each sentence in the text set to be extracted, a relation set comprising the two entities, the concepts respectively corresponding to the two entities and the relationship between the two entities.

The method for extracting the entity relationship in the text provided by the invention has the advantages that the concept range of the entity in the context represents semantic context information, the entity relationship training set of multi-concept and multi-relationship is obtained according to the concept range, and the entity relationship extraction model is constructed according to the training set, so that the problem of wrong labeling in the remote supervision process is fundamentally solved.

In one embodiment, the performing remote supervision and annotation on the sentence in the training text set and the triple relationship set of two entities identified in the sentence according to the entity triple relationship set, the entity and entity attribute set, and the concept set includes:

and carrying out context recognition on the sentences of the training text set to obtain concepts corresponding to the two entities recognized by the sentences respectively.

In one embodiment, after performing context recognition on the sentences in the training text set and obtaining concepts corresponding to the two entities recognized by the sentences, the method further includes the following steps:

matching two entities identified in the sentences of the training text set with the entity triple relation set;

if the matching fails, randomly extracting a relation from the entity triple relation set, generating a relation set comprising the sentence, the two marked entities, the concepts respectively corresponding to the two marked entities and the randomly extracted relation, and putting this data set into the labeling training set as a negative sample.

In one embodiment, the method further comprises the following steps:

if the matching is successful, generating a relation set comprising the sentence, the two marked entities, the concepts respectively corresponding to the two marked entities and the matched relation, and scoring the confidence of the matched relation; if the score exceeds a first set threshold, putting the data set into the labeling training set as a positive sample, and if the score is lower than the first set threshold, putting the data set into the labeling training set as a negative sample.

In one embodiment, the confidence scoring of the matched relationship comprises:

acquiring the degree of correlation between the matched relation and the context in the sentence according to how often the relation and the context of the sentence appear together in the corpus, wherein the higher the degree of correlation, the higher the confidence score.

In one embodiment, the method further comprises the following steps:

acquiring a plurality of relation sets with the same concept in the generated relation sets;

judging the context correlation degree of each relation and sentence in the plurality of relation sets;

and replacing the relationship in each of the plurality of relationship sets with the relationship having the maximum degree of correlation, as a new relationship.

In one embodiment, after replacing the relationship with the largest degree of correlation into the plurality of relationship sets as a new relationship, the method further includes the following steps:

deleting the plurality of relationship sets in the labeling training set;

placing the plurality of relationship sets including new relationships into the annotation training set.

Further, the present invention also provides a system for extracting entity relationships in a text, including:

the first acquisition module is used for acquiring the entity triple relation set, acquiring the entity and the entity attribute set and acquiring the concept set;

the second acquisition module is used for acquiring a sentence of the training text set and a triple relation set of two entities identified in the sentence;

the remote supervision and labeling module is used for carrying out remote supervision and labeling on the sentence of the training text set and the triple relation sets of the two entities identified in the sentence according to the entity triple relation set, the entity and entity attribute set and the concept set, acquiring the sentence comprising the training text set, the two entities identified in the sentence, concepts respectively corresponding to the two entities and the relation sets of the two entities, and putting the relation sets into a labeling training set;

the representation input module is used for acquiring vector representation of words in the sentence of the training text set according to the labeled training set;

the first sentence expression module is used for acquiring a sentence vector of each sentence in the training text set according to the vector expression of the words in the sentence;

the entity relationship extraction model training module is used for inputting the sentence vector of each sentence in the training text set into the entity relationship extraction model, and training the entity relationship extraction model according to the two entities marked in the sentence, the concepts respectively corresponding to the two entities marked in the sentence and the relationship between the two entities marked in the sentence;

the second sentence expression module is used for acquiring a sentence vector of each sentence in the text set to be extracted;

and the entity relationship extraction module is used for inputting the sentence vector of each sentence of the text set to be extracted into the entity relationship extraction model, and acquiring, for each sentence of the text set to be extracted, a relationship set comprising the two entities, the concepts respectively corresponding to the two entities and the relationship between the two entities.

The method for extracting the entity relationship in the text provided by the invention has the advantages that the concept range of the entity in the context represents semantic context information, the entity relationship training set with multiple concepts and multiple relationships is obtained according to the concept range, and the entity relationship extraction model is constructed according to the training set, so that the problem of wrong labeling in the remote supervision process is fundamentally solved.

Further, the present invention also provides a computer readable medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the entity relationship extraction method in the text as described in any one of the above embodiments.

Further, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable by the processor, and when the processor executes the computer program, the processor implements the entity relationship extraction method in the text as described in any of the above embodiments.

For a better understanding and practice, the present invention is described in detail below with reference to the accompanying drawings.

Drawings

FIG. 1 is a flowchart illustrating a method for extracting entity relationships in a text according to an embodiment;

FIG. 2 is a diagram illustrating a process of matching relationships in a training text set according to an embodiment;

FIG. 3 is a schematic diagram illustrating a remote supervision labeling process in one embodiment;

FIG. 4 is a flow chart illustrating the process of modifying the annotation result according to an embodiment;

FIG. 5 is a diagram of an entity relationship extraction model in one embodiment;

FIG. 6 is a block diagram of an embodiment of a system for extracting entity relationships from text;

FIG. 7 is a schematic structural diagram of an electronic device in an embodiment.

Detailed Description

Referring to fig. 1, in an embodiment, the method for extracting entity relationships in the text of the present invention includes the following steps:

Step S101: acquire an entity triple relation set, an entity and entity attribute set, and a concept set.

In this embodiment, Freebase is selected as the basic knowledge base. Freebase is a large-scale knowledge graph containing 7300 diverse relationships and over 9 million entities. The Resource Description Framework (RDF) triples (entity 1, relationship, entity 2) in Freebase are sorted and stored in the computer as the entity triple relation set of this embodiment, denoted R, which includes triples such as (New York, CityOf, United States). In addition, the entities in Freebase and their attribute information are sorted and stored in the computer as the entity and entity attribute set of this embodiment, denoted E; each entity may have zero or more attributes.

The scheme of this embodiment involves building and using a multi-concept knowledge base, so a concept dictionary must be prepared. A concept here means the concept category to which an entity belongs, judged according to its context. The knowledge graph contains millions of concepts, so it is used as the data source of the concept dictionary in this embodiment. The entities involved in each relationship in the relation set R, together with their corresponding concepts, are organized and stored in the computer as the set of entities and their possible concepts of this embodiment, denoted C. An entity may have one or more concepts, for example (IBM, company; corporation; client; organization; vendor; supplier; …).
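As a minimal sketch of how the three inputs R, E and C might be held in memory: the Python container types, the attribute values and the concepts listed for New York and United States are assumptions for illustration, since the text only states that the data are organized and stored in the computer.

```python
# Entity triple relation set R: (entity 1, relation, entity 2).
R = {
    ("New York", "CityOf", "United States"),
}

# Entity and entity attribute set E: each entity maps to zero or more attributes.
# The attribute names and values are illustrative placeholders.
E = {
    "New York": {"type": "city"},
    "United States": {"type": "country"},
    "IBM": {},
}

# Concept set C: each entity maps to one or more candidate concepts.
# Only the IBM entry comes from the text; the others are assumed.
C = {
    "IBM": ["company", "corporation", "client", "organization", "vendor", "supplier"],
    "New York": ["city"],
    "United States": ["country"],
}


def relations_between(e1, e2, triples=R):
    """Return every relation r such that (e1, r, e2) is in the triple set."""
    return sorted(r for (head, r, tail) in triples if head == e1 and tail == e2)
```

Keeping R as a set of tuples makes the later matching step a simple membership scan; an index keyed by entity pair would be the natural next step for a knowledge base of realistic size.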

Step S102: acquire the sentences of the training text set and the triple relation set of each sentence and the two entities identified in it.

This embodiment uses the New York Times news text set as the training text set D. For each news document d in the training text set D, the start and end of each sentence s are identified by punctuation marks, and the document is divided into sentences. To perform the entity relationship extraction task, the entities in s must be identified; in the scheme of the invention, the existing natural language processing tool StanfordNLP is used for named entity recognition. If the number of entities identified in s is not equal to 2, or an identified entity is not in the set E, the sentence is considered invalid and discarded. Each sentence s that meets the conditions, together with its two recognized entities e₁ and e₂, forms a triple (s, e₁, e₂) that is stored in the computer. The resulting triple relation set of sentences and the two entities identified in them is denoted SE, and may include, for example, (New York is the most popular city in the United States, New York, United States).
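The construction of SE can be sketched as follows. This is only an illustration: the function names are hypothetical, and a plain string lookup against the entity names in E stands in for the named entity recognizer, which the embodiment delegates to an existing tool.

```python
import re


def split_sentences(document):
    """Split a document into sentences at sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def find_entities(sentence, known_entities):
    """Stand-in recognizer: return the known entity names that occur in the
    sentence, ordered by the position of their first occurrence."""
    found = []
    for name in known_entities:
        pos = sentence.find(name)
        if pos >= 0:
            found.append((pos, name))
    return [name for _, name in sorted(found)]


def build_se(documents, known_entities):
    """Keep only sentences in which exactly two entities from E are found."""
    se = []
    for doc in documents:
        for s in split_sentences(doc):
            entities = find_entities(s, known_entities)
            if len(entities) == 2:
                se.append((s, entities[0], entities[1]))
    return se
```

Because the stand-in recognizer only looks up names from E, the condition that both entities belong to E holds by construction; with a real recognizer that check would be a separate filter.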

Step S103: perform remote supervision and labeling on the sentence of the training text set and the triple relation set of the two entities identified in the sentence according to the entity triple relation set, the entity and entity attribute set and the concept set, to obtain a relation set comprising the sentence of the training text set, the two entities identified in the sentence, the concepts respectively corresponding to the two entities and the relationship of the two entities, and put the relation set into a labeling training set.

The inputs for remote supervision include the entity triple relation set R, the entity and entity attribute set E, and the concept set C. Remote supervision and labeling are carried out on the set SE through three operations performed in sequence: concept recognition, remote supervision, and relation confidence scoring. The result is a relation set (s, (e₁, c₁, r₁, c₂, e₂)) comprising a sentence s of the training text set, the two entities e₁ and e₂ identified in the sentence, the concepts c₁ and c₂ respectively corresponding to the two entities, and the relationship r₁ of the two entities; this relation set is put into the labeling training set. The quintuple relation (e₁, c₁, r₁, c₂, e₂) is also obtained and put into the knowledge base KB to be built.

Step S104: acquire the vector representation of the words in the sentences of the training text set according to the labeling training set.

In this step, the input includes the labeling training set T_train and a Wikipedia text corpus, and the output is a vector representation of the words.

Representing every word that appears in the labeling training set T_train requires two operations: 1) representing each word with a word vector, and 2) enhancing the word vector with the positional relationship between the word and the two entities in the sentence. To compute the word vectors, a vocabulary must first be determined. In the scheme of the invention, the words that appear more than 100 times in Wikipedia are stored and together form the vocabulary. The open-source word2vec tool is then trained on the context information in the Wikipedia text corpus to obtain a word vector for each word, which is stored in the computer; the set containing the words and their corresponding word vectors is denoted W. The dimension of the word vector and the size of the context window are configurable; to ensure computational efficiency, this embodiment sets the dimension to 50 and the window size to 3. Suppose there is a training sample (s, (e₁, c₁, r₁, c₂, e₂)) whose sentence s = {w₁, w₂, …, w_n} contains n words, two of which correspond to the entities e₁ and e₂. First, the word vector v of each word is obtained by querying the set W. Then the distances dist₁ and dist₂ from each word to the entities e₁ and e₂ are recorded and appended to the end of v, forming a 52-dimensional word vector. Finally, the processed word vector sequence (v₁, v₂, …, v_n) is used as the input for encoding the vector of sentence s.
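The position augmentation described above can be sketched in a few lines. The following is an illustration under stated assumptions: the distance is taken as a signed word offset, out-of-vocabulary words receive a zero vector, and the lookup table passed as `word_vectors` stands in for the set W produced by word2vec. None of these details is fixed by the text.

```python
def position_augmented_vectors(words, e1_index, e2_index, word_vectors, dim=50):
    """Append the offsets to both entities to each word vector (50 -> 52 dims)."""
    unknown = [0.0] * dim  # assumption: zero vector for out-of-vocabulary words
    sequence = []
    for i, w in enumerate(words):
        v = list(word_vectors.get(w, unknown))
        dist1 = float(i - e1_index)  # signed offset to entity e1 (assumed form)
        dist2 = float(i - e2_index)  # signed offset to entity e2 (assumed form)
        sequence.append(v + [dist1, dist2])
    return sequence
```

For a five-word sentence whose entities sit at positions 0 and 4, the middle word receives the offsets 2 and -2, so the model can tell on which side of each entity the word lies.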

Step S105: acquire a sentence vector of each sentence in the training text set according to the vector representation of the words in the sentence.

In this step, the input is the word vectors of the words in each sentence of the samples in the labeling training set T_train, and the output is a sentence vector for each sentence.

Because every word in a sentence may carry feature information that matters for the entity relationship extraction task, the feature information of all words in the sentence must be integrated to represent the sentence jointly, in preparation for extracting the entity relationship from the sentence. The word vector of each word was obtained in step S104; the features in the word vectors of the sentence now need to be extracted. There are many ways to extract features, and the scheme of the invention adopts a convolutional neural network (CNN). Specifically, a piecewise convolutional neural network (PCNN) is used, which can make effective use of the position information of the two entities in the sentence. The PCNN process consists of three main steps: 1) convolution, for which the stride and filter size must be set; 2) max pooling, in which the sentence is divided into three segments according to the positions of the two entities and max pooling is performed on each segment; and 3) nonlinear activation and output. Through these operations each input sentence is represented as a vector whose dimension is configurable; it can be set to 200, as suggested in prior work.
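The three PCNN steps can be illustrated with a small pure-Python sketch. It is not the trained model: the filter weights are passed in by the caller, the stride is fixed at 1 with zero padding, tanh is assumed as the activation, and each entity is assumed to belong to the segment that ends at its position.

```python
import math


def convolve(sequence, filters, width=3):
    """1-D convolution over the word axis with stride 1 and zero padding.

    sequence: n word vectors of size d.
    filters: flat weight lists of size width * d.
    Returns one feature row of length n per filter.
    """
    n, d = len(sequence), len(sequence[0])
    pad = [[0.0] * d] * (width // 2)
    padded = pad + sequence + pad
    features = []
    for f in filters:
        row = []
        for i in range(n):
            window = [x for vec in padded[i:i + width] for x in vec]
            row.append(sum(a * b for a, b in zip(f, window)))
        features.append(row)
    return features


def piecewise_max_pool(features, e1_index, e2_index):
    """Split each feature row into three segments at the entity positions,
    max-pool each segment, then apply tanh."""
    first, second = sorted((e1_index, e2_index))
    pooled = []
    for row in features:
        segments = (row[:first + 1], row[first + 1:second + 1], row[second + 1:])
        for seg in segments:
            # An empty segment (entity at the sentence end) contributes 0.
            pooled.append(math.tanh(max(seg)) if seg else 0.0)
    return pooled
```

The resulting sentence vector has three values per filter, one for each segment, which is how the pooling step keeps the coarse position of a feature relative to the two entities.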

Step S106: input the sentence vector of each sentence in the training text set into an entity relationship extraction model, and train the entity relationship extraction model according to the two entities marked in the sentence, the concepts respectively corresponding to the two entities marked in the sentence and the relationship between the two entities marked in the sentence.

After each sentence in the labeled training set is represented by a vector, the sentence can be used as the input of the entity relationship extraction model M, and the parameters of the neural network model M are trained according to the entity labeled in each training sample, the concept corresponding to the entity and the relationship of the entity.

Step S107: obtain a sentence vector of each sentence in the text set to be extracted.

Step S108: input the sentence vector of each sentence in the text set to be extracted into the entity relationship extraction model, and acquire, for each sentence in the text set to be extracted, a relationship set comprising the two entities, the concepts respectively corresponding to the two entities and the relationship between the two entities.

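In one embodiment, performing remote supervision and labeling on the sentence of the training text set and the triple relation set of the two entities identified in the sentence according to the entity triple relation set, the entity and entity attribute set and the concept set includes: performing context recognition on the sentences in the training text set to acquire the concepts respectively corresponding to the two entities identified in each sentence.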

Let se be an element of the set SE, i.e. se is a triple of a sentence in a news document and the two entities contained in that sentence. First, concept recognition is performed through context on each of the two entities e₁ and e₂ in the sentence s to obtain c₁ and c₂. Concept recognition is treated here as a classification problem and solved with a naive Bayes classifier; all possible concepts of entity e₁ and entity e₂ can be queried from the set C.
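A naive Bayes concept recognizer of this kind might look like the sketch below. The class name, the use of bag-of-words context features and the add-one smoothing are assumptions; the text names only the classification method and the fact that candidate concepts come from C.

```python
import math
from collections import Counter, defaultdict


class ConceptClassifier:
    """Multinomial naive Bayes over context words, with add-one smoothing."""

    def __init__(self):
        self.concept_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, examples):
        """examples: iterable of (context_words, concept) pairs."""
        for words, concept in examples:
            self.concept_counts[concept] += 1
            self.word_counts[concept].update(words)
            self.vocabulary.update(words)

    def predict(self, context_words, candidates):
        """Return the most probable concept among the candidates taken from C."""
        total = sum(self.concept_counts.values())
        vocab_size = max(len(self.vocabulary), 1)
        best, best_score = None, -math.inf
        for c in candidates:
            # Smoothed log prior, so unseen candidates keep a nonzero probability.
            score = math.log((self.concept_counts[c] + 1) / (total + len(candidates)))
            denom = sum(self.word_counts[c].values()) + vocab_size
            for w in context_words:
                score += math.log((self.word_counts[c][w] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best
```

Restricting the prediction to the candidates listed for the entity in C keeps the classifier from choosing a concept the entity can never have.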

Referring to fig. 2, in an embodiment, after performing context recognition on a sentence in the training text set and obtaining concepts corresponding to two entities recognized by the sentence, the method further includes the following steps:

Step S201: match the two entities identified in the sentence of the training text set against the entity triple relation set.

Step S202: if the matching fails, randomly extract a relation from the entity triple relation set, generate a relation set comprising the sentence, the two marked entities, the concepts respectively corresponding to the two marked entities and the randomly extracted relation, and put this data set into the labeling training set as a negative sample.

The relation triples in the entity triple relation set R are searched, using the entities e₁ and e₂ as the key to match against the relation triples (e₁, r, e₂). If no triple matches, it is considered that no relationship between e₁ and e₂ exists in the knowledge base; a relation r_random existing in the triple set R is then selected at random, and a label record (s, (e₁, c₁, r_random, c₂, e₂)) is generated and put into the labeling training set T_train as a negative sample.
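Steps S201 and S202 can be sketched as a single function. The function name and return shape are hypothetical, and the relation name HeadquarteredIn in the usage note is an invented example.

```python
import random


def match_or_sample_negative(s, e1, c1, e2, c2, triples, rng=random):
    """Return (records, matched); each record is (s, (e1, c1, r, c2, e2)).

    If (e1, r, e2) exists in the triple set, one record per matched relation
    is returned. Otherwise a relation is drawn at random from the triple set
    and the single record is meant to be stored as a negative sample.
    """
    matched = sorted(r for (head, r, tail) in triples if head == e1 and tail == e2)
    if matched:
        return [(s, (e1, c1, r, c2, e2)) for r in matched], True
    all_relations = sorted({r for (_, r, _) in triples})
    r_random = rng.choice(all_relations)
    return [(s, (e1, c1, r_random, c2, e2))], False
```

Called with a triple set containing (New York, CityOf, United States) and (IBM, HeadquarteredIn, Armonk), the pair (IBM, New York) finds no match and receives one of the two relations at random as its negative label.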

In one embodiment, the method further comprises the following steps:

if the matching is successful, generating a relation set comprising the sentence, the two marked entities, the concepts respectively corresponding to the two marked entities and the matched relation, and scoring the confidence of the matched relation; if the score exceeds a first set threshold, putting the data set into the labeling training set as a positive sample, and if the score is lower than the first set threshold, putting the data set into the labeling training set as a negative sample.

FIG. 3 is a schematic diagram of the remote supervision labeling process in this embodiment. If a triple (e₁, r, e₂) is matched, the matched relation r₁ is given a confidence score. The score is based on co-occurrence: the degree of correlation between the relation r₁ and the context in the sentence s is calculated, and the higher the correlation, the higher the confidence score. When the score exceeds the first set threshold, a quintuple (e₁, c₁, r₁, c₂, e₂) is generated, meaning that when the concept of entity e₁ is c₁ and the concept of entity e₂ is c₂, the relation r₁ holds between e₁ and e₂; a label record (s, (e₁, c₁, r₁, c₂, e₂)) is generated and added to the labeling training set T_train as a positive sample. If the score does not exceed the first set threshold, the label record (s, (e₁, c₁, r_random, c₂, e₂)) is added to the labeling training set T_train as a negative sample.
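The text does not give a formula for the co-occurrence score, so the following sketch substitutes a simple one: the fraction of context words in the sentence that have previously been observed together with the relation. The scoring formula, the `cooccurrence` table and the threshold value 0.5 are all assumptions made for illustration.

```python
def confidence_score(context_words, relation, cooccurrence):
    """Fraction of context words previously seen together with the relation.

    cooccurrence: dict mapping a relation to the set of words observed with it.
    """
    words = set(context_words)
    if not words:
        return 0.0
    seen = cooccurrence.get(relation, set())
    return len(words & seen) / len(words)


def label_matched(s, quintuple, context_words, cooccurrence, threshold=0.5):
    """Return (record, is_positive) for a sentence whose entity pair matched R."""
    relation = quintuple[2]
    score = confidence_score(context_words, relation, cooccurrence)
    return (s, quintuple), score > threshold
```

The function reports only whether the record clears the threshold; which relation is stored in a rejected record is left to the caller.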

Referring to fig. 4, in an embodiment, the method further includes the following steps:

Step S401: acquire, among the generated relation sets, a plurality of relation sets having the same concepts.

Step S402: judge the degree of correlation between each relation in the plurality of relation sets and the context of the sentence.

Step S403: replace the relationship in each of the plurality of relation sets with the relationship having the maximum degree of correlation, as a new relationship.

After all triples in the set SE of sentences of the training text set and the two entities identified in them have been labeled, the positive samples in the labeling result need to be corrected and adjusted, because all labeled relations are derived from Freebase and any deviation in the facts Freebase contains would introduce errors into the subsequent calculation. This correction improves and optimizes the relation labels between entities. For example, for a labeled positive sample (s, (e₁, c₁, r₁, c₂, e₂)), previous studies assumed that the labeled relation r₁ is correct. In the scheme of the invention, the concepts c₁ and c₂ are assumed to be labeled correctly, but the correctness of the relation r₁ needs to be verified and corrected. To reduce the computational complexity, in this embodiment a candidate relation set is first screened out for each label record: every relation whose label record has the same two concepts, respectively, is added to the candidate relation list R₁. For example, for a record (s₁, (e₃, c₁, r₂, c₂, e₄)), the concepts c₁ and c₂ are respectively the same as in the record above, so the relation r₂ expressed by the entities e₃ and e₄ under the concepts c₁ and c₂ is added to the candidate relation list R₁. Next, the optimal relation is identified among the candidates: the degree of correlation between each relation r_i in R₁ and the context in the sentence s is calculated, and the relation with the largest correlation, r_max, is taken as the optimized relation label. The positive sample record (s, (e₁, c₁, r₁, c₂, e₂)) is deleted from the labeling data set T, and the optimized record (s, (e₁, c₁, r_max, c₂, e₂)) is added as a new positive sample. Finally, the quintuple relation (e₁, c₁, r_max, c₂, e₂) of the entities e₁ and e₂ is added to, or updated in, the knowledge base KB to be built.
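The correction of a positive sample can be sketched as follows. The function name is hypothetical, and the `relevance` callback is an assumed stand-in for the correlation between a relation and the context of the sentence, which the text does not define in closed form.

```python
def correct_relation(record, positive_samples, relevance):
    """Replace the relation of a positive record by the most relevant candidate.

    relevance(relation, sentence) -> float is supplied by the caller.
    """
    s, (e1, c1, r1, c2, e2) = record
    # Candidate list R1: relations of all records sharing the concept pair (c1, c2).
    candidates = {
        r for _, (_, ca, r, cb, _) in positive_samples if (ca, cb) == (c1, c2)
    }
    candidates.add(r1)
    # Sorting first makes the choice deterministic when scores tie.
    r_max = max(sorted(candidates), key=lambda r: relevance(r, s))
    return (s, (e1, c1, r_max, c2, e2))
```

Screening by concept pair is what keeps the search cheap: only relations already observed between a c₁ entity and a c₂ entity are scored, rather than every relation in R.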

Please refer to FIG. 5, which is a diagram of the entity relationship extraction model M according to an embodiment of the present invention. The labeling training set is randomly divided into three parts, T_train (80% of the entire data set), T_valid (10%) and T_test (10%), which serve as the training set, validation set and test set respectively and obey the same data distribution.
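A random 80/10/10 split of this kind can be written as below. The function name and the fixed seed are assumptions added so that the sketch is reproducible.

```python
import random


def split_dataset(samples, seed=0):
    """Shuffle the samples and split them into 80% train, 10% validation, 10% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_valid = int(len(shuffled) * 0.1)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

Shuffling before slicing is what lets the three parts follow the same data distribution; any remainder from rounding falls into the test part.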

The parameters of the entity relationship extraction model M comprise hyper-parameters and ordinary parameters. Four hyper-parameters of the convolutional neural network need initial values: the sample size per batch is set to B = 100, the learning rate of stochastic gradient descent to λ = 0.01, the dropout probability of the neural network units to ρ = 0.5, and the maximum number of uses of each sample to n = 10. Once the hyper-parameters are set, training of the entity relationship extraction model M begins. The processed positive and negative samples are input into the convolutional neural network in batches. For each sample, the error between the concept recognition result and the labeled concept category, and the error between the entity relationship extraction result and the labeled entity relationship, are recorded; the combined error of the convolutional neural network is minimized by the stochastic gradient descent algorithm, and the ordinary parameters of the model M are continuously adjusted and saved. To detect problems with the model parameters in time and to verify the generalization ability of the model, after every 5 batches of samples the validation set T_valid prepared in advance is used to check whether the parameter settings of the current network model M are reasonable, and the parameters are adjusted promptly if they are not.
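The hyper-parameters and the batch and validation cadence can be captured in a small scheduling sketch. It contains no neural network. The class and function names are hypothetical, and two readings of the text are assumed: the maximum number of uses per sample is treated as the number of passes over the data, and validation is run after every fifth batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperParameters:
    batch_size: int = 100        # B
    learning_rate: float = 0.01  # lambda, for stochastic gradient descent
    dropout: float = 0.5         # rho
    max_uses: int = 10           # n, maximum number of uses per sample
    validate_every: int = 5      # batches between validation checks


def training_schedule(num_samples, hp=HyperParameters()):
    """Yield (batch_indices, validate_now) pairs.

    Each sample is visited once per pass, and there are hp.max_uses passes.
    """
    batch_number = 0
    for _ in range(hp.max_uses):
        for start in range(0, num_samples, hp.batch_size):
            batch_number += 1
            indices = range(start, min(start + hp.batch_size, num_samples))
            yield indices, batch_number % hp.validate_every == 0
```

With 250 samples the schedule produces 3 batches per pass and 30 batches in total, with a validation check after batches 5, 10, 15, 20, 25 and 30.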

After training of the entity relationship extraction model M is completed, the present invention uses two benchmark data sets: 1) the SemEval-2010 Task 8 data set, which contains 9 bidirectional relations and 1 undirected "Other" relation, with 10717 labeled samples in total; and 2) the NYT10 data set, which contains 53 relations in total, one of which, "NA", indicates that the two entities have no relation, with 20202 labeled samples in total. Entity relationship extraction is performed on each of the two data sets with the model M, and the precision, recall and F1 value are computed.
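The three statistics can be computed as in the sketch below. The text does not state how the scores are averaged or how the "NA" label is treated; micro-averaging with "NA" excluded from the positive class is an assumption here.

```python
def precision_recall_f1(gold, predicted, no_relation="NA"):
    """Micro-averaged scores over relation labels, ignoring the no-relation label."""
    true_positive = sum(
        1 for g, p in zip(gold, predicted) if g == p and p != no_relation
    )
    predicted_positive = sum(1 for p in predicted if p != no_relation)
    gold_positive = sum(1 for g in gold if g != no_relation)
    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / gold_positive if gold_positive else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Excluding the no-relation label from the counts prevents a model that predicts "NA" everywhere from appearing accurate on a data set where most entity pairs are unrelated.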

The method for extracting the entity relationship in the text, which is provided by the invention, can fundamentally reduce and solve the problem of wrong labeling in the knowledge base by using the multi-concept multi-relationship knowledge base. Meanwhile, the method for extracting the entity relationship in the text can effectively utilize the concept information of the entity, combines the context of the entity, eliminates the noise relationship before the relationship extraction, reduces the search space of the relationship extraction and improves the speed and the precision of the relationship extraction.

Referring to fig. 6, which is a schematic structural diagram of a system for extracting entity relationships in text according to an embodiment of the present invention, the system 600 for extracting entity relationships in text includes:

a first obtaining module 601, configured to obtain the entity triple relationship set, the entity and entity attribute set, and the concept set;

a second obtaining module 602, configured to obtain a sentence of the training text set and the triple relationship set of the two entities identified in the sentence;

a remote supervision and labeling module 603, configured to perform remote supervision and labeling on the sentence of the training text set and the triple relationship set of the two entities identified in the sentence according to the entity triple relationship set, the entity and entity attribute set, and the concept set, to obtain a relationship set comprising the sentence of the training text set, the two entities identified in the sentence, the concepts respectively corresponding to the two entities, and the relationship of the two entities, and to place the relationship set in a labeled training set;

a representation input module 604, configured to obtain vector representations of the words in the sentences of the training text set according to the labeled training set;

a first sentence representation module 605, configured to obtain a sentence vector of each sentence in the training text set according to the vector representations of the words in the sentence;

an entity relationship extraction model training module 606, configured to input the sentence vector of each sentence in the training text set into the entity relationship extraction model, and to train the entity relationship extraction model according to the two entities labeled in the sentence, the concepts corresponding to the two entities labeled in the sentence, and the relationship between the two entities labeled in the sentence;

a second sentence representation module 607, configured to obtain a sentence vector of each sentence in the text set to be extracted;

an entity relationship extraction module 608, configured to input the sentence vector of each sentence in the text set to be extracted into the entity relationship extraction model, and to obtain, for each sentence in the text set to be extracted, a relationship set comprising the two entities, the concepts respectively corresponding to the two entities, and the relationship of the two entities.

In an embodiment, the remote supervision and labeling module 603 further includes a context recognition unit 6031, configured to perform context recognition on the sentences of the training text set and obtain the concepts respectively corresponding to the two entities identified in each sentence.

In one embodiment, the remote supervision and labeling module 603 further includes a matching unit 6032 and a random extraction unit 6033. The matching unit 6032 is configured to match the two entities identified in a sentence of the training text set against the entity triple relationship set. The random extraction unit 6033 is configured to, if the matching fails, randomly extract a relationship from the entity triple relationship set, generate a data set comprising the sentence, the two labeled entities, the concepts corresponding to the two labeled entities, and the randomly extracted relationship, and place the data set in the labeled training set as a negative sample.
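A minimal sketch of the matching and random extraction behavior is given below. It assumes that the entity triple relationship set is held as a set of (head entity, relationship, tail entity) tuples and that matching means an exact match on both entity names; the names `DataSet` and `negative_on_mismatch` are hypothetical.

```python
import random
from typing import NamedTuple, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relationship, tail entity)


class DataSet(NamedTuple):
    sentence: str
    entities: Tuple[str, str]
    concepts: Tuple[str, str]
    relation: str
    label: str  # always "negative" in this sketch


def negative_on_mismatch(sentence: str, entities: Tuple[str, str],
                         concepts: Tuple[str, str], triples: Set[Triple],
                         rng: random.Random) -> Optional[DataSet]:
    """Return a negative sample if the entity pair matches no triple.

    Returns None when matching succeeds; such pairs are left to the
    confidence scoring step.
    """
    if not triples:
        raise ValueError("the entity triple relationship set is empty")
    head, tail = entities
    if any(h == head and t == tail for h, _, t in triples):
        return None
    # sorted() makes the draw reproducible for a seeded rng; relationships
    # that appear in more triples are drawn proportionally more often.
    relation = rng.choice(sorted(r for _, r, _ in triples))
    return DataSet(sentence, entities, concepts, relation, "negative")
```

Passing a seeded `random.Random` instance keeps the generated negative samples reproducible between runs.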

In an embodiment, the remote supervision and labeling module 603 further includes a confidence scoring unit 6034, configured to, if the matching succeeds, generate a data set comprising the sentence, the two labeled entities, the concepts corresponding to the two labeled entities, and the matched relationship, and to perform confidence scoring on the matched relationship. If the scoring result exceeds a first set threshold, the data set is placed in the labeled training set as a positive sample; if the scoring result is lower than the first set threshold, the data set is placed in the labeled training set as a negative sample.
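The thresholding step can be illustrated with the sketch below. The claims state only that the confidence score follows the proportion in which the relationship co-occurs with the sentence context in the corpus; the concrete formula used here (the mean, over the context words, of the relationship-word co-occurrence count divided by the word count), the count tables, and the names `confidence` and `label_by_threshold` are assumptions. The source does not say how a score equal to the threshold is handled; this sketch treats it as negative.

```python
from typing import Dict, Iterable, Tuple


def confidence(relation: str, context: Iterable[str],
               cooc: Dict[Tuple[str, str], int],
               word_count: Dict[str, int]) -> float:
    """Mean proportion of context-word occurrences that co-occur with `relation`.

    Context words that never appear in the corpus counts are skipped.
    """
    ratios = [cooc.get((relation, w), 0) / word_count[w]
              for w in context if word_count.get(w, 0) > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0


def label_by_threshold(score: float, threshold: float) -> str:
    """Positive only when the score strictly exceeds the first set threshold."""
    return "positive" if score > threshold else "negative"
```

In this form a relationship that rarely appears alongside the words of the sentence receives a low score and is kept only as a negative sample.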

In one embodiment, the remote supervision and labeling module 603 further includes:

a relationship set acquisition unit 6035, configured to acquire, among the generated relationship sets, a plurality of relationship sets having the same concept;

a contextual relevance determination unit 6036, configured to determine the degree of contextual relevance between each relationship in the plurality of relationship sets and the sentence;

a relationship replacing unit 6037, configured to replace the relationship in the plurality of relationship sets with the relationship having the greatest degree of relevance, as a new relationship;

a relationship set deleting unit 6038, configured to delete the plurality of relationship sets from the labeled training set;

a relationship set replacing unit 6039, configured to place the plurality of relationship sets including the new relationship into the labeled training set.
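The behavior of units 6035 to 6039 can be sketched as one pass over the labeled training set. This is an illustrative reading rather than the patented implementation: grouping relationship sets by sentence and concept pair, the caller-supplied `relevance` function, and the names `RelationSet` and `replace_with_most_relevant` are assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Tuple


class RelationSet(NamedTuple):
    sentence: str
    entities: Tuple[str, str]
    concepts: Tuple[str, str]
    relation: str


def replace_with_most_relevant(
        training_set: List[RelationSet],
        relevance: Callable[[str, str], float]) -> List[RelationSet]:
    """Return a new training set in which each group of relationship sets
    sharing a sentence and concepts carries its most relevant relationship."""
    groups: Dict[Tuple[str, Tuple[str, str]], List[RelationSet]] = defaultdict(list)
    for item in training_set:
        groups[(item.sentence, item.concepts)].append(item)

    result: List[RelationSet] = []
    for items in groups.values():
        if len(items) == 1:
            result.extend(items)  # nothing to compare against
            continue
        best = max(items, key=lambda it: relevance(it.relation, it.sentence))
        # The old sets are dropped and copies carrying the new relationship
        # are added, keeping the number of sets unchanged.
        result.extend(it._replace(relation=best.relation) for it in items)
    return result
```

Building a new list rather than editing in place covers both the deletion of the old relationship sets and the insertion of the replaced ones.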

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The system for extracting entity relationships in text provided by the invention uses a multi-concept, multi-relationship knowledge base, and can thereby fundamentally reduce and resolve the problem of wrong labeling in the knowledge base. At the same time, the system makes effective use of the concept information of the entities in combination with their context, eliminates noisy relationships before relationship extraction, narrows the search space of relationship extraction, and improves the speed and precision of relationship extraction.

The present invention also provides a computer readable medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the entity relationship extraction method in the text in any one of the above embodiments.

Referring to fig. 7, in an embodiment, an electronic device 700 of the present invention includes a memory 710 and a processor 720, and a computer program stored in the memory 710 and executable by the processor 720, where when the processor 720 executes the computer program, the method for extracting an entity relationship in a text in any one of the above embodiments is implemented.

In this embodiment, the processor 720 may be one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components. The memory 710 may take the form of a computer program product embodied on one or more storage media containing program code, including but not limited to disk storage, CD-ROM, and optical storage. Computer readable storage media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to: Phase Change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in practice: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

The above embodiments express only several implementations of the present invention, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention.

Claims

1. A method for extracting entity relationships in text, characterized by comprising the following steps:

obtaining an entity triple relationship set, an entity and entity attribute set, and a concept set;

obtaining a sentence of a training text set and a triple relationship set of the two entities identified in the sentence;

performing remote supervision and labeling on the sentence of the training text set and the triple relationship set of the two entities identified in the sentence according to the entity triple relationship set, the entity and entity attribute set, and the concept set, specifically by performing context recognition on the sentence of the training text set to obtain the concepts respectively corresponding to the two entities identified in the sentence;

matching the two entities identified in the sentence of the training text set against the entity triple relationship set to obtain relationship sets each comprising the sentence of the training text set, the two entities identified in the sentence, the concepts respectively corresponding to the two entities, and the relationship of the two entities; specifically, obtaining, among the generated relationship sets, a plurality of relationship sets having the same concept, determining the degree of contextual relevance between each relationship in the plurality of relationship sets and the sentence, replacing the relationship in the plurality of relationship sets with the relationship having the greatest degree of relevance as a new relationship, and placing the plurality of relationship sets including the new relationship into a labeled training set;

obtaining vector representations of the words in the sentences of the training text set according to the labeled training set;

obtaining a sentence vector of each sentence in the training text set according to the vector representations of the words in the sentence;

inputting the sentence vector of each sentence in the training text set into an entity relationship extraction model, and training the entity relationship extraction model according to the two entities labeled in the sentence, the concepts corresponding to the two entities labeled in the sentence, and the relationship between the two entities labeled in the sentence;

obtaining a sentence vector of each sentence in a text set to be extracted;

inputting the sentence vector of each sentence in the text set to be extracted into the entity relationship extraction model, and obtaining, for each sentence in the text set to be extracted, a relationship set comprising two entities, the concepts respectively corresponding to the two entities, and the relationship of the two entities.

2. The method for extracting entity relationships in text according to claim 1, characterized in that, after performing context recognition on the sentence of the training text set to obtain the concepts respectively corresponding to the two entities identified in the sentence, the method further comprises the following step:

if the matching fails, randomly extracting a relationship from the entity triple relationship set, generating a data set comprising the sentence, the two labeled entities, the concepts corresponding to the two labeled entities, and the randomly extracted relationship, and placing the data set in the labeled training set as a negative sample.

3. The method for extracting entity relationships in text according to claim 2, characterized by further comprising the following step:

if the matching succeeds, generating a data set comprising the sentence, the two labeled entities, the concepts corresponding to the two labeled entities, and the matched relationship, and performing confidence scoring on the matched relationship; if the scoring result exceeds a first set threshold, placing the data set in the labeled training set as a positive sample, and if the scoring result is lower than the first set threshold, placing the data set in the labeled training set as a negative sample.

4. The method for extracting entity relationships in text according to claim 3, characterized in that performing confidence scoring on the matched relationship comprises:

obtaining the degree of relevance between the matched relationship and the context of the sentence according to the proportion of co-occurrence with the context of the sentence in the corpus, wherein the higher the degree of relevance, the higher the confidence score.

5. The method for extracting entity relationships in text according to claim 4, characterized in that, after replacing the relationship in the plurality of relationship sets with the relationship having the greatest degree of relevance as a new relationship, the method further comprises the following step:

deleting the plurality of relationship sets from the labeled training set.

6. A system for extracting entity relationships in text, characterized by comprising:

a first obtaining module, configured to obtain an entity triple relationship set, an entity and entity attribute set, and a concept set;

a second obtaining module, configured to obtain a sentence of a training text set and a triple relationship set of the two entities identified in the sentence;

a remote supervision and labeling module, configured to perform remote supervision and labeling on the sentence of the training text set and the triple relationship set of the two entities identified in the sentence according to the entity triple relationship set, the entity and entity attribute set, and the concept set; specifically, configured to perform context recognition on the sentence of the training text set to obtain the concepts respectively corresponding to the two entities identified in the sentence, to match the two entities identified in the sentence of the training text set against the entity triple relationship set to obtain relationship sets each comprising the sentence of the training text set, the two entities identified in the sentence, the concepts respectively corresponding to the two entities, and the relationship of the two entities, to obtain, among the generated relationship sets, a plurality of relationship sets having the same concept, to determine the degree of contextual relevance between each relationship in the plurality of relationship sets and the sentence, to replace the relationship in the plurality of relationship sets with the relationship having the greatest degree of relevance as a new relationship, and to place the plurality of relationship sets including the new relationship into a labeled training set;

a representation input module, configured to obtain vector representations of the words in the sentences of the training text set according to the labeled training set;

a first sentence representation module, configured to obtain a sentence vector of each sentence in the training text set according to the vector representations of the words in the sentence;

an entity relationship extraction model training module, configured to input the sentence vector of each sentence in the training text set into an entity relationship extraction model, and to train the entity relationship extraction model according to the two entities labeled in the sentence, the concepts corresponding to the two entities labeled in the sentence, and the relationship between the two entities labeled in the sentence;

a second sentence representation module, configured to obtain a sentence vector of each sentence in a text set to be extracted;

an entity relationship extraction module, configured to input the sentence vector of each sentence in the text set to be extracted into the entity relationship extraction model, and to obtain, for each sentence in the text set to be extracted, a relationship set comprising two entities, the concepts respectively corresponding to the two entities, and the relationship of the two entities.

7. A computer-readable medium having a computer program stored thereon, characterized in that:

When the computer program is executed by the processor, the method for extracting the entity relationship in the text according to any one of claims 1 to 5 is realized.

8. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, characterized in that:

When the processor executes the computer program, the method for extracting entity relationship in text according to any one of claims 1 to 5 is realized.