CN107291687B - Chinese unsupervised open type entity relation extraction method based on dependency semantics

Info

Publication number: CN107291687B
Application number: CN201710285995.4A
Authority: CN
Inventors: 向阳; 贾圣宾; 鄂世嘉; 吕东东
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2021-03-26
Anticipated expiration: 2037-04-27
Also published as: CN107291687A

Abstract

The invention relates to a Chinese unsupervised open entity relation extraction method based on dependency semantics. The method comprises the following steps: preprocessing the input text by performing Chinese word segmentation, part-of-speech tagging and dependency syntactic analysis; performing named entity recognition on the input text; arbitrarily selecting two of the recognized entities to form a candidate entity pair; finding the dependency path between the two entities in the candidate entity pair; analyzing whether the syntactic structure mapped by the dependency path matches a paradigm in the dependency semantic paradigm set; if it matches, extracting words or phrases from the remaining part of the input text as relation words according to the matched paradigm, the extracted relation words and the candidate entity pair forming a relation triple; otherwise, performing paradigm matching on the next candidate entity pair; and outputting the relation triples. Compared with the prior art, the invention has low computational complexity and high extraction efficiency, overcomes distance and position limitations, and can extract relations even from a single sentence.

Description

Chinese unsupervised open type entity relation extraction method based on dependency semantics

Technical Field

The invention relates to information extraction research in the fields of artificial intelligence and natural language processing, in particular to a Chinese unsupervised open entity relation extraction method based on dependency semantics.

Background

The wave of big data is surging like the tidal bore of the Qiantang River, and the data accumulated on the Internet is growing explosively. Faced with the massive amount of information on the web, users find it very difficult to quickly locate the information they care about. A traditional search engine can only return a large number of web pages related to the user's query, and the user obtains the required information only after browsing those pages. This single mode of returning web pages cannot meet users' practical needs when dealing with massive network data. The Internet provides people with an inexhaustible source of information, and how to mine valuable information from it quickly, accurately and automatically has become a focus of attention.

Information extraction technology emerged in response. The task of information extraction is to output the unstructured information contained in text in a structured or semi-structured form, so that the content the user really cares about can be obtained quickly and intelligent, user-friendly information services can be provided. For example, information such as persons, time, location, casualties and accident causes can be extracted from news reports of an airplane crash, allowing users to quickly grasp the course of the event. Named entity relation extraction, also called entity relation extraction or relation extraction, is a core subtask of information extraction. It extracts the semantic relations between related named entities from unstructured natural language text and organizes them into structured relation triples (Entity1, relation words, Entity2), where Entity1 and Entity2 are a pair of related entities and the relation words are a word or word sequence describing the semantic relation between them.

Entity relation extraction has important research value and is widely applied in many artificial intelligence fields, such as knowledge graphs, intelligent search engines, automatic question answering systems, text mining and machine translation.

Conventional information extraction identifies target relation types with a trained extractor, which requires predefined relation types and a large labeled corpus. Traditional Chinese relation extraction is based on supervised machine learning and mainly comprises feature-based methods and kernel-based methods. Such methods have several disadvantages: first, it is difficult to define a comprehensive hierarchy of entity relation types; second, they depend heavily on large-scale labeled training corpora, whose manual annotation is time-consuming and labor-intensive and whose quality is difficult to control; finally, open web text is vast and its relation types cannot be predefined, so traditional methods cannot meet the needs of open-domain information extraction.

Open entity relation extraction overcomes these shortcomings: it can automatically discover arbitrary relation types in web text, and it has important development prospects and research value. Research on open relation extraction has mainly applied clustering algorithms. Candidate entity pairs are extracted under position and distance constraints and clustered into groups of similar entity pairs; each cluster is then given a relation label, with representative words selected as the relation descriptor of the class. Such methods have two problems. First, the clustering algorithm needs a considerable number of related entity pairs, so it cannot produce useful results for a single entity pair or a small number of pairs, and its performance degrades severely when the corpus is insufficient. Second, it is difficult to determine whether the core relation word finally chosen is a valid relation feature word, and the descriptor chosen for a cluster does not necessarily fit every entity pair in that cluster. In addition, methods based on deep syntactic analysis or semantic role labeling have achieved good results, but this research has mainly focused on English corpora.

Considerable effort has gone into open relation extraction on English corpora, but relatively little research has been done on Chinese corpora. Chinese is uniquely flexible and complex in word formation, sentence formation and expression, which makes it far harder to process than English, so existing English entity relation extraction systems cannot be applied directly to Chinese corpora. An entity relation extraction system suited to Chinese can be obtained only by carefully studying Chinese lexicon and syntax and introducing them into entity relation extraction.

Research shows that certain syntactic relations often exist between related entities. For example, if two entities are respectively the subject and the object of a sentence, the relation feature word of the entity pair is most likely the predicate verb. If the syntactic relation between the entity pair is known in advance, the relation feature words can be determined accurately. Dependency parsing reflects the semantic modification relations between the components of a sentence. Since a named entity in a sentence must appear as a noun phrase in the dependency structure, the dependency path between entities necessarily reflects the relation features of the corresponding entity pair.

In summary, in order to make entity relation extraction better suited to Chinese corpora, the invention builds on the specific syntactic and semantic features of Chinese and fully demonstrates the adaptability and effectiveness of unsupervised methods in the open domain. The invention provides an unsupervised Chinese open relation extraction method based on Dependency Semantic Normal Forms (DSNFs), contributing an innovative result to research on Chinese open relation extraction.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and to provide a Chinese unsupervised open entity relation extraction method based on dependency semantics. Traditional extraction methods place high demands on training corpora, have poor portability and extensibility, and cannot adapt to open web text; moreover, because Chinese is complex and flexible in lexicon and grammar, extraction methods developed for English corpora cannot be transplanted to Chinese. To address these problems, the invention provides an open, unsupervised entity relation extraction method for web text based on the linguistic characteristics of Chinese.

To solve the above technical problems, the invention takes the mapping between entity relations and the dependency parse tree as its basis, deeply mines the dependency semantics implied by the shortest dependency path, derives the dependency semantic normal forms using features such as dependency relations, part-of-speech information and positional relations as constraints, and thereby proposes and implements a novel unsupervised Chinese open relation extraction method.

The purpose of the invention can be realized by the following technical scheme:

A Chinese unsupervised open entity relation extraction method based on dependency semantics comprises the following steps:

S1, preprocessing the input text: performing Chinese word segmentation, part-of-speech tagging and dependency syntactic analysis on the input text;

S2, performing named entity recognition on the input text;

S3, arbitrarily selecting two of the recognized entities to form a candidate entity pair;

S4, finding the dependency path between the two entities in the candidate entity pair;

S5, analyzing whether the syntactic structure mapped by the dependency path between the two entities in the candidate entity pair matches a paradigm in the dependency semantic paradigm set; if so, extracting words or phrases from the remaining part of the input text as relation words according to the matched paradigm, the extracted relation words and the candidate entity pair forming a relation triple; if not, performing paradigm matching on the next candidate entity pair;

S6, outputting the relation triples.

The relation triple has the form (Entity1, relation words, Entity2), where Entity1 and Entity2 are a pair of related entities and the relation words describe the semantic relation between them.
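Step S3 and the triple form can be illustrated with a minimal Python sketch. The names `Triple` and `candidate_pairs` and the sample entity list are illustrative assumptions, not part of the claimed method; the sketch only shows that n recognized entities yield n(n-1)/2 unordered candidate pairs, each of which is later tested against the paradigm set.

```python
from itertools import combinations
from typing import NamedTuple


class Triple(NamedTuple):
    """A relation triple (Entity1, relation words, Entity2)."""
    entity1: str
    relation: str
    entity2: str


def candidate_pairs(entities):
    """Step S3: every unordered pair of recognized entities, kept in text order."""
    return list(combinations(entities, 2))


# Hypothetical output of named entity recognition for one sentence.
entities = ["Liu", "Peng", "Shanghai"]
pairs = candidate_pairs(entities)
print(pairs)
print(Triple("Liu", "visit", "Shanghai"))
```

Keeping the pairs in text order means the first member of a pair always precedes the second in the sentence, which is convenient for paradigms that are written left to right.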

The dependency semantic normal forms comprise a first class of pre-modification structures, a second class of parallel structures, a third class of verb-related structures, a fourth class of formulaic (templated) structures, and a class of other structures.

The first class, pre-modification structures, comprises the combined attributive structure and the structure in which the attributive is connected to the head word by the structural particle "的". The combined attributive structure corresponds to the dependency semantic paradigm "Entity1 + AttWord1 (+ AttWord2) + Entity2"; the structure connected by the structural particle corresponds to the dependency semantic paradigm "Entity1 + 的 + Noun + Entity2" or "Entity1 + 的 + Entity2 + Noun", where Entity1 and Entity2 are a pair of related entities, AttWord1 and AttWord2 are different attributive words, and Noun is a noun.

The second parallel structure class comprises a parallel noun structure and a parallel verb structure.

The parallel noun structure comprises parallel entities serving as subject, parallel entities serving as the object of the predicate, parallel entities serving as the object of a preposition, and mixtures of these three. Parallel entities serving as subject correspond to the dependency semantic paradigm "Entity2 + Conj + (Entity1++) + Pred + Entity3"; parallel entities serving as the object of the predicate correspond to "Entity2 + Pred + Entity3 + Conj + (Entity1++)"; parallel entities serving as the object of a preposition correspond to "Entity2 + Prep + Entity3 + Conj + (Entity1++) + Pred (+ Dobj)", where Entity2 and Entity3 are a pair of related entities, (Entity1++) indicates one or more parallel entities, Conj is a conjunction, Pred is a predicate, Prep is a preposition, and Dobj is a direct object.

The parallel verb structure comprises a verb connection structure and a parallel compound sentence structure.

The third class, verb-related structures, comprises the subject-predicate-object structure and the subject-predicate-prepositional-object structure. The subject-predicate-object structure corresponds to the dependency semantic paradigm "Entity1 + Pred + Entity2", and the subject-predicate-prepositional-object structure corresponds to "Entity1 + Prep + Entity2 + Pred (+ Dobj)", where Entity1 and Entity2 are a pair of related entities, Pred is a predicate, Prep is a preposition, and Dobj is a direct object.

Compared with the prior art, the invention has the following advantages:

1) the method copes well with complex Chinese syntax, and during extraction the relative positions of the entity pair and the relation words need not be restricted, which avoids the defects caused by position constraints in traditional methods;

2) the method obtains richer results: it can extract relation phrases whose core is either a verb or a noun, whereas other well-performing extractors can only extract verbs as relation phrases;

3) the method better identifies long-span dependencies and, in particular, can extract co-occurring relation triples in parallel structures, which avoids the defects caused by distance constraints in traditional methods;

4) the method requires no training corpus, can extract relations from a single sentence, and has low computational complexity and high extraction efficiency, so it can meet strict real-time requirements.

Drawings

FIG. 1 is a schematic flow chart of the extraction method of the present invention;

FIG. 2 is a dependency semantic paradigm DSNF1 graph model;

FIG. 3 is a dependency semantic paradigm DSNF2 graph model;

FIG. 4 is a dependency semantic paradigm DSNF3 graph model;

FIG. 5 is a dependency semantic paradigm DSNF4 graph model;

FIG. 6 is a dependency semantic paradigm DSNF5 graph model;

FIG. 7 is a dependency semantic paradigm DSNF6 graph model;

FIG. 8 is a dependency semantic paradigm DSNF7 graph model;

FIG. 9 is a dependency semantic paradigm DSNF8 graph model;

FIG. 10 is a dependency semantic paradigm DSNF9 graph model.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

Examples

The invention provides a Chinese unsupervised open entity relation extraction method based on dependency semantics. It is an entity relation extraction method based on dependency semantic normal forms (DSNFs) that runs fully automatically without manual intervention: the input is raw natural language sentences without any prior processing, and the output is entity relation triples. As shown in FIG. 1, the whole process can be described as follows:

Step 1: preprocess the input text. Each sentence undergoes a series of natural language processing operations, such as word segmentation, part-of-speech tagging and dependency syntactic analysis, in preparation for the subsequent steps. The method performs these operations with the natural language processing tools of the Language Technology Platform (LTP) developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology.

Step 2: select candidate entity pairs. A named entity recognition module recognizes the entities in the input text, and all recognized candidate entities are then paired two by two. Named entity recognition combines the named entity recognition provided by LTP with an iterative heuristic. The heuristic obtains maximal noun phrases by merging adjacent nouns, where the part of speech of each noun must be one of {ni, nh, ns, nz, j}, denoting organization names, person names, place names, other proper nouns and abbreviations, respectively. The two methods complement each other and are applied together.

Step 3: match the dependency semantic paradigms. For each candidate entity pair obtained in Step 2, analyze whether the syntactic structure mapped by the shortest dependency path between the two entities matches one of the DSNFs.

Step 4: output the relation triples. After Step 3, if a paradigm is matched, the relation words are extracted according to it and the relation triple is output; if not, the next candidate entity pair is matched.
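The noun-merging heuristic of Step 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `merge_noun_runs`, the `sep` parameter and the sample sentence are hypothetical, and the (word, pos) pairs are hand-written rather than real tagger output. A single left-to-right pass is used here because it yields the same maximal runs as repeatedly merging neighbouring nouns.

```python
ENTITY_POS = {"ni", "nh", "ns", "nz", "j"}  # tag set listed in Step 2


def merge_noun_runs(tagged, sep=" "):
    """Merge each maximal run of adjacent entity-like nouns into one phrase.

    tagged is a list of (word, pos) pairs for one sentence.
    """
    phrases, run = [], []
    for word, pos in tagged:
        if pos in ENTITY_POS:
            run.append(word)      # extend the current run
            continue
        if run:                   # a non-entity token closes the run
            phrases.append(sep.join(run))
            run = []
    if run:                       # flush a run that reaches the sentence end
        phrases.append(sep.join(run))
    return phrases


# Hand-written (word, pos) pairs; not real tagger output.
sentence = [("Shanghai", "ns"), ("University", "ni"), ("professor", "n"),
            ("Wang", "nh"), ("visit", "v"), ("Beijing", "ns")]
candidates = merge_noun_runs(sentence)
print(candidates)
```

For Chinese text the words would be joined without a separator, so `sep=""` would be passed instead of the default used for these English glosses.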

The core of the method provided by the invention lies in a dependency semantic paradigm, and the following will focus on the relevant contents:

A statistical analysis of a large number of relation instances shows that relation triples always appear in certain fixed syntactic structures. The syntactic structures that characterize entity relations include the subject-predicate relation, the verb-object relation, the preposition-object relation, parallel components, modification relations and the like. Mapping these structures onto dependency trees yields the dependency semantic normal forms (DSNFs). A DSNF is a combination of word sequence, parts of speech, dependency paths and the associated dependency labels. The method divides the paradigm set into five major classes: pre-modification, parallel, verb-related, formulaic (templated) and others. In each class, one or more DSNFs can be obtained, providing a sound basis for relation extraction.
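Because a dependency parse is a tree, the path between two entity tokens is unique and can be found through their lowest common ancestor. The sketch below illustrates this under stated assumptions: the `heads` dictionary (token index to head index, with 0 for the root), the function names and the sample parse are all hypothetical and hand-written, not output of any particular parser.

```python
def path_to_root(heads, i):
    """Token indices from i up to the root; heads maps index -> head (0 = root)."""
    path = [i]
    while heads[i] != 0:
        i = heads[i]
        path.append(i)
    return path


def dependency_path(heads, a, b):
    """Unique path between tokens a and b in the dependency tree."""
    up_a = path_to_root(heads, a)
    up_b = path_to_root(heads, b)
    on_b = set(up_b)
    # Lowest common ancestor: first node on a's upward path that is also on b's.
    lca = next(n for n in up_a if n in on_b)
    return up_a[:up_a.index(lca) + 1] + up_b[:up_b.index(lca)][::-1]


# Hand-written parse: 1 Liu, 2 preposition, 3 Shanghai, 4 predicate (root), 5 object.
heads = {1: 4, 2: 4, 3: 2, 4: 0, 5: 4}
path = dependency_path(heads, 1, 3)
print(path)
```

In this toy parse the path from token 1 to token 3 runs up to the predicate and down through the preposition, which is the shape that the prepositional-object paradigm looks for.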

First, Pre-Modification (Pre-Modification Class, PreMod)

Pre-modification is a very important type of modification in Chinese phrases. From the viewpoint of Chinese linguistics, relations in the PreMod syntactic class are expressed by the modifier-head structure, which forms modifier-head phrases. A nominal modifier-head phrase consists of an attributive and the head it modifies, the attributive being the component placed before the head. A wide range of items can act as attributives: apart from adverbs and "的" phrases, all other kinds of content words (nouns, verbs and adjectives) and phrases can serve as attributives. Attributives are further complicated by their multi-layer nature: several attributives may restrict and describe the same head from different aspects and be stacked in front of it, so one head may carry multiple attributives.

In terms of formal structure, attributives can be classified into the following two types:

1) Combined attributives, which are placed directly before the head with no "的" in between, i.e. "attributive + head". For example, in "<ORG>Senior Grade 3 Class 2</ORG> head teacher <PER>Wang</PER>", "Senior Grade 3 Class 2" is an attributive of "head teacher", and "Senior Grade 3 Class 2 head teacher" is an attributive of "Wang"; "head teacher" also expresses the semantic relation between the entities "Senior Grade 3 Class 2" and "Wang", which yields the relation triple (Senior Grade 3 Class 2, head teacher, Wang). Because attributives are multi-layered, several words may combine to form the relation feature words. For example, from "<ORG>a certain company</ORG> chief executive officer <PER>Zhao</PER>" the relation (a certain company, chief executive officer, Zhao) can be extracted, where the relation feature words combine "chief" and "executive officer". PER denotes a person name and ORG denotes an organization name.

Mapped into dependency analysis, the combined attributive structure is expressed as follows: an attributive depends on the head with the dependency relation "attributive relation"; when there are multiple layers of attributives, an attributive farther from the head depends on an attributive closer to the head or directly on the head, again with the dependency relation "attributive relation". Statistical study shows that practical relation extraction mainly needs to consider two-layer and three-layer attributive structures, which gives the relation extraction paradigm DSNF1: "Entity1 + AttWord1 (+ AttWord2) + Entity2"; its dependency analysis is shown in FIG. 2. Part-of-speech constraints are also applied: only the case where the attributive words (AttWord1, AttWord2) are nouns is considered. Relation extraction is performed only when one of the following two constraints is met: "AttWord1" is an occupation-related noun (mainly nouns related to organizations and jobs, such as director, general manager or county head); or "AttWord1" is a general noun (as opposed to an occupational noun) and "Entity2" is a physical entity.

2) Attributives connected to the head by the structural particle "的", i.e. "attributive + 的 + head". For example, from "<PER1>Zhangan</PER1> 的 wife <PER2>grandfather</PER2>" the relation triple (Zhangan, wife, grandfather) can be extracted. As a further example, "<ORG>a certain university</ORG> 的 teacher <PER>Pei</PER>" and "<ORG>a certain university</ORG> 的 <PER>Pei</PER> teacher" differ in structure but express the same meaning. The structure can therefore be expressed as two relation extraction paradigms, DSNF2 and DSNF3: "Entity1 + 的 + Noun + Entity2" and "Entity1 + 的 + Entity2 + Noun". From both structures the relation triple (Entity1, Noun, Entity2) can be extracted. Their mappings into dependency syntactic analysis are shown in FIG. 3 and FIG. 4.
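The pre-modification paradigms DSNF1 to DSNF3 can be sketched in Python as follows. Several things here are assumptions made for illustration: the `Token` record, the function names, the hand-written parses, and the use of the label "ATT" for the attributive relation (following the LTP labelling convention). The DSNF1 matcher checks the attributive arcs but omits the occupation-noun constraint, and the matcher for the two "的" paradigms is a deliberate simplification that inspects only the surface order of four adjacent tokens rather than the dependency arcs of FIG. 3 and FIG. 4.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str      # part-of-speech tag
    head: int     # 1-based index of the head token, 0 for the root
    deprel: str   # dependency label; "ATT" marks the attributive relation


def match_dsnf1(tokens, e1, e2):
    """DSNF1: Entity1 + AttWord1 (+ AttWord2) + Entity2, with 1-based e1 < e2."""
    between = range(e1 + 1, e2)
    if len(between) not in (1, 2):
        return None
    for i in range(e1, e2):
        tok = tokens[i - 1]
        # Every attributive must attach, via ATT, to a later word up to Entity2.
        if tok.deprel != "ATT" or not (i < tok.head <= e2):
            return None
    if not all(tokens[i - 1].pos.startswith("n") for i in between):
        return None
    relation = " ".join(tokens[i - 1].word for i in between)
    return (tokens[e1 - 1].word, relation, tokens[e2 - 1].word)


def match_de(words, pos, entities):
    """DSNF2 / DSNF3 on the surface order of exactly four adjacent tokens.

    entities is the set of 0-based positions that hold entity tokens.
    """
    if len(words) != 4 or 0 not in entities or words[1] != "的":
        return None
    if entities == {0, 3} and pos[2] == "n":   # Entity1 + 的 + Noun + Entity2
        return (words[0], words[2], words[3])
    if entities == {0, 2} and pos[3] == "n":   # Entity1 + 的 + Entity2 + Noun
        return (words[0], words[3], words[2])
    return None


# Hand-written parse of a three-layer attributive phrase (English glosses).
ceo = [Token("a company", "ni", 3, "ATT"), Token("chief", "n", 3, "ATT"),
       Token("executive officer", "n", 4, "ATT"), Token("Zhao", "nh", 0, "HED")]
print(match_dsnf1(ceo, 1, 4))
print(match_de(["a university", "的", "teacher", "Pei"], ["ni", "u", "n", "nh"], {0, 3}))
print(match_de(["a university", "的", "Pei", "teacher"], ["ni", "u", "nh", "n"], {0, 2}))
```

Both calls to `match_de` return the same triple, mirroring the observation that the two word orders express the same meaning.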

In relation extraction, a modifier-head phrase may contain only one entity noun, as in "Liu's teacher visits Shanghai" or "Xiaoming's wife is Xiaohong", and such phrases are often embedded in other relation syntactic classes. In this case the entity is an attributive modifying the head; in the dependency analysis the entity does not directly act as subject or object, whereas the head it modifies is a skeleton component of the syntactic structure. The relation extraction process takes this situation fully into account: the head is treated as a "pseudo-entity (Pe)" and converted accordingly during dependency analysis. For example, in "<PER>Liu</PER>'s <Pe-PER>teacher</Pe-PER> visits <LOC>Shanghai</LOC>", the relation "visit" between the pseudo-entity "teacher" and the entity "Shanghai" is extracted first, and the result is then converted and output as the relation triple (Liu, visit, Shanghai). This situation also arises in the analysis below and will not be described again. Pe-PER denotes a person-name pseudo-entity.

Second, verb-related (Verbal Class, VERB)

In this class, one of the two related entities is usually in subject position and the other in object position; the latter may be the object of the verb (verb-object structure) or the object of a preposition (Prep) (preposition-object structure), and the relation between the entities can be expressed directly by the predicate (Pred). According to the type of object, the class is further divided into the "subject-predicate-object" structure and the "subject-predicate-prepositional-object" structure.

1) The "subject-predicate-object" structure. For example, in "<PER>Liu</PER> visits <LOC>Shanghai</LOC>", "Liu" is the subject, "Shanghai" is the object, and "visit" is the predicate verb that links the two entities, so the triple (Liu, visit, Shanghai) can be extracted. When the "subject-predicate-object" structure is mapped into the dependency analysis graph, both entities depend on the core verb, with the dependency relations "subject-predicate relation" and "verb-object relation" respectively. This gives the relation extraction paradigm DSNF4: "Entity1 + Pred + Entity2", from which the relation triple (Entity1, Pred, Entity2) can be extracted. The dependency analysis is shown in FIG. 5. LOC denotes a place name.

2) The "subject-predicate-prepositional-object" structure. For example, in "<PER>Liu</PER> makes an in-depth tour of <LOC>Shanghai</LOC>", the subject is the entity "Liu" and the verb "make" is the predicate of the sentence; the subject entity depends on the verb with the dependency relation "subject-predicate relation". "Shanghai" and the preposition "对" form a prepositional phrase: the entity "Shanghai" depends on the preposition "对" with the dependency relation "preposition-object relation", and the preposition "对" depends on the predicate verb with the relation "adverbial-head structure". The noun phrase "in-depth tour" is the direct object of the predicate, so the relation triple (Liu, make in-depth tour, Shanghai) can be extracted. Note that because Entity2 sits inside the prepositional phrase and depends on the predicate verb only indirectly, through the preposition, the predicate and its direct object (Dobj) are used together as the relation feature words so that the extraction result carries a more definite meaning. The "subject-predicate-prepositional-object" structure is mapped to the relation extraction paradigm DSNF5: "Entity1 + Prep + Entity2 + Pred (+ Dobj)", from which the relation triple (Entity1, Pred-Dobj, Entity2) can be extracted. The dependency analysis is shown in FIG. 6.

In particular, for the "subject-predicate-prepositional-object" structure, if the preposition is a passive marker such as "被" ("by"), the positions of Entity1 and Entity2 are exchanged to form the relation triple (Entity2, Pred-Dobj, Entity1).
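The two verb-related paradigms, including the passive swap, can be sketched as follows. The `Token` record, the function names, the contents of `PASSIVE_PREPS` and all three parses are hypothetical and hand-written. The labels SBV, VOB, POB, ADV and HED follow the LTP labelling convention and are used here as an assumed encoding of the subject-predicate, verb-object, preposition-object and adverbial relations named in the text.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    head: int     # 1-based index of the head token, 0 for the root
    deprel: str   # LTP-style label: SBV, VOB, POB, ADV, HED


PASSIVE_PREPS = {"被"}  # assumption: extend with further passive markers


def match_dsnf4(tokens, e1, e2):
    """DSNF4: Entity1 (subject) and Entity2 (object) share one predicate."""
    t1, t2 = tokens[e1 - 1], tokens[e2 - 1]
    if t1.deprel == "SBV" and t2.deprel == "VOB" and t1.head == t2.head != 0:
        return (t1.word, tokens[t1.head - 1].word, t2.word)
    return None


def match_dsnf5(tokens, e1, e2):
    """DSNF5: Entity1 + Prep + Entity2 + Pred (+ Dobj)."""
    t1, t2 = tokens[e1 - 1], tokens[e2 - 1]
    if t1.deprel != "SBV" or t2.deprel != "POB" or 0 in (t1.head, t2.head):
        return None
    prep = tokens[t2.head - 1]
    if prep.deprel != "ADV" or prep.head != t1.head:
        return None
    pred = tokens[t1.head - 1]
    # Optional direct object of the same predicate joins the relation words.
    dobj = [t.word for t in tokens if t.deprel == "VOB" and t.head == t1.head]
    relation = " ".join([pred.word] + dobj[:1])
    if prep.word in PASSIVE_PREPS:          # passive: swap the two entities
        return (t2.word, relation, t1.word)
    return (t1.word, relation, t2.word)


# Hand-written parses using English glosses; not real parser output.
svo = [Token("Liu", 2, "SBV"), Token("visit", 0, "HED"), Token("Shanghai", 2, "VOB")]
spo = [Token("Liu", 4, "SBV"), Token("对", 4, "ADV"), Token("Shanghai", 2, "POB"),
       Token("make", 0, "HED"), Token("in-depth tour", 4, "VOB")]
passive = [Token("Liu", 4, "SBV"), Token("被", 4, "ADV"),
           Token("a company", 2, "POB"), Token("hire", 0, "HED")]
print(match_dsnf4(svo, 1, 3))
print(match_dsnf5(spo, 1, 3))
print(match_dsnf5(passive, 1, 3))
```

The passive sentence is an invented example; it only shows that the entity after the passive marker ends up first in the output triple.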

Third, parallel (Coordination Class, COOR)

Parallel structures are also quite common in Chinese sentences. A parallel relation holds between clauses or phrases that stand side by side or occur simultaneously; the parallel components differ only in order, not in importance. The items in a parallel relation may be different things that are related to each other, different aspects of the same thing, or different actions of the same subject. Phrases in a parallel relation are called parallel phrases (also coordinate phrases); they are generally formed by combining two or more nouns, verbs, adjectives, pronouns or numeral-classifier words, and the constituent words are generally required to have the same part of speech. The words stand side by side and are often joined by an enumeration comma or a conjunction (Conj) such as "and". Relation extraction mainly considers two kinds: parallel nouns and parallel verbs.

For example, in "<PER1>Liu</PER1> and <PER2>Peng</PER2> visit <ORG>Shanghai</ORG>", "Liu" and "Peng" are two words in a parallel relation. When two entities stand in such a parallel relation, they play the same role and share the same relations. In the example, the relation triple (Liu, visit, Shanghai) can be extracted; the same "visit" relation holds between "Peng", the parallel component of "Liu", and "Shanghai", so the relation triple (Peng, visit, Shanghai) can also be extracted. In fact, the COOR syntactic class depends on other syntactic classes for its existence: in the example above, the relation triple (Liu, visit, Shanghai) belongs to the VERB syntactic class. Because the entity "Peng" depends on the entity "Liu" with the dependency relation "parallel relation", any relation involving the entity "Liu" also applies to the entity "Peng". An entity occupies mainly one of three syntactic positions: subject, object of the predicate, or object of a preposition. Accordingly:

1) When the parallel nouns act as subject, the relation extraction paradigm DSNF6 is obtained: "Entity2 + Conj + (Entity1++) + Pred + Entity3", where (Entity1++) indicates one or more parallel entities (the same applies below). The triple (Entity1, Pred, Entity3) can be derived from the relation triple (Entity2, Pred, Entity3); the dependency relations are shown in FIG. 7.

2) When the parallel nouns act as the object of the predicate, the relation extraction paradigm DSNF7 is obtained: "Entity2 + Pred + Entity3 + Conj + (Entity1++)". The triple (Entity2, Pred, Entity1) can be derived from the relation triple (Entity2, Pred, Entity3); the dependency relations are shown in FIG. 8.

3) When the parallel nouns act as the object of a preposition, the relation extraction paradigm DSNF8 is obtained: "Entity2 + Prep + Entity3 + Conj + (Entity1++) + Pred (+ Dobj)". The triple (Entity2, Pred-Dobj, Entity1) can be derived from the relation triple (Entity2, Pred-Dobj, Entity3); the dependency relations are shown in FIG. 9.

4) Mixtures of the first three types. For example, "classmate <PER1>Li</PER1> and classmate <PER2>So-and-so</PER2>, in <ORG1>Shanghai</ORG1> and <ORG2>Hangzhou</ORG2> respectively, invited classmate <PER3>So-and-so</PER3> and classmate <PER4>Gao</PER4>" is a mixture of the first three types.
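The derivation step shared by DSNF6 to DSNF8 can be sketched as a small propagation function: a triple found by another paradigm is copied to every entity that stands in a parallel relation with one of its arguments. The function name `expand_parallel`, the dictionary layout and the sample data are assumptions made for illustration. Taking the full cross product when both arguments have parallel entities is also an assumption; sentences that pair items with "respectively" would need a different treatment.

```python
from itertools import product


def expand_parallel(triples, parallel):
    """Copy each triple to the entities that are parallel to its arguments.

    parallel maps an entity to the entities attached to it by the parallel
    relation, e.g. {"Liu": ["Peng"]}.
    """
    out = []
    for e1, rel, e2 in triples:
        firsts = [e1] + parallel.get(e1, [])
        seconds = [e2] + parallel.get(e2, [])
        out.extend((a, rel, b) for a, b in product(firsts, seconds))
    return out


base = [("Liu", "visit", "Shanghai")]   # e.g. found by the subject-predicate-object paradigm
derived = expand_parallel(base, {"Liu": ["Peng"]})
print(derived)
```

The original triple is kept at the front of the result, followed by the derived ones.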

Parallel verbs mainly describe two different actions performed simultaneously by the same subject. There are two cases:

1) The first case is the serial verb construction. In Chinese sentences, when one verb cannot fully describe an action, two verbs are often used in succession: the first verb supplements the second, and the second is a transitive verb, so the second verb, which is closer to the object, is generally extracted as the relation feature word. For example, in "<PER>Zhang</PER> treads through the snow to visit <LOC>Lushan</LOC>", "tread through the snow" and "visit" form a parallel relation, and the relation (Zhang, visit, Lushan) can be extracted.

2) The second case is the parallel compound sentence, in which several clauses are semantically in an equal, parallel relation. If two or more events stand in a parallel relation with no causal connection, they can form a parallel compound sentence. The clauses are often separated by commas and joined by conjunctions such as "and" or "also". For example, consider the sentence "<ORG1>a certain company</ORG1> manager <PER>Gao</PER> visited <ORG2>the factory building</ORG2>, and issued production guidance suggestions at <ORG3>a certain workshop</ORG3>." The comma divides the compound sentence into two clauses that express two events with the same subject, the entity "Gao", so the two clauses are parallel. The predicates "visit" and "issue" of the parallel clauses are themselves parallel, with the dependency relation "parallel relation". Mapped into dependency syntax, this can be described as follows: if Entity2 depends on verb 2 as its object, verb 2 is parallel to another verb 1 (the dependency relation being "parallel relation"), and Entity1 depends on verb 1 as its subject, then it can be inferred that a relation exists between Entity1 and Entity2 and that the relation word is verb 2. This gives the relation extraction paradigm DSNF9: "Entity1 + Pred1 + Pred2 + Entity2"; the dependency analysis is shown in FIG. 10. The paradigm DSNF9 covers both of the above cases.
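The dependency description of DSNF9 translates almost directly into a check on three arcs. The sketch below is an illustration under stated assumptions: the `Token` record, the function name and the parses are hypothetical and hand-written, and the labels SBV, VOB, COO and HED follow the LTP labelling convention as an assumed encoding of the subject, object and parallel relations named in the text.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    head: int     # 1-based index of the head token, 0 for the root
    deprel: str   # LTP-style label: SBV, VOB, COO, HED


def match_dsnf9(tokens, e1, e2):
    """DSNF9: Entity1 + Pred1 + Pred2 + Entity2; the relation word is Pred2."""
    t1, t2 = tokens[e1 - 1], tokens[e2 - 1]
    if t1.deprel != "SBV" or t2.deprel != "VOB" or 0 in (t1.head, t2.head):
        return None
    pred2 = tokens[t2.head - 1]
    # Pred2 must be parallel to the verb that Entity1 is the subject of.
    if pred2.deprel == "COO" and pred2.head == t1.head:
        return (t1.word, pred2.word, t2.word)
    return None


# Hand-written parse of a serial-verb sentence (English glosses).
serial = [Token("Zhang", 2, "SBV"), Token("tread snow", 0, "HED"),
          Token("visit", 2, "COO"), Token("Lushan", 3, "VOB")]
plain = [Token("Liu", 2, "SBV"), Token("visit", 0, "HED"), Token("Shanghai", 2, "VOB")]
print(match_dsnf9(serial, 1, 4))
print(match_dsnf9(plain, 1, 3))
```

A plain subject-predicate-object sentence is rejected by this matcher because its only verb is the root rather than a parallel verb, so that case is left to the simpler verb-related paradigm.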

It is worth mentioning that parallel structures are nested within other syntactic classes. The paradigms DSNF6, DSNF7, DSNF8 and DSNF9 only cover the case in which parallel nouns depend on the subject-predicate and verb-object structures of the verb class (VERB); other cases are not described in detail. The actual extraction steps are similar: when Entity1 and Entity2 are in a parallel relationship, if the triple (Entity2, relation word, Entity3) holds, the relation triple (Entity1, relation word, Entity3) can be obtained; if the triple (Entity3, relation word, Entity2) holds, the relation triple (Entity3, relation word, Entity1) can be obtained.
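The propagation rule in the preceding paragraph can be sketched in a few lines of Python. The function name `propagate_parallel` and the representation of parallel entities as `(Entity1, Entity2)` pairs are hypothetical choices for this example, not part of the patent.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def propagate_parallel(triples: Iterable[Triple],
                       parallel_pairs: Iterable[Tuple[str, str]]) -> Set[Triple]:
    """Copy each triple that mentions Entity2 onto its parallel partner Entity1.

    Each pair (entity1, entity2) states that entity1 is parallel to entity2.
    This is a single pass: newly derived triples are not propagated again.
    """
    base = list(triples)
    result = set(base)
    for entity1, entity2 in parallel_pairs:
        for subj, rel, obj in base:
            if subj == entity2:          # (Entity2, rel, Entity3) -> (Entity1, rel, Entity3)
                result.add((entity1, rel, obj))
            if obj == entity2:           # (Entity3, rel, Entity2) -> (Entity3, rel, Entity1)
                result.add((subj, rel, entity1))
    return result


derived = propagate_parallel(
    [("B", "visits", "C"), ("D", "praises", "B")],
    [("A", "B")],
)
```

Here `derived` contains the two original triples plus the two copies in which "A" replaces "B".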

Fourthly, the formulaic class (Formula Class, FORM)

The FORM class covers expressions that occur frequently in Chinese and cannot be assigned to any of the preceding syntactic relation classes, but that generally follow a fixed format. For example, in "Wang Chi, a university professor, published … …", no connecting word can be found between "Wang Chi" and "a university professor", and neither directly modifies the other, so the sentence does not fit the classes above. Nevertheless, the entity relation triple (Wang, professor, university) can be extracted from it. Such expressions are common and reflect Chinese writing habits. For these special grammatical structures, good results can only be obtained by extracting templates and matching them rigidly.
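One way to picture rigid template matching is the following minimal Python sketch. The template itself (an organization, then a noun, then a person, in that order), the tag names `ORG`, `n` and `PER`, and the function name `match_title_template` are assumptions introduced for this example; the patent does not list its templates here.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, tag)


def match_title_template(tagged: List[Tagged]) -> List[Tuple[str, str, str]]:
    """Rigidly match the assumed pattern ORG + noun + PER over adjacent tokens.

    Each match yields the triple (person, noun, organization).
    """
    triples = []
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tagged[i:i + 3]
        if (t1, t2, t3) == ("ORG", "n", "PER"):
            triples.append((w3, w2, w1))
    return triples


# Hypothetical tagged sentence following the assumed word order.
sentence = [
    ("UniversityX", "ORG"),
    ("professor", "n"),
    ("PersonY", "PER"),
    ("published", "v"),
]
found = match_title_template(sentence)
```

Because the match is rigid, any intervening token breaks it, which is why this class relies on templates rather than on the dependency paradigms.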

Fifthly, others (Other Class)

All other relation types that cannot currently be classified are grouped into this class. Because of the uncertainty of this class, it is not studied further here.

The invention discloses a Chinese unsupervised open entity relation extraction method based on dependency semantics. It overcomes defects of traditional methods such as heavy dependence on manual annotation and unreasonable results. Building on the distinctive and flexible syntactic characteristics of Chinese and on the mapping between entity relations and the dependency parse tree, it mines the dependency semantics implied by the shortest dependency path and obtains a set of Dependency Semantic Normal Forms (DSNFs), using features such as dependency relations, part-of-speech information and positional relations as constraints. With this paradigm set, entity relations can be extracted quickly and accurately from massive data. The method requires no manual work, achieves fully automatic extraction, does not depend on model training corpora, has low computational complexity and high extraction efficiency, and can meet demanding real-time requirements. The invention can be widely applied in artificial intelligence fields such as knowledge graphs, intelligent search engines, automatic question answering systems, text mining and machine translation.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A Chinese unsupervised open entity relation extraction method based on dependency semantics, characterized in that the method comprises the following steps:

S1. Preprocess the input text: perform Chinese word segmentation, part-of-speech tagging and dependency syntax analysis on the input text;

S2. Perform named entity recognition on the input text;

S3. Arbitrarily select two entities from the identified entities to form a candidate entity pair;

S4. Find a dependency path between two entities in the candidate entity pair;

S5. Analyze whether the syntactic structure mapped by the dependency path between the two entities in the candidate entity pair matches a paradigm in the dependency semantic paradigm set; if so, extract a word or phrase from the rest of the input text as the relation word according to the matched paradigm, and form a relation triple from the extracted relation word and the candidate entity pair; otherwise, perform paradigm matching on the next candidate entity pair;

S6. Output relation triples;

The dependency semantic paradigm set includes a first class (pre-modifier structure class), a second class (parallel structure class), a third class (verb-related class), a fourth class (templated class) and an other class;

The first class, the pre-modifier structure class, includes the combined attributive structure and the structure in which the structural particle "de" connects to the head. The combined attributive structure corresponds to the dependency semantic paradigm "Entity1+AttWord1(+AttWord2)+Entity2"; the structure in which the structural particle "de" connects to the head corresponds to the dependency semantic paradigm "Entity1++Noun+Entity2" or "Entity1++Entity2+Noun", where Entity1 and Entity2 are the entity pair between which a relation holds, AttWord1 and AttWord2 are different attributive words, and Noun is a noun;

The second class, the parallel structure class, includes the parallel noun structure and the parallel verb structure;

The parallel noun structure includes the structure with parallel entities as subject, the structure with parallel entities as verb object, the structure with parallel entities as prepositional object, and mixtures of the three. Parallel entities as subject correspond to the dependency semantic paradigm "Entity2+Conj+(Entity1++)+Pred+Entity3"; parallel entities as verb object correspond to the dependency semantic paradigm "Entity2+Pred+Entity3+Conj+(Entity1++)"; and parallel entities as prepositional object correspond to the dependency semantic paradigm "Entity2+Prep+Entity3+Conj+(Entity1++)+Pred(+Dobj)", where Entity2 and Entity3 are the entity pair between which a relation holds, (Entity1++) denotes one or more parallel entities, Conj is a conjunction, Pred is a predicate, Prep is a preposition and Dobj is a direct object. The parallel verb structure corresponds to the dependency semantic paradigm "Entity1+Pred1+Pred2+Entity2", where Entity1 and Entity2 are the entity pair between which a relation holds, and Pred1 and Pred2 are two predicates in a parallel relationship;

The third class, the verb-related class, includes the subject-predicate verb-object structure and the subject-predicate preposition-object structure. The subject-predicate verb-object structure corresponds to the dependency semantic paradigm "Entity1+Pred+Entity2", and the subject-predicate preposition-object structure corresponds to the dependency semantic paradigm "Entity1+Prep+Entity2+Pred(+Dobj)", where Entity1 and Entity2 are the entity pair between which a relation holds, Pred is a predicate, Prep is a preposition, and Dobj is a direct object;

The relation triple has the form (Entity1, RelationWords, Entity2), where Entity1 and Entity2 are the entity pair between which a relation holds, and RelationWords is a word or phrase describing the semantic relation between the entities; the parallel verb structure includes the linked-verb structure and the parallel compound sentence structure.