
CN107291687B - Chinese unsupervised open type entity relation extraction method based on dependency semantics - Google Patents


Info

Publication number
CN107291687B
Authority
CN
China
Prior art keywords
entity
dependency
entity1
entity2
paradigm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710285995.4A
Other languages
Chinese (zh)
Other versions
CN107291687A (en)
Inventor
向阳
贾圣宾
鄂世嘉
吕东东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201710285995.4A
Publication of CN107291687A
Application granted
Publication of CN107291687B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Chinese unsupervised open entity relation extraction method based on dependency semantics. The method comprises the following steps: preprocessing the input text by performing Chinese word segmentation, part-of-speech tagging and dependency parsing; performing named entity recognition on the input text; arbitrarily selecting two of the recognized entities to form a candidate entity pair; finding the dependency path between the two entities of the candidate entity pair; checking whether the syntactic structure mapped by the dependency path matches a paradigm in the dependency semantic paradigm set and, if so, extracting a word or phrase from the remaining part of the input text as the relation word according to the matched paradigm, the relation word and the candidate entity pair forming a relation triple, and otherwise proceeding to paradigm matching for the next candidate entity pair; and outputting the relation triples. Compared with the prior art, the invention has the advantages of low computational complexity, high extraction efficiency, freedom from distance and position restrictions, and the ability to extract relations from a single sentence.

Figure 201710285995

Description

Chinese unsupervised open type entity relation extraction method based on dependency semantics
Technical Field
The invention relates to information extraction research in the fields of artificial intelligence and natural language processing, in particular to a Chinese unsupervised open entity relation extraction method based on dependency semantics.
Background
The wave of big data is surging like the tidal bore of the Qiantang River, and the data accumulated on the internet is growing explosively. Faced with massive amounts of information on the web, users find it very difficult to locate quickly the information they care about. A traditional search engine can only return a large number of web pages related to the query, and the user obtains the required information only after browsing those pages. This single search mode of returning web pages cannot meet users' actual needs when dealing with massive network data. The internet provides an inexhaustible source of information, and how to mine valuable information from it quickly, accurately and automatically has become a focus of attention.
Information extraction technology emerged in response. The task of information extraction is to output the unstructured information contained in text in a structured or semi-structured form, so that the content users really care about can be obtained quickly and intelligent, user-friendly information services can be provided. For example, information such as the persons involved, time, location, casualties and cause of the accident is extracted from news reports of an airplane crash, allowing users to grasp the course of the event quickly. Named entity relation extraction, also called entity relation extraction or relation extraction, is a core subtask of information extraction. It extracts the semantic relations between related named entities from unstructured natural language text and organizes them into structured relation triples (Entity1, relation word, Entity2), where Entity1 and Entity2 are an entity pair that has a relation, and the relation word is a word or word sequence describing the semantic relation between the entities.
Entity relation extraction has important research value and is widely applied in many fields of artificial intelligence, such as knowledge graphs, intelligent search engines, automatic question answering systems, text mining and machine translation.
Conventional information extraction identifies target relation types through a trained extractor, which requires predefined relation types and a large labeled corpus. Traditional Chinese relation extraction is based on supervised machine learning algorithms, mainly feature-based methods and kernel-based methods. Such methods have several disadvantages. First, it is difficult to define a comprehensive hierarchy of entity relation types. Second, they depend heavily on large-scale labeled training corpora; manual labeling is time-consuming and laborious, and labeling quality is difficult to control. Finally, open web text is vast and its relation types cannot be predefined, so traditional methods cannot meet the needs of open-domain information extraction. Open entity relation extraction overcomes these shortcomings: it can automatically discover arbitrary relation types in web text, and it has important development prospects and research value. Research on open relation extraction has mainly applied clustering algorithms: candidate entity pairs are extracted using position restrictions, distance restrictions and the like; clustering produces clusters of similar entity pairs; each cluster is labeled with a relation class; and representative words are selected as the relation descriptors of that class.
Such methods have two problems. First, a clustering algorithm needs a considerable number of related entity pairs, so it cannot produce useful results for a single entity pair or a small number of pairs, and its performance degrades severely when the corpus is insufficient. Second, it is difficult to determine whether the core relation word finally selected is a valid relation feature word, and the descriptor finally chosen for a cluster does not necessarily fit every entity pair in that cluster. In addition, methods based on deep syntactic analysis or semantic role labeling have achieved good results, but this research has mainly focused on English corpora.
Open relation extraction has been studied extensively on English corpora, but relatively little research has been done on Chinese corpora. Chinese has its own flexibility and complexity in word formation, sentence construction and expression, which makes it far more difficult to handle than English, so existing English entity relation extraction systems cannot be applied directly to Chinese corpora. An entity relation extraction system suited to Chinese can be obtained only by carefully studying Chinese lexis and syntax and introducing them into entity relation extraction.
Research shows that when entity relationship extraction is carried out, certain syntactic relationships often exist between entity pairs with relationships. For example, if two entities are the subject and object of a sentence, respectively, then the relational feature words of the entity pairs are most likely predicate verbs. If the syntactic relation between the entity pair is known in advance, the relation characteristic words between the entity pair can be accurately determined. Dependency parsing may reflect semantic modification relationships between the components of a sentence. Since the named entity in the sentence must appear as a noun phrase in the dependency structure, the dependency path between the entities must reflect the relationship characteristics of the corresponding entity pair.
In summary, to make entity relation extraction better suited to Chinese corpora, the invention builds on the syntactic and semantic features specific to Chinese and demonstrates the adaptability and effectiveness of an unsupervised method in the open domain. The invention provides an unsupervised Chinese open relation extraction method based on Dependency Semantic Normal Forms (DSNFs), bringing an innovative result to research on Chinese open relation extraction.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a Chinese unsupervised open entity relation extraction method based on dependency semantics. Traditional extraction methods place high demands on training corpora, have poor portability and extensibility, and cannot adapt to open web text. Moreover, extraction methods developed for English corpora cannot be transplanted to Chinese because of the complexity and flexibility of Chinese lexis and grammar. The invention therefore provides an open, unsupervised entity relation extraction method for web text that is based on the characteristics of the Chinese language.
To solve these technical problems, the invention takes the mapping between entity relations and the dependency parse tree as its basis, mines the dependency semantics implied by the shortest dependency path, and derives dependency semantic normal forms by using dependency relations, part-of-speech information, positional relations and similar features as constraints, thereby providing and implementing a new unsupervised Chinese open relation extraction method.
The purpose of the invention can be realized by the following technical scheme:
A Chinese unsupervised open entity relation extraction method based on dependency semantics comprises the following steps:
S1, preprocessing the input text: performing Chinese word segmentation, part-of-speech tagging and dependency parsing on the input text;
S2, performing named entity recognition on the input text;
S3, arbitrarily selecting two of the recognized entities to form a candidate entity pair;
S4, finding the dependency path between the two entities of the candidate entity pair;
S5, checking whether the syntactic structure mapped by the dependency path between the two entities of the candidate entity pair matches a paradigm in the dependency semantic paradigm set; if so, extracting a word or phrase from the remaining part of the input text as the relation word according to the matched paradigm, the extracted relation word and the candidate entity pair forming a relation triple; if not, performing paradigm matching for the next candidate entity pair;
S6, outputting the relation triples.
The relation triple has the form (Entity1, relation word, Entity2), where Entity1 and Entity2 are an entity pair that has a relation, and the relation word describes the semantic relation between the entities.
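Steps S3 and S4 can be illustrated with a minimal Python sketch. It assumes that the dependency parse is available as a list of head indices (a hypothetical representation, with -1 marking the root) and that entities are identified by token index; neither detail is specified in the text. The path search treats the dependency tree as an undirected graph and uses breadth-first search, which is one straightforward way to obtain the path between two tokens, not necessarily the procedure used by the inventors.

```python
from collections import deque
from itertools import combinations


def candidate_pairs(entities):
    """S3: every unordered pair of recognized entities, in input order."""
    return list(combinations(entities, 2))


def dependency_path(heads, start, goal):
    """S4: path between two tokens in the dependency tree.

    heads[i] is the index of token i's head; the root has head -1
    (an assumed encoding). The tree is searched breadth-first as an
    undirected graph, so the returned path is the shortest one.
    """
    neighbours = {i: set() for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            neighbours[child].add(head)
            neighbours[head].add(child)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours[node]:
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Hand-written toy parse of "Liu visits Shanghai": both entities depend on the verb.
words = ["Liu", "visits", "Shanghai"]
heads = [1, -1, 1]
pairs = candidate_pairs([0, 2])
path = dependency_path(heads, 0, 2)
print(pairs, [words[i] for i in path])
```

In this toy parse the path between the two entities passes through the verb, which is the kind of structure the paradigms in step S5 are matched against.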
The dependency semantic paradigms comprise a first class of pre-modification structures, a second class of coordinate (parallel) structures, a third class of verb-related structures, a fourth class of templated structures, and other classes.
The first class, pre-modification structures, comprises the combined attributive structure and the structure in which the attributive is connected to the head word by a structural particle. The combined attributive structure corresponds to the dependency semantic paradigm "Entity1 + AttWord1 (+ AttWord2) + Entity2". The structure connected by the structural particle corresponds to the dependency semantic paradigm "Entity1 + 的 + Noun + Entity2" or "Entity1 + 的 + Entity2 + Noun", where 的 ("of") is the structural particle. Entity1 and Entity2 are an entity pair that has a relation, AttWord1 and AttWord2 are different attributive words, and Noun is a noun.
The second class, coordinate structures, comprises coordinate noun structures and coordinate verb structures.
The coordinate noun structures comprise coordinate entities serving as subject, coordinate entities serving as object of the predicate, coordinate entities serving as object of a preposition, and mixtures of these three. Coordinate entities serving as subject correspond to the dependency semantic paradigm "Entity2 + Conj + (Entity1)+ + Pred + Entity3"; coordinate entities serving as object of the predicate correspond to "Entity2 + Pred + Entity3 + Conj + (Entity1)+"; and coordinate entities serving as object of a preposition correspond to "Entity2 + Prep + Entity3 + Conj + (Entity1)+ + Pred (+ Dobj)". Here Entity2 and Entity3 are an entity pair that has a relation, (Entity1)+ indicates one or more coordinate entities, Conj is a conjunction, Pred is a predicate, Prep is a preposition, and Dobj is a direct object.
The coordinate verb structures comprise the serial verb structure and the coordinate compound sentence structure.
The third class, verb-related structures, comprises the subject-verb-object structure and the subject-verb-prepositional-object structure. The subject-verb-object structure corresponds to the dependency semantic paradigm "Entity1 + Pred + Entity2", and the subject-verb-prepositional-object structure corresponds to "Entity1 + Prep + Entity2 + Pred (+ Dobj)", where Entity1 and Entity2 are an entity pair that has a relation, Pred is a predicate, Prep is a preposition, and Dobj is a direct object.
Compared with the prior art, the invention has the following advantages:
1) the method is well able to cope with complex Chinese syntax, and the relative positions of the entity pair and the relation word need not be restricted during extraction, which avoids the defects caused by position restrictions in traditional methods;
2) the method yields richer results: it can extract relation phrases whose core is either a verb or a noun, whereas other well-performing extractors can extract only verbs as relation phrases;
3) the method better identifies long-span dependencies and, in particular, can extract co-occurring relation triples in the case of coordinate structures, which avoids the defects caused by distance restrictions in traditional methods;
4) the method needs no training corpus, can extract relations from a single sentence, has low computational complexity and high extraction efficiency, and can meet demanding real-time requirements.
Drawings
FIG. 1 is a schematic flow chart of the extraction method of the present invention;
FIG. 2 is a dependency semantic paradigm DSNF1 graph model;
FIG. 3 is a dependency semantic paradigm DSNF2 graph model;
FIG. 4 is a dependency semantic paradigm DSNF3 graph model;
FIG. 5 is a dependency semantic paradigm DSNF4 graph model;
FIG. 6 is a dependency semantic paradigm DSNF5 graph model;
FIG. 7 is a dependency semantic paradigm DSNF6 graph model;
FIG. 8 is a dependency semantic paradigm DSNF7 graph model;
FIG. 9 is a dependency semantic paradigm DSNF8 graph model;
FIG. 10 is a dependency semantic paradigm DSNF9 graph model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Examples
The invention provides a Chinese unsupervised open entity relation extraction method based on dependency semantics, that is, an entity relation extraction method based on dependency semantic normal forms (DSNFs). Extraction is automatic and needs no manual intervention: the input is natural language sentences without any prior processing, and the output is entity relation triples. As shown in FIG. 1, the entire process can be described as follows:
Step 1: Preprocess the input text. Each sentence undergoes a series of natural language processing operations, including word segmentation, part-of-speech tagging and dependency parsing, in preparation for the subsequent steps. The method performs these operations using the natural language processing technology of the Language Technology Platform (LTP) developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology.
Step 2: Select candidate entity pairs. Entity recognition is performed on the input text by a named entity recognition module, and all recognized candidate entities are then paired two by two. The method uses both the named entity recognition provided by LTP and an iterative heuristic method. The latter obtains maximal noun phrases by merging adjacent nouns, where the part-of-speech tag of each noun must be one of {ni, nh, ns, nz, j}, denoting organization name, person name, geographical name, other proper noun and abbreviation, respectively. The two methods complement each other and are applied together.
Step 3: Match the dependency semantic paradigms. For each candidate entity pair obtained in Step 2, analyze whether the syntactic structure mapped by the shortest dependency path between the entities matches some DSNF.
Step 4: Output the relation triples. After Step 3, if a paradigm is matched, the relation word is extracted according to it and the relation triple is output; if not, the next candidate entity pair is matched.
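The heuristic half of the entity recognition step (merging adjacent nouns whose tags fall in {ni, nh, ns, nz, j}) can be sketched as follows. This is a simplified single-pass version written for illustration: the text calls the method iterative without giving details, and the function name, the `(word, pos)` input format and the sample sentence are all assumptions.

```python
# Part-of-speech tags accepted by the heuristic, as listed in the text.
ENTITY_POS = {"ni", "nh", "ns", "nz", "j"}


def merge_adjacent_nouns(tokens):
    """Merge each maximal run of adjacent entity-like nouns into one candidate.

    tokens: list of (word, pos) pairs after segmentation and POS tagging.
    Returns a list of (start, end, text) spans, with end exclusive.
    """
    spans, start = [], None
    # A sentinel with an empty tag closes a run that reaches the end of the sentence.
    for i, (_, pos) in enumerate(tokens + [("", "")]):
        if pos in ENTITY_POS:
            if start is None:
                start = i
        elif start is not None:
            text = "".join(word for word, _ in tokens[start:i])
            spans.append((start, i, text))
            start = None
    return spans


# Hypothetical tagged sentence: "Shanghai / Tongji University / is located in / China".
tokens = [("上海", "ns"), ("同济大学", "ni"), ("位于", "v"), ("中国", "ns")]
print(merge_adjacent_nouns(tokens))
```

In a full pipeline these spans would be combined with the output of the LTP named entity recognizer before pairing; how the two result sets are reconciled is not described in the text and is left out here.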
The core of the method lies in the dependency semantic paradigms, which are described in detail below.
Statistical analysis of a large number of relation instances shows that relation triples always appear in certain fixed syntactic structures. The syntactic structures that characterize entity relations include the subject-verb relation, the verb-object relation, the preposition-object relation, coordinate components, modification relations and so on. Mapping these structures onto dependency trees yields the dependency semantic normal forms (DSNFs). A DSNF is a combination of a word sequence, parts of speech, dependency paths and their associated dependency labels. The method divides the paradigm set into five major classes (pre-modification, coordination, verb-related, templated and other), and each class contains one or more DSNFs, which provide a sound basis for relation extraction.
First, Pre-Modification (Pre-Modification Class, PreMod)
Pre-modification is a very important type of modification in Chinese phrases. From the viewpoint of Chinese linguistics, relations in the PreMod syntactic class are expressed by a modifier-head structure, which forms a modifier-head phrase. A modifier-head phrase consists of a modifier and the head word it modifies; in a nominal modifier-head phrase, the attributive is the component attached before the head. A wide range of elements can serve as attributives: apart from adverbs and "的" phrases, other kinds of content words (nouns, verbs and adjectives) and phrases can also be attributives. Attributives are further complicated by their multi-level nature: restrictive and descriptive attributives can be stacked before one head word from different angles, so a head word may carry several attributives.
In terms of formal structure, attributives can be divided into the following two types:
1) The combined attributive is placed directly before the head word with no "的" in between, that is, "attributive + head word". For example, in "<ORG>Senior-3 Class 2</ORG> head teacher <PER>Wang</PER>", "Senior-3 Class 2" is the attributive of "head teacher", and "Senior-3 Class 2 head teacher" is the attributive of "Wang"; "head teacher" also expresses the semantic relation between the entities "Senior-3 Class 2" and "Wang", which gives the relation triple (Senior-3 Class 2, head teacher, Wang). Because attributives are multi-level, several words may combine to form the relation feature word. For example, from "<ORG>a company</ORG> chief executive officer <PER>Zhao</PER>" the relation (a company, chief executive officer, Zhao) can be extracted, where the relation feature word combines "chief" and "executive officer". PER denotes a person name, and ORG denotes an organization name.
In dependency analysis, the combined attributive structure is expressed as follows: an attributive depends on the head word with the attributive relation, and when there are several levels of attributives, an attributive far from the head word depends either on an attributive nearer the head word or directly on the head word, again with the attributive relation. Based on statistical study, actual relation extraction mainly considers two-level and three-level attributive structures, which gives the relation extraction paradigm DSNF1: "Entity1 + AttWord1 (+ AttWord2) + Entity2"; the dependency analysis is shown in FIG. 2. Part of speech is also constrained: only attributive words (AttWord1, AttWord2) that are nouns are considered. Relation extraction is performed only if one of two constraints is satisfied: either AttWord1 is an occupation-related noun (mainly nouns related to organizations and jobs, such as director, general manager or county magistrate), or AttWord1 is a general noun (as opposed to a professional noun) and Entity2 is a physical entity.
2) The attributive is connected to the head word by the structural particle "的" ("of"), that is, "attributive + 的 + head word". For example, from "<PER1>Zhangan</PER1>'s wife <PER2>grandfather</PER2>" the relation tuple (Zhangan, wife, grandfather) can be extracted. Likewise, "<ORG>a university</ORG>'s teacher <PER>Pei</PER>" and "<ORG>a university</ORG>'s <PER>Pei</PER> teacher" differ in structure but express the same meaning. The structure can therefore be expressed as two relation extraction paradigms, DSNF2 and DSNF3: "Entity1 + 的 + Noun + Entity2" and "Entity1 + 的 + Entity2 + Noun". From both structures the relation triple (Entity1, Noun, Entity2) can be extracted. Their dependency parse forms are shown in FIG. 3 and FIG. 4.
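The three pre-modification paradigms can be sketched as a small matcher. Several things here are assumptions made for illustration: the parse is given as parallel lists (`words`, `pos`, `heads`, `rels`) with -1 as the root head, the attributive relation is labelled `ATT` in the style of LTP, the generic noun tag is `n`, and tokens are English glosses so that the output matches the examples in the text. DSNF1 is checked against the attributive chain described above; DSNF2 and DSNF3 are checked only on the surface pattern, because the dependency structures of FIG. 3 and FIG. 4 are not reproduced in the text. The occupation-noun constraint on AttWord1 is omitted.

```python
def reaches_by_att(heads, rels, start, goal):
    """True if following head links labelled ATT from start arrives at goal."""
    node = start
    while node != goal:
        if node < 0 or rels[node] != "ATT":
            return False
        node = heads[node]
    return True


def match_premod(words, pos, heads, rels, e1, e2):
    """Return (Entity1, relation word, Entity2) if DSNF1, 2 or 3 matches, else None.

    e1 and e2 are the token indices of the two entities, with e1 before e2.
    """
    between = list(range(e1 + 1, e2))
    # DSNF1: Entity1 + AttWord1 (+ AttWord2) + Entity2, all linked by ATT.
    if (1 <= len(between) <= 2
            and all(pos[i] == "n" for i in between)
            and reaches_by_att(heads, rels, e1, e2)
            and all(reaches_by_att(heads, rels, i, e2) for i in between)):
        return (words[e1], " ".join(words[i] for i in between), words[e2])
    # DSNF2: Entity1 + 的 + Noun + Entity2 (surface check only).
    if len(between) == 2 and words[between[0]] == "的" and pos[between[1]] == "n":
        return (words[e1], words[between[1]], words[e2])
    # DSNF3: Entity1 + 的 + Entity2 + Noun (surface check only).
    nxt = e2 + 1
    if ([words[i] for i in between] == ["的"]
            and nxt < len(words) and pos[nxt] == "n"):
        return (words[e1], words[nxt], words[e2])
    return None


# Hand-written toy parses using English glosses of the examples in the text.
dsnf1 = match_premod(["a company", "chief", "executive officer", "Zhao"],
                     ["ni", "n", "n", "nh"], [2, 2, 3, -1],
                     ["ATT", "ATT", "ATT", "HED"], 0, 3)
dsnf2 = match_premod(["a university", "的", "teacher", "Pei"],
                     ["ni", "u", "n", "nh"], [2, 0, 3, -1],
                     ["ATT", "RAD", "ATT", "HED"], 0, 3)
dsnf3 = match_premod(["a university", "的", "Pei", "teacher"],
                     ["ni", "u", "nh", "n"], [3, 0, 3, -1],
                     ["ATT", "RAD", "ATT", "HED"], 0, 2)
print(dsnf1, dsnf2, dsnf3)
```

Both particle-based forms yield the same triple, which mirrors the observation that the two word orders express the same meaning.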
During relation extraction, a modifier-head phrase may contain only one entity noun, as in "Teacher Liu visits Shanghai" or "Xiaoming's wife is Xiaohong", and such a phrase is often embedded in another relation syntactic class. In this case the entity is an attributive modifying the head word; in the dependency analysis the entity does not itself serve as subject or object, whereas the head word it modifies is a backbone component of the syntactic structure. The relation extraction process takes this situation fully into account: the head word is treated as a "pseudo-entity (Pe)" and converted accordingly during dependency analysis. For example, in "<PER>Liu</PER> <Pe-PER>teacher</Pe-PER> visits <LOC>Shanghai</LOC>", the relation "visit" between the pseudo-entity "teacher" and the entity "Shanghai" is extracted first, and the result is then converted and output as the relation triple (Liu, visit, Shanghai). This situation recurs in the analysis below and is not described again. Pe-PER denotes a person-name pseudo-entity.
Second, Verb-Related (Verbal Class, VERB)
In this class there are two related entities: one usually occupies the subject position and the other the object position, either as the object of the verb (verb-object structure) or as the object of a preposition (Prep) (preposition-object structure), and the relation between the entities can be expressed directly by the predicate (Pred). According to the kind of object, the class is further divided into the "subject-verb-object" structure and the "subject-verb-prepositional-object" structure.
1) The "subject-verb-object" structure. For example, in "<PER>Liu</PER> visits <LOC>Shanghai</LOC>", "Liu" is the subject, "Shanghai" is the object, and "visit" is the predicate verb that links the two entities, so the triple (Liu, visit, Shanghai) can be extracted. When the "subject-verb-object" structure is mapped onto the dependency graph, both entities depend on the core verb, with the subject-verb relation and the verb-object relation respectively. This gives the relation extraction paradigm DSNF4: "Entity1 + Pred + Entity2", from which the relation triple (Entity1, Pred, Entity2) can be extracted. The dependency analysis is shown in FIG. 5. LOC denotes a geographical name.
2) The "subject-verb-prepositional-object" structure. For example, in "<PER>Liu</PER> makes an in-depth visit to <LOC>Shanghai</LOC>", the subject is the entity "Liu" and the verb "make" is the predicate of the sentence; the subject entity depends on the verb with the subject-verb relation. "To Shanghai" forms a prepositional phrase: the entity "Shanghai" depends on the preposition "to" with the preposition-object relation, and the preposition "to" depends on the predicate verb with the adverbial relation. The noun phrase "in-depth visit" is the direct object of the predicate, so the relation tuple (Liu, make in-depth visit, Shanghai) can be extracted. Note that because Entity2 sits inside the prepositional phrase and depends on the predicate verb only indirectly, through the preposition, the predicate verb and its direct object (Dobj) are used together as the relation feature word so that the extraction result has a more definite meaning. The "subject-verb-prepositional-object" structure maps to the relation extraction paradigm DSNF5: "Entity1 + Prep + Entity2 + Pred (+ Dobj)", from which the relation triple (Entity1, Pred-Dobj, Entity2) can be extracted. The dependency analysis is shown in FIG. 6.
In particular, for the "subject-verb-prepositional-object" structure, if the preposition is a passive marker such as "by", the positions of Entity1 and Entity2 are exchanged to form the relation triple (Entity2, Pred-Dobj, Entity1).
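DSNF4 and DSNF5, including the passive swap, can be sketched as below. The dependency labels `SBV` (subject-verb), `VOB` (verb-object), `POB` (preposition-object) and `ADV` (adverbial) follow the LTP label set and are an assumption about how the relations named in the text are encoded; the parallel-list parse format, the `PASSIVE_PREPOSITIONS` set and the passive test sentence are likewise hypothetical.

```python
# Assumption: passive markers are recognized by surface form; extend as needed.
PASSIVE_PREPOSITIONS = {"被"}


def match_verbal(words, heads, rels, e1, e2):
    """Return a triple if DSNF4 or DSNF5 matches the pair (e1, e2), else None."""
    if rels[e1] != "SBV":
        return None
    pred = heads[e1]
    # DSNF4: Entity1 is the subject and Entity2 the object of the same predicate.
    if rels[e2] == "VOB" and heads[e2] == pred:
        return (words[e1], words[pred], words[e2])
    # DSNF5: Entity2 is the object of a preposition that modifies the predicate.
    prep = heads[e2]
    if rels[e2] == "POB" and rels[prep] == "ADV" and heads[prep] == pred:
        dobj = [i for i, h in enumerate(heads) if h == pred and rels[i] == "VOB"]
        # The relation word is Pred, joined with its direct object when present.
        relation = words[pred] + ("-" + words[dobj[0]] if dobj else "")
        if words[prep] in PASSIVE_PREPOSITIONS:
            return (words[e2], relation, words[e1])  # passive: swap the entities
        return (words[e1], relation, words[e2])
    return None


# Hand-written toy parses with English glosses of the examples in the text.
dsnf4 = match_verbal(["Liu", "visit", "Shanghai"],
                     [1, -1, 1], ["SBV", "HED", "VOB"], 0, 2)
dsnf5 = match_verbal(["Liu", "to", "Shanghai", "make", "in-depth visit"],
                     [3, 3, 1, -1, 3], ["SBV", "ADV", "POB", "HED", "VOB"], 0, 2)
# Hypothetical passive sentence, "Liu 被 Peng criticize", not taken from the text.
passive = match_verbal(["Liu", "被", "Peng", "criticize"],
                       [3, 3, 1, -1], ["SBV", "ADV", "POB", "HED"], 0, 2)
print(dsnf4, dsnf5, passive)
```

Joining the predicate and its direct object with a hyphen follows the "Pred-Dobj" notation of the text; a real parser may attach passive constructions differently from this toy parse.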
Third, Coordination (Coordination Class, COOR)
Coordination is also quite common in Chinese sentences. It is a relation between clauses or phrases that stand side by side or hold at the same time; coordinate components differ only in order, not in primary or secondary status. The things placed in coordination may be different but related things, different aspects of the same thing, or different actions of the same subject. Coordinate phrases (also called parallel phrases) are generally formed by combining two or more nouns, verbs, adjectives, pronouns, quantifiers and so on, and the component words are generally required to have the same part of speech. The words stand side by side and are often joined by an enumeration comma or by a conjunction (Conj) such as "and". Relation extraction mainly considers two kinds: coordinate nouns and coordinate verbs.
For example, in "<PER1>Liu</PER1> and <PER2>Peng</PER2> visit <ORG>Shanghai</ORG>", "Liu" and "Peng" are two words in a coordinate relation. When two entities are coordinated in this way, they play the same role and act jointly on the other components. In the example, the relation triple (Liu, visit, Shanghai) can be extracted; "Peng", the coordinate component of "Liu", also has the "visit" relation with "Shanghai", so the relation tuple (Peng, visit, Shanghai) can be extracted as well. In fact, the COOR syntactic class depends on other syntactic classes: in the example above, the relation tuple (Liu, visit, Shanghai) belongs to the VERB syntactic class. Because the entity "Peng" depends on the entity "Liu" with the coordinate relation, a relation that holds for "Liu" also applies to "Peng". An entity occupies one of three main syntactic positions: subject, object of the predicate, or object of a preposition. Accordingly:
1) When the parallel nouns act as the subject, the relation extraction paradigm DSNF6 is obtained: "Entity2 + Conj + (Entity1++) + Pred + Entity3", where (Entity1++) denotes one or more parallel entities (the same applies below). The triple (Entity1, Pred, Entity3) can be derived from the relation triple (Entity2, Pred, Entity3); the dependency relations are shown in FIG. 7.
2) When the parallel nouns act as the predicate object, the relation extraction paradigm DSNF7 is obtained: "Entity2 + Pred + Entity3 + Conj + (Entity1++)". The triple (Entity2, Pred, Entity1) can be derived from the relation triple (Entity2, Pred, Entity3); the dependency relations are shown in FIG. 8.
3) When the parallel nouns act as the prepositional object, the relation extraction paradigm DSNF8 is obtained: "Entity2 + Prep + Entity3 + Conj + (Entity1++) + Pred (+ Dobj)". The triple (Entity2, Pred-Dobj, Entity1) can be derived from the relation triple (Entity2, Pred-Dobj, Entity3); the dependency relations are shown in FIG. 9.
4) A mixture of the first three types. For example, the sentence "<PER1>a certain Li</PER1> classmate and <PER2>a certain</PER2> classmate, in <ORG1>Shanghai</ORG1> and <ORG2>Hangzhou</ORG2> respectively, invite <PER3>a certain</PER3> classmate and <PER4>a certain Gao</PER4> classmate." mixes the first three types.
Parallel verbs mainly describe two different actions performed at the same time by the same subject. There are two cases:
1) The first case is the serial use of verbs. In Chinese sentences, when a single verb cannot fully describe an action, two verbs are often used together: the first verb supplements the second, and the second is a transitive verb. The second verb, which is closer to the object, is therefore generally extracted as the relation word. For example, in "<PER>a certain Zhang</PER> treads through the snow to visit <LOC>Lushan</LOC>", "treads through the snow" and "visit" form a parallel relation, and the relation (Zhang, visit, Lushan) can be extracted.
2) The second case is the parallel compound sentence, in which several clauses are semantically on an equal, parallel footing. If two or more events stand in a parallel relation with no causal connection between them, they can form a parallel compound sentence. The clauses are often separated by commas and by conjunctions such as "and" or "also". Take the example sentence "<PER>a certain Gao</PER>, manager of <ORG1>a certain company</ORG1>, visits <ORG2>the factory building</ORG2>, and publishes production guidance suggestions in <ORG3>a certain workshop</ORG3>." The comma divides the compound sentence into two clauses that express two events with the same subject, the entity "Gao", so the two clauses are parallel. The predicates "visit" and "publish" of the parallel clauses are themselves parallel, and their dependency relation is the "parallel relation". In terms of dependency syntax: if Entity2 depends on verb 2 as its object, verb 2 is parallel to another verb 1 (the dependency relation is the "parallel relation"), and Entity1 depends on verb 1 as its subject, then it can be inferred that a relation exists between Entity1 and Entity2, with verb 2 as the relation word. This gives the relation extraction paradigm DSNF9: "Entity1 + Pred1 + Pred2 + Entity2"; the dependency analysis is shown in FIG. 10. The paradigm DSNF9 covers both of the above cases.
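The dependency-syntax condition behind DSNF9 can be sketched in a few lines. This is only an illustration, not the patented implementation: the `Token` layout, the function name `match_dsnf9`, the label names `SBV` (subject), `VOB` (object), `COO` (parallel) and `HED` (root), and the toy sentence are all assumptions made for the example.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Token(NamedTuple):
    idx: int   # 1-based position in the sentence
    word: str
    head: int  # index of the governing token, 0 for the root
    rel: str   # dependency label (assumed names: SBV, VOB, COO, HED)


def match_dsnf9(tokens: Sequence[Token], entity1_idx: int,
                entity2_idx: int) -> Optional[Tuple[str, str, str]]:
    """Return (entity1, verb2, entity2) if the DSNF9 condition holds, else None."""
    by_idx = {t.idx: t for t in tokens}
    e1, e2 = by_idx[entity1_idx], by_idx[entity2_idx]
    # Entity1 must depend on verb 1 as its subject.
    if e1.rel != "SBV":
        return None
    # Entity2 must depend on verb 2 as its object.
    if e2.rel != "VOB":
        return None
    verb1 = by_idx.get(e1.head)
    verb2 = by_idx.get(e2.head)
    if verb1 is None or verb2 is None:
        return None
    # Verb 2 must be parallel to verb 1; verb 2 becomes the relation word.
    if verb2.rel == "COO" and verb2.head == verb1.idx:
        return (e1.word, verb2.word, e2.word)
    return None


# Hypothetical parse of a four-token sentence "PersonA treks visits PlaceB".
tokens = [
    Token(1, "PersonA", 2, "SBV"),
    Token(2, "treks", 0, "HED"),
    Token(3, "visits", 2, "COO"),
    Token(4, "PlaceB", 3, "VOB"),
]
print(match_dsnf9(tokens, 1, 4))  # ('PersonA', 'visits', 'PlaceB')
```

Because the check only looks at the labels and heads of the two entities and their governing verbs, the same function handles both the serial-verb case and the parallel-clause case, provided the parser attaches the second verb to the first with a parallel label.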
It is worth mentioning that the parallel structure is nested within other syntax classes. The paradigms DSNF6, DSNF7, DSNF8 and DSNF9 only describe the case in which the parallel components depend on a "subject-predicate verb-object" structure of the VERB syntax class. Other cases are not described in detail, but the actual extraction steps are similar: when Entity1 and Entity2 are in a parallel relation, if the triple (Entity2, relation word, Entity3) holds, the relation triple (Entity1, relation word, Entity3) can be obtained; if the triple (Entity3, relation word, Entity2) holds, the relation triple (Entity3, relation word, Entity1) can be obtained.
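The two derivation rules for parallel entities can be sketched as a small post-processing step over already extracted triples. The function name `propagate_parallel` and the `parallel_of` mapping are hypothetical names introduced for this illustration; the sketch assumes the parallel relations between entities have already been read off the dependency parse.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple

Triple = Tuple[str, str, str]


def propagate_parallel(triples: Iterable[Triple],
                       parallel_of: Dict[str, Sequence[str]]) -> Set[Triple]:
    """Copy each triple to the entities that are parallel to its subject or object.

    parallel_of maps an entity to the entities in a parallel relation with it.
    """
    base = set(triples)
    derived = set(base)
    for subj, rel, obj in base:
        # (Entity2, rel, Entity3) holds => (Entity1, rel, Entity3) for each parallel Entity1.
        for other in parallel_of.get(subj, ()):
            derived.add((other, rel, obj))
        # (Entity3, rel, Entity2) holds => (Entity3, rel, Entity1) for each parallel Entity1.
        for other in parallel_of.get(obj, ()):
            derived.add((subj, rel, other))
    return derived


result = propagate_parallel(
    [("Liu", "visit", "Shanghai")],
    {"Liu": ["Peng"]},
)
print(sorted(result))
```

The sketch deliberately makes a single pass and does not build the full cross product when both the subject and the object have parallel partners, since sentences with "respectively" pair the entities one-to-one rather than all-to-all.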
Fourth, the templated class (Formula Class, FORM)
The FORM class covers expressions that often appear in Chinese and cannot be assigned to any of the preceding syntactic relation classes, but that generally follow a fixed format. For example, in "a certain Wang, professor at a certain university, published ……", no connecting word can be found between "a certain Wang" and "professor at a certain university", and neither directly modifies the other, so the sentence fits none of the types above. Yet the entity relation triple (Wang, professor, a certain university) can be extracted from it. Such expressions are common and reflect Chinese writing habits. For these special grammatical structures, good results can only be obtained by extracting templates and matching them rigidly.
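Rigid template matching of this kind can be sketched with a regular expression over entity-tagged text. The pattern, the function name `match_appositive_title` and the English stand-in sentence are assumptions made for illustration; the source does not specify the concrete templates it uses.

```python
import re
from typing import Optional, Tuple

# Hypothetical template: a tagged person, a comma, a tagged organisation
# immediately followed by a one-word title, then another comma.
APPOSITIVE_TITLE = re.compile(
    r"<PER>(?P<per>[^<]+)</PER>[,,]\s*"
    r"<ORG>(?P<org>[^<]+)</ORG>\s*"
    r"(?P<title>[^,,。.<\s]+)[,,]"
)


def match_appositive_title(tagged: str) -> Optional[Tuple[str, str, str]]:
    """Return (person, title, organisation) if the template matches, else None."""
    m = APPOSITIVE_TITLE.search(tagged)
    if m is None:
        return None
    return (m.group("per"), m.group("title"), m.group("org"))


sentence = "<PER>Wang</PER>, <ORG>X University</ORG> professor, published a paper."
print(match_appositive_title(sentence))  # ('Wang', 'professor', 'X University')
```

The match is intentionally strict: a sentence that deviates from the fixed format returns `None` rather than a partial guess, which is what distinguishes template matching from the dependency-based paradigms.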
Fifth, the other class (Other Class)
All other relation types that cannot currently be distinguished are grouped into this class. Because of its uncertainty, this class is not studied further here.
The invention discloses a Chinese unsupervised open entity relation extraction method based on dependency semantics. It overcomes shortcomings of traditional methods such as heavy reliance on manual annotation and unreasonable results. Building on the distinctive and flexible syntax of Chinese and on the mapping between entity relations and the dependency parse tree, it mines the dependency semantics implied by the shortest dependency path and, using dependency relations, part-of-speech information, positional relations and similar features as constraints, obtains a set of Dependency Semantic Normal Forms (DSNFs). With this paradigm set, entity relations can be extracted quickly and accurately from massive data. The method requires no manual work and no model training corpus, extracts fully automatically, has low computational complexity and high extraction efficiency, and can meet strict real-time requirements. The invention can be widely applied in artificial intelligence fields such as knowledge graphs, intelligent search engines, automatic question answering systems, text mining and machine translation.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (1)

1. A Chinese unsupervised open entity relation extraction method based on dependency semantics, characterized in that the method comprises the following steps:
S1. Preprocess the input text: perform Chinese word segmentation, part-of-speech tagging and dependency parsing on the input text;
S2. Perform named entity recognition on the input text;
S3. Arbitrarily select two of the recognized entities to form a candidate entity pair;
S4. Find the dependency path between the two entities of the candidate entity pair;
S5. Analyze whether the syntactic structure mapped by the dependency path between the two entities of the candidate entity pair matches a paradigm in the dependency semantic paradigm set; if so, extract a word or phrase from the remainder of the input text as the relation word according to the matched paradigm, the extracted relation word and the candidate entity pair forming a relation triple; if not, perform paradigm matching on the next candidate entity pair;
S6. Output the relation triples;
the dependency semantic paradigms comprise a first class, the pre-modification structure class; a second class, the parallel structure class; a third class, the verb-related class; a fourth class, the templated class; and an other class;
the first class, the pre-modification structure class, comprises the combined attributive structure and the structure joined to the head word by the structural particle "的"; the combined attributive structure corresponds to the dependency semantic paradigm "Entity1+AttWord1(+AttWord2)+Entity2", and the structure joined to the head word by the structural particle "的" corresponds to the semantic paradigm "Entity1+的+Noun+Entity2" or "Entity1+的+Entity2+Noun", where Entity1 and Entity2 are an entity pair between which a relation exists, AttWord1 and AttWord2 are different attributive words, and Noun is a noun;
the second class, the parallel structure class, comprises the parallel noun structure and the parallel verb structure;
the parallel noun structure comprises parallel entities as subject, parallel entities as predicate object, parallel entities as prepositional object, and a mixture of the first three; parallel entities as subject correspond to the dependency semantic paradigm "Entity2+Conj+(Entity1++)+Pred+Entity3", parallel entities as predicate object correspond to the dependency semantic paradigm "Entity2+Pred+Entity3+Conj+(Entity1++)", and parallel entities as prepositional object correspond to the dependency semantic paradigm "Entity2+Prep+Entity3+Conj+(Entity1++)+Pred(+Dobj)", where Entity2 and Entity3 are an entity pair between which a relation exists, (Entity1++) denotes one or more parallel entities, Conj is a conjunction, Pred is a predicate, Prep is a preposition and Dobj is a direct object; the parallel verb structure corresponds to the dependency semantic paradigm "Entity1+Pred1+Pred2+Entity2", where Entity1 and Entity2 are an entity pair between which a relation exists, and Pred1 and Pred2 are two predicates in a parallel relation;
the third class, the verb-related class, comprises the subject-predicate verb-object structure and the subject-predicate preposition-object structure; the subject-predicate verb-object structure corresponds to the dependency semantic paradigm "Entity1+Pred+Entity2", and the subject-predicate preposition-object structure corresponds to the dependency semantic paradigm "Entity1+Prep+Entity2+Pred(+Dobj)", where Entity1 and Entity2 are an entity pair between which a relation exists, Pred is a predicate, Prep is a preposition and Dobj is a direct object;
the relation triple has the form (Entity1, RelationWords, Entity2), where Entity1 and Entity2 are an entity pair between which a relation exists, and RelationWords is a word or phrase describing the semantic relation between the entities; the parallel verb structure comprises the serial-verb structure and the parallel compound sentence structure.
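Steps S3 and S4 can be illustrated with a short sketch: enumerate candidate entity pairs, then find the path between two tokens in the dependency tree through their lowest common ancestor. The representation of the parse as a `heads` dictionary (token index to head index, 0 for the root) and the function names are assumptions made for this example, and the sketch assumes the parse is a well-formed tree.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def candidate_pairs(entity_indices: Sequence[int]) -> List[Tuple[int, int]]:
    """S3: every unordered pair of recognised entities is a candidate pair."""
    return list(combinations(entity_indices, 2))


def path_to_root(heads: Dict[int, int], idx: int) -> List[int]:
    """Token indices from idx up to the root (heads[i] is the head of i, 0 = root)."""
    path = [idx]
    while heads[idx] != 0:
        idx = heads[idx]
        path.append(idx)
    return path


def dependency_path(heads: Dict[int, int], a: int, b: int) -> List[int]:
    """S4: the path from token a to token b through their lowest common ancestor."""
    up_a = path_to_root(heads, a)
    up_b = path_to_root(heads, b)
    ancestors_a = set(up_a)
    # The first node on b's upward path that also lies on a's upward path.
    lca = next(n for n in up_b if n in ancestors_a)
    left = up_a[: up_a.index(lca) + 1]   # a ... lca
    right = up_b[: up_b.index(lca)]      # b ... child of lca
    return left + right[::-1]


# Hypothetical four-token parse: 1 -> 2 (root), 3 -> 2, 4 -> 3.
heads = {1: 2, 2: 0, 3: 2, 4: 3}
print(candidate_pairs([1, 4]))        # [(1, 4)]
print(dependency_path(heads, 1, 4))   # [1, 2, 3, 4]
```

In a tree the path through the lowest common ancestor is the only simple path between two tokens, so it is also the shortest one; step S5 would then compare the labels along this path with the paradigm set.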
CN201710285995.4A 2017-04-27 2017-04-27 Chinese unsupervised open type entity relation extraction method based on dependency semantics Active CN107291687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710285995.4A CN107291687B (en) 2017-04-27 2017-04-27 Chinese unsupervised open type entity relation extraction method based on dependency semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710285995.4A CN107291687B (en) 2017-04-27 2017-04-27 Chinese unsupervised open type entity relation extraction method based on dependency semantics

Publications (2)

Publication Number Publication Date
CN107291687A CN107291687A (en) 2017-10-24
CN107291687B true CN107291687B (en) 2021-03-26

Family

ID=60093929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710285995.4A Active CN107291687B (en) 2017-04-27 2017-04-27 Chinese unsupervised open type entity relation extraction method based on dependency semantics

Country Status (1)

Country Link
CN (1) CN107291687B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104134017A (en) * 2014-07-18 2014-11-05 华南理工大学 Protein interaction relationship pair extraction method based on compact character representation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103034693B (en) * 2012-12-03 2016-03-02 哈尔滨工业大学 Open entity and kind identification method thereof
KR101706827B1 (en) * 2014-12-04 2017-02-16 강원대학교산학협력단 Apparatus and method for extracting social relation between entity
CN104933027B (en) * 2015-06-12 2017-10-27 华东师范大学 A kind of open Chinese entity relation extraction method of utilization dependency analysis
CN105138507A (en) * 2015-08-06 2015-12-09 电子科技大学 Pattern self-learning based Chinese open relationship extraction method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104134017A (en) * 2014-07-18 2014-11-05 华南理工大学 Protein interaction relationship pair extraction method based on compact character representation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WebSets: extracting sets of entities from the web using unsupervised information extraction;Bhavana Bharat Dalvi;《Proceedings of the fifth ACM international conference on Web search and data mining》;20120229;第243-252页 *
开放式中文实体关系抽取研究;刘安安;《中国优秀硕士学位论文全文数据库》;20140315(第3期);第I138-1199页 *
自动化构建的中文知识图谱系统;鄂世嘉;《计算机应用》;20160410;第36卷(第4期);第992-1001页 *

Also Published As

Publication number Publication date
CN107291687A (en) 2017-10-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant