CN106250412A

CN106250412A - Knowledge graph construction method based on multi-source entity fusion

Info

Publication number: CN106250412A
Application number: CN201610583823.0A
Authority: CN
Inventors: 鲁伟明; 戴豪; 庄越挺
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2016-07-22
Filing date: 2016-07-22
Publication date: 2016-12-21
Anticipated expiration: 2036-07-22
Also published as: CN106250412B

Abstract

The invention discloses a knowledge graph construction method based on multi-source entity fusion. The method first crawls the three major Chinese encyclopedias, namely Baidu Encyclopedia, Interactive Encyclopedia (Hudong Baike), and Wikipedia, and preprocesses the data, including title synonym extraction, disambiguation page extraction, candidate set extraction, and word segmentation of the text. Then, for the pages in the same candidate set, it computes features for each pair of pages, trains a classifier to compute the similarity between pages, and builds a weighted graph from the similarities. Finally, a mixed linear programming model constrains the relationships between the vertices of the weighted graph; maximizing the objective function gives the connectivity between vertices, and each connected component is treated as one entity, which yields all pages describing the same entity. Introducing candidate sets greatly reduces the scale of the problem, and the mixed linear programming model improves the accuracy of entity fusion.

Description

Knowledge graph construction method based on multi-source entity fusion

Technical field

The invention relates to text similarity calculation methods, and in particular to a knowledge graph construction method based on multi-source entity fusion.

Background

With the rapid development of the Internet, people obtain information and knowledge in increasingly diverse ways, but the massive amount of data is scattered across the Internet, which is a major obstacle to users acquiring knowledge. Building a unified and complete knowledge base is therefore an urgent need.

Many knowledge bases already exist. DBpedia, for example, is a Semantic Web application that extracts structured data from Wikipedia entries to enhance Wikipedia's search capabilities and to link other datasets to Wikipedia. Freebase is a large collaborative knowledge base that integrates many resources from the Web. Like DBpedia, Freebase stores its entries as structured data: all content is formatted, stored, and displayed as triples. The schema is fixed, and entries of the same type contain the same attributes, so data of the same kind can easily be linked together, which makes information queries convenient. Freebase contains tens of millions of topics and thousands of types and properties. However, these knowledge bases are in English, and there is not yet a large, complete knowledge base for Chinese.

Traditional entity matching algorithms for knowledge bases are mainly based on matching pairs of entities and formalize the task as a classification problem. Most of these algorithms, however, depend heavily on the quality of the data schema. Web data is not presented as uniform triples, and data from different sources differs considerably in form, so this approach is poorly suited to the present problem.

Other matching algorithms also use structural information of the pages as features. For example, when matching entities between the Chinese and English Wikipedia, a considerable share of pages already have cross-language links, so this information can serve as prior knowledge. In the present case there are no links between the multi-source data, so structural features of the pages cannot be used.

The Jaccard coefficient can be used to compute features over two sets. It is mainly used to measure the similarity between individuals described by symbolic or Boolean attributes. Because such attributes cannot express the magnitude of a difference, only whether two values are the same, the Jaccard coefficient considers only whether the features shared by the individuals agree. When the Jaccard similarity coefficient of X and Y is computed, only the number of identical elements among X_n and Y_n is counted.
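The Jaccard coefficient described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the inputs are collections of hashable items; the function name `jaccard` and the handling of two empty sets are choices made for this sketch, not something the text specifies.

```python
def jaccard(x, y):
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y| of two collections."""
    x, y = set(x), set(y)
    union = x | y
    # Assumption: two empty sets get similarity 0.0 to avoid dividing by zero.
    if not union:
        return 0.0
    return len(x & y) / len(union)


# Two of the four distinct items are shared, so the coefficient is 2 / 4.
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
```

Only membership matters: how often an item occurs, or how "far apart" two different items are, has no effect on the result.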

Many algorithms can be used to compute feature similarity. The simplest approach is to compute the Euclidean distance or the cosine distance directly. Alternatively, a classifier can be trained on the features and then used to compute the similarity. Random forest is a well-performing classifier that can be used for this purpose. It trains multiple decision trees on the samples, and its predicted class is the mode of the classes predicted by the individual trees. Random forests have several advantages: for example, they can maintain relatively high accuracy when some features are missing, and they are comparatively resistant to overfitting.

Summary of the invention

To integrate multi-source encyclopedic knowledge into a unified knowledge base, the present invention provides a knowledge graph construction method based on multi-source entity fusion. Encyclopedias from different sources usually contain multiple pages describing the same entity; multi-source entity fusion finds these pages in massive data and maps them to the same entity.

The technical solution adopted by the present invention is as follows: a knowledge graph construction method based on multi-source entity fusion, comprising the following steps:

1) Preprocess the encyclopedia pages: extract synonyms of the encyclopedia titles; extract disambiguation pages; use the transitivity of synonyms to build synonym groups, all of which together form the set of synonym groups; build a candidate set from the pages corresponding to each synonym group; and segment the text of the encyclopedia pages with a word segmentation tool.

2) Using the segmentation results of step 1), compute features for each pair of pages in the same candidate set, train a classifier that assigns a different weight to each feature dimension, and use this classifier to compute the similarity between pages.

3) Build the weighted graph of the candidate set from the page similarities computed in step 2), define the objective function of a mixed linear programming model, and maximize it to obtain the connectivity between vertices. Each connected component of the weighted graph is treated as one entity, which yields all pages describing the same entity.

Further, step 1) comprises:

1.1) Extract synonyms of the encyclopedia titles in the following two ways:

a) Template matching: match specific templates against the beginning of each page and the first sentence of its abstract; a successful match yields a synonym pair. The templates are defined manually and cover most patterns in which synonym pairs appear.

b) Link redirection: follow a hyperlink in a page to another page; if the title of the target page differs from the text of the hyperlink, the two terms are regarded as synonyms.

1.2) Extract disambiguation pages: the k-th encyclopedia is denoted ε_k (k is at most 3), where a_i denotes a page and n the total number of pages. From all pages listed on a disambiguation page, a disambiguation page set M is extracted; no two pages in M can represent the same entity.

M = {a_i ∈ ε_k | a_i ∈ M ≠ a_j ∈ M}

1.3) Extract candidate sets: synonymy is transitive, so if A and B are synonyms and A and C are synonyms, then B and C are also synonyms. This yields synonym groups S_t, which together form the set of synonym groups; any two elements of a synonym group are synonyms of each other.

Given S_t, the pages whose titles belong to S_t are collected from all encyclopedia sources; these pages form the candidate set P_t.

P_t = {a ∈ ε_{1,…,K} | a.Title ∈ S_t}

K is the total number of encyclopedias; a.Title is the title of page a.
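The grouping in step 1.3) amounts to taking the transitive closure of the synonym pairs and then filtering pages by title. The sketch below uses a union-find structure for the closure; this is one possible way to do it, and the function names and the dictionary layout of a page (`"source"`, `"title"`) are assumptions made for illustration only.

```python
from collections import defaultdict


def synonym_groups(pairs):
    """Merge synonym pairs into groups S_t using transitivity (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for title in list(parent):
        groups[find(title)].add(title)
    return list(groups.values())


def candidate_set(group, pages):
    """P_t: every page, from any source, whose title belongs to the group S_t."""
    return [p for p in pages if p["title"] in group]


# A~B and A~C imply B~C, so A, B, C end up in one group; D, E form another.
groups = synonym_groups([("A", "B"), ("A", "C"), ("D", "E")])
pages = [
    {"source": "baidu", "title": "A"},
    {"source": "hudong", "title": "C"},
    {"source": "wiki", "title": "D"},
]
print(candidate_set({"A", "B", "C"}, pages))
```

Because similarity is later computed only inside each candidate set, the number of page pairs to compare depends on the size of the groups rather than on the total number of pages.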

1.4) Segment the text of the encyclopedia pages: segment five fields of each page, namely the abstract, the infobox (keys and values), the links, the table of contents, and the user tags, and remove stop words and words shorter than 2 characters.
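The filtering part of step 1.4) can be sketched as follows. The sketch assumes the text has already been split into tokens by some word segmentation tool (the text does not name one), and the function name `clean_tokens` and the sample stop-word set are hypothetical.

```python
def clean_tokens(tokens, stop_words):
    """Drop stop words and tokens shorter than 2 characters."""
    return [t for t in tokens if len(t) >= 2 and t not in stop_words]


# Single-character tokens are removed by the length rule,
# "位于" is removed because it is in the (hypothetical) stop-word set.
print(clean_tokens(["浙江", "大学", "的", "是", "位于"], {"位于"}))
```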

Further, step 2) comprises:

2.1) Define the six fields of a page: title T, abstract A, infobox I, table of contents C, user tags G, and links L. A page is represented by a 6-tuple:

a = {T, A, I, C, G, L}

The infobox consists of key-value pairs, so I = {P, V}, where P denotes the attributes and V the attribute values.

For two pages in the same candidate set, if they describe the same entity, their text overlap will be relatively high. The following seven features are therefore defined:

1) Abstract feature

$f_{a}(a_i, a_j) = \frac{|S_w(a_i.A) \cap S_w(a_j.A)|}{|S_w(a_i.A) \cup S_w(a_j.A)|}$

2) Infobox attribute feature

$f_{p}(a_i, a_j) = \frac{|S_w(a_i.I.P) \cap S_w(a_j.I.P)|}{|S_w(a_i.I.P) \cup S_w(a_j.I.P)|}$

3) Infobox attribute value feature

$f_{v}(a_i, a_j) = \frac{|S_w(a_i.I.V) \cap S_w(a_j.I.V)|}{|S_w(a_i.I.V) \cup S_w(a_j.I.V)|}$

4) Table of contents feature

$f_{C}(a_i, a_j) = \frac{|S_w(a_i.C) \cap S_w(a_j.C)|}{|S_w(a_i.C) \cup S_w(a_j.C)|}$

5) User tag feature

$f_{g}(a_i, a_j) = \frac{|S_w(a_i.G) \cap S_w(a_j.G)|}{|S_w(a_i.G) \cup S_w(a_j.G)|}$

6) Link feature

$f_{l}(a_i, a_j) = \frac{|S_w(a_i.L) \cap S_w(a_j.L)|}{|S_w(a_i.L) \cup S_w(a_j.L)|}$

7) Global feature, where S denotes the string concatenation of the 6-tuple {T, A, I, C, G, L}

$f_{all}(a_i, a_j) = \frac{|S_w(a_i.S) \cap S_w(a_j.S)|}{|S_w(a_i.S) \cup S_w(a_j.S)|}$

S_w(X) denotes the set of words obtained by segmenting the string X.
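The seven features can be assembled into one vector per page pair, as in the sketch below. It assumes each page is a dictionary that maps a field name to its already segmented tokens, with the infobox split into `"P"` and `"V"`; this layout and the function names are illustrative assumptions. The global feature is approximated here by pooling the tokens of all fields, which stands in for segmenting the concatenated string S.

```python
FIELDS = ("A", "P", "V", "C", "G", "L")  # abstract, infobox keys/values, contents, tags, links


def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x | y) else 0.0


def pair_features(page_i, page_j):
    """Return [f_a, f_p, f_v, f_C, f_g, f_l, f_all] for two pages.

    A missing field is treated as an empty token list (assumption).
    """
    feats = [jaccard(page_i.get(f, []), page_j.get(f, [])) for f in FIELDS]
    # Global feature: pool the tokens of the title and all other fields.
    all_i = [t for f in ("T",) + FIELDS for t in page_i.get(f, [])]
    all_j = [t for f in ("T",) + FIELDS for t in page_j.get(f, [])]
    feats.append(jaccard(all_i, all_j))
    return feats


p1 = {"T": ["tomato"], "A": ["red", "vegetable"], "P": ["family"], "V": ["nightshade"]}
p2 = {"T": ["love-apple"], "A": ["red", "fruit"], "P": ["family"], "V": ["nightshade"]}
print(pair_features(p1, p2))
```

Each entry lies between 0 and 1, so the seven values can be passed to a classifier without further scaling.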

2.2) Use the seven features obtained in step 2.1) as classifier input, train a binary classifier with the RandomForest algorithm from the Weka toolkit, and use this binary classifier to predict the similarity between two pages.
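One way a binary forest can yield a graded similarity rather than only a yes/no label is to report the share of trees that vote "same entity". The sketch below illustrates that idea with three hand-written decision stumps; it is not Weka's API, the stumps and thresholds are hypothetical, and whether the original method uses the vote share or another score is not stated in the text.

```python
def forest_similarity(trees, features):
    """Fraction of trees that vote 'same entity' (1) for a feature vector."""
    votes = [tree(features) for tree in trees]
    return sum(votes) / len(votes)


def forest_predict(trees, features):
    """Class label by majority vote, i.e. the mode of the tree outputs."""
    return 1 if forest_similarity(trees, features) > 0.5 else 0


# Three hypothetical one-split trees over the 7-dimensional feature vector.
trees = [
    lambda f: 1 if f[0] > 0.3 else 0,  # abstract overlap
    lambda f: 1 if f[1] > 0.5 else 0,  # infobox attribute overlap
    lambda f: 1 if f[6] > 0.4 else 0,  # global overlap
]

features = [0.33, 1.0, 1.0, 0.0, 0.0, 0.0, 0.2]
print(forest_similarity(trees, features), forest_predict(trees, features))
```

In a trained forest the trees are learned from the manually labeled page pairs instead of being written by hand.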

Further, step 3) specifically comprises the following steps:

3.1) Build the weighted graph of the candidate set from the page similarities computed in step 2); the weight of the edge between two nodes is their similarity. The original problem is thereby converted into the problem of deciding which edges to keep. A binary variable y_ij indicates whether there is an edge between two nodes: y_ij = 1 if there is an edge between a_i and a_j, and y_ij = 0 otherwise.

Penalty terms and constraints are then added to build the mixed linear programming model:

Penalty term 1:

If a_i has an edge to a_j and a_i has an edge to a_k, then there should also be an edge between a_j and a_k; otherwise a penalty term φ is incurred, multiplied by a coefficient u as a tuning parameter. φ is therefore subject to the following constraints:

$y_{ij} + y_{ik} \leq 1 + y_{jk} + \varphi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$

φ_jk ≥ 0
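As a worked instance of this constraint (an illustrative reading, not part of the original text), take three pages for which the edges between a_i and a_j and between a_i and a_k are selected, but the edge between a_j and a_k is not. Substituting y_ij = y_ik = 1 and y_jk = 0 gives

$1 + 1 \leq 1 + 0 + \varphi_{jk} \;\Rightarrow\; \varphi_{jk} \geq 1$

so this open triangle costs at least u once φ_jk is weighted by the coefficient u. If instead y_jk = 1, the constraint reads 2 ≤ 2 + φ_jk, which φ_jk = 0 already satisfies, so a transitively closed triple carries no penalty.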

Penalty term 2:

The higher the similarity between a_i and a_j, the more likely there is an edge between them. If two pages a_i and a_j with low similarity are joined by an edge, the penalty is large; if their similarity is high, the penalty is small. With ψ_ij denoting the penalty term and λ a tuning parameter, the penalty is constrained by:

$\lambda |y_{ij} - sim(a_i, a_j)| \leq \psi_{ij}, \quad \forall a_i, a_j \in P_t$

ψ_ij ≥ 0

sim(a_i, a_j) is the weight between a_i and a_j.
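The absolute value in this constraint does not make the model nonlinear, because y_ij only takes the values 0 and 1. Assuming sim(a_i, a_j) lies in [0, 1] (an assumption; the range of the classifier output is not stated here), the two cases can be folded into one linear expression:

$|y_{ij} - sim(a_i, a_j)| = sim(a_i, a_j) + (1 - 2\, sim(a_i, a_j))\, y_{ij}, \quad y_{ij} \in \{0, 1\}$

For example, with sim(a_i, a_j) = 0.2 the smallest admissible penalty is ψ_ij = 0.8λ if the edge is kept and 0.2λ if it is dropped; with sim(a_i, a_j) = 0.9 it is 0.1λ and 0.9λ respectively.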

Penalty term 3:

For a_i and a_j that appear in the same disambiguation page set M, y_ij = 1 indicates a wrong match, so a penalty term ζ_ij is used to enforce that there is no edge between a_i and a_j. The constraint is:

$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n, \; n = 1, 2, \ldots, N$

ζ_ij ≥ 0

N is the number of disambiguation page sets.

In addition, a threshold τ is set on the similarity: only pages a_i and a_j whose similarity exceeds τ may be joined by an edge.

Combining the penalty terms and the threshold gives the following objective function:

$\begin{matrix} maximize \sum_{a_i, a_j \in P_t} ( y_{ij} * sim(a_i, a_j) - u * \varphi_{ij} - \psi_{ij} ) \\ - \sum_{n=1}^{N} \sum_{a_i, a_j \in M_n} \zeta_{ij} \end{matrix}$

s.t. y_ij ∈ {0,1}, φ_ij, ψ_ij, ζ_ij ≥ 0

$y_{ij} + y_{ik} \leq 1 + y_{jk} + \varphi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$

$sim(a_i, a_j) > y_{ij} * \tau, \quad \forall a_i, a_j \in P_t$

Maximizing the objective function gives the edge variables y_ij corresponding to the maximum.
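For a very small candidate set, the optimization can be illustrated by trying every 0/1 assignment of the edge variables and keeping the best one. The sketch below is such an exhaustive search, not a linear programming solver, and it rests on several assumptions that are labeled in the comments: sums run over unordered pairs, each slack variable takes its smallest feasible value, and ζ_ij is taken to equal y_ij for pairs from a disambiguation set. The function name and parameter names are hypothetical.

```python
from itertools import combinations, product


def solve_edges(n, sim, tau, u, lam, forbidden=frozenset()):
    """Exhaustively maximise the objective over all 0/1 edge assignments.

    sim maps each pair (i, j) with i < j to a similarity; forbidden holds the
    pairs (i < j) that co-occur in a disambiguation page set.
    """
    pairs = list(combinations(range(n), 2))
    best_score, best_edges = float("-inf"), set()
    for bits in product((0, 1), repeat=len(pairs)):
        y = dict(zip(pairs, bits))
        # Threshold constraint: an edge is allowed only if sim > tau.
        if any(y[p] and sim[p] <= tau for p in pairs):
            continue

        def edge(a, b):
            return y[(a, b) if a < b else (b, a)]

        score = 0.0
        for j, k in pairs:
            # Smallest phi_jk with y_ij + y_ik <= 1 + y_jk + phi_jk for every i.
            phi = max([0] + [edge(i, j) + edge(i, k) - 1 - y[(j, k)]
                             for i in range(n) if i not in (j, k)])
            # Smallest psi_jk allowed by lam * |y_jk - sim_jk| <= psi_jk.
            psi = lam * abs(y[(j, k)] - sim[(j, k)])
            score += y[(j, k)] * sim[(j, k)] - u * phi - psi
        # Assumption: zeta_ij = y_ij, so each forbidden edge costs 1.
        score -= sum(y[p] for p in forbidden)
        if score > best_score:
            best_score, best_edges = score, {p for p in pairs if y[p]}
    return best_edges, best_score


sim = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.6}
edges_free, _ = solve_edges(3, sim, tau=0.5, u=0.2, lam=0.5)
edges_dis, _ = solve_edges(3, sim, tau=0.5, u=0.2, lam=0.5, forbidden={(1, 2)})
print(edges_free, edges_dis)
```

With these illustrative parameters all three edges are kept when there is no disambiguation information, while marking pages 1 and 2 as co-listed on a disambiguation page drops the edge between them. The search space doubles with every additional pair, so this approach is only usable for illustration; it also shows why restricting the model to one candidate set at a time matters.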

3.2) Treat each connected component of the weighted graph as one entity, which yields all pages describing that entity.
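Once the edges have been chosen, grouping pages into entities is a standard connected-components computation. The sketch below uses breadth-first search; the function name and the use of plain strings as page identifiers are assumptions made for illustration.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Group nodes into connected components of an undirected graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            v = queue.popleft()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(comp))
    return components


# Pages a, b, c are linked and form one entity; d and e stay on their own.
print(connected_components(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c")]))
```

A page without any selected edge forms a component of size one, that is, an entity described by a single page.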

Compared with the prior art, the method of the present invention has the following beneficial effects:

1. The method uses title synonyms to obtain title candidate sets, derives page candidate sets from them, and computes page similarity only within a page candidate set. This greatly reduces the scale of the problem and simplifies the subsequent steps.

2. Based on the page structure, the method extracts the Jaccard coefficients of seven text features and uses the random forest algorithm to compute the similarity between pages, which reflects how similar the pages are fairly accurately.

3. The method models the similarity between pages on a graph and uses the mixed linear programming model to determine the relationships between vertices, that is, between pages. These relationships define an undirected graph from which all pages describing an entity can be obtained fairly accurately.

Description of drawings

Fig. 1 is the overall flowchart of the present invention;

Fig. 2 is the flowchart of step 2);

Fig. 3 is the flowchart of step 3);

Fig. 4 is the flowchart of step 4).

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.

As shown in Figs. 1 to 4, the knowledge graph construction method based on multi-source entity fusion follows steps 1) to 3) as set out above. For step 1.1), the embodiment adds the following example:

a) Template matching: match specific templates against the beginning of each page and the first sentence of its abstract; a successful match yields a synonym pair. The templates are defined manually and cover most patterns in which synonym pairs appear. For example, on a page that has synonyms, strings such as "A又名B" (A is also known as B), "A别称B" (A is alternatively called B), or "A是B的同义词" (A is a synonym of B) usually appear at the beginning of the page or in the first sentence of the abstract, and regular expression matching on these strings yields a portion of the synonym pairs.
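The template matching in a) can be sketched with regular expressions built from the three example phrases ("又名", "别称", "是…的同义词"). The exact patterns below, including the characters treated as term boundaries, are assumptions made for this sketch; the manually defined templates of the original method are not listed in the text.

```python
import re

# Hypothetical templates derived from the three example phrases.
TEMPLATES = [
    re.compile(r"^(?P<a>[^，,。；\s]+?)(?:又名|别称)(?P<b>[^，,。；\s]+)"),
    re.compile(r"^(?P<a>[^，,。；\s]+?)是(?P<b>[^，,。；\s]+?)的同义词"),
]


def extract_synonym_pair(first_sentence):
    """Return (A, B) if the sentence matches a synonym template, else None."""
    for pattern in TEMPLATES:
        match = pattern.search(first_sentence)
        if match:
            return match.group("a"), match.group("b")
    return None


print(extract_synonym_pair("番茄又名西红柿，是茄科植物。"))
print(extract_synonym_pair("番茄是一种蔬菜。"))
```

Templates of this kind favor precision over coverage, which is consistent with the text combining them with link redirection as a second source of synonym pairs.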

Example

An example is given below to illustrate the implementation steps of the present invention in detail:

(1) The dataset used in the example comes from Baidu Encyclopedia and Interactive Encyclopedia; Baidu Encyclopedia contributes 10,143,321 pages and Interactive Encyclopedia 6,618,544 pages.

(2) For all pages in (1), the page section structure is analyzed; the title, abstract, table of contents, categories, links, infobox, and other information are extracted and stored in a Lucene index. Every field except the title may be empty.

(3) Title synonyms are extracted from all pages in (1), mainly by template matching and link redirection. The extracted synonym pairs are merged into title synonym sets, which are matched against the page titles in (1) to obtain the candidate set pages.

(4) For the candidate set pages obtained in (3), pairwise features between pages are extracted and used as input to train a random forest classifier. This step requires a manually labeled training set.

(5) Based on the similarity matrix obtained in step (4), a mixed linear programming model is built. The model yields the relationship between each pair of vertices: 1 means there is an edge between the two vertices, 0 means there is none. These vertices and edges define an undirected graph. Each connected component of the undirected graph is extracted; the pages in one connected component represent one entity.

Results of running this example:

For the similarity calculation, five methods were compared, and the random forest classifier gave the best results. Using four evaluation metrics (Precision, Recall, F1, and Accuracy), the method used in the present invention (SCM) was compared with other methods, namely greedy matching (GA), hierarchical clustering (AC), minimum spanning tree clustering (MSTC), and co-clustering (CC). The results are shown in the following table:

Method   Precision   Recall   F1      Accuracy
GA       78.3%       76.1%    77.2%   91.6%
AC       73.0%       79.0%    75.9%   91.5%
MSTC     63.4%       80.5%    71%     88.8%
CC       62.4%       65.5%    63.9%   87.4%
SCM      75.8%       82.5%    79.0%   92.5%

The table shows that this method outperforms the other methods on both F1 and Accuracy. The method therefore has good practical value and application prospects for entity matching.
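The F1 column of the table can be cross-checked from the Precision and Recall columns, since F1 is their harmonic mean. The short sketch below recomputes it for each method using the values copied from the table; the function and variable names are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (Precision, Recall, reported F1) in percent, as listed in the table.
rows = {
    "GA": (78.3, 76.1, 77.2),
    "AC": (73.0, 79.0, 75.9),
    "MSTC": (63.4, 80.5, 71.0),
    "CC": (62.4, 65.5, 63.9),
    "SCM": (75.8, 82.5, 79.0),
}

for name, (p, r, reported) in rows.items():
    print(name, round(f1(p, r), 1), reported)
```

The recomputed values agree with the reported F1 figures to within rounding. SCM has the highest Recall and F1, while GA has the highest Precision.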

Claims

1. A knowledge graph construction method based on multi-source entity fusion, characterized in that it comprises the following steps:

1) Preprocess the encyclopedia pages: extract synonyms of the encyclopedia titles; extract disambiguation pages; use the transitivity of synonyms to build synonym groups, all of which together form the set of synonym groups; build a candidate set from the pages corresponding to each synonym group; and segment the text of the encyclopedia pages with a word segmentation tool.

2) Using the segmentation results of step 1), compute features for each pair of pages in the same candidate set, train a classifier that assigns a different weight to each feature dimension, and use this classifier to compute the similarity between pages.

3) Build the weighted graph of the candidate set from the page similarities computed in step 2), define the objective function of a mixed linear programming model, and maximize it to obtain the connectivity between vertices. Each connected component of the weighted graph is treated as one entity, which yields all pages describing the same entity.

2. The knowledge graph construction method based on multi-source entity fusion according to claim 1, characterized in that step 1) comprises:

1.1) Extract synonyms of the encyclopedia titles in the following two ways:

a) Template matching: match specific templates against the beginning of each page and the first sentence of its abstract; a successful match yields a synonym pair. The templates are defined manually and cover most patterns in which synonym pairs appear.

b) Link redirection: follow a hyperlink in a page to another page; if the title of the target page differs from the text of the hyperlink, the two terms are regarded as synonyms.

1.2) Extract disambiguation pages: the k-th encyclopedia is denoted ε_k (k is at most 3), where a_i denotes a page and n the total number of pages. From all pages listed on a disambiguation page, a disambiguation page set M is extracted; no two pages in M can represent the same entity.

M = {a_i ∈ ε_k | a_i ∈ M ≠ a_j ∈ M}

1.3) Extract candidate sets: synonymy is transitive, so if A and B are synonyms and A and C are synonyms, then B and C are also synonyms. This yields synonym groups S_t, which together form the set of synonym groups; any two elements of a synonym group are synonyms of each other.

Given S_t, the pages whose titles belong to S_t are collected from all encyclopedia sources; these pages form the candidate set P_t.

P_t = {a ∈ ε_{1,…,K} | a.Title ∈ S_t}

K is the total number of encyclopedias; a.Title is the title of page a.

1.4) Segment the text of the encyclopedia pages: segment five fields of each page, namely the abstract, the infobox (keys and values), the links, the table of contents, and the user tags, and remove stop words and words shorter than 2 characters.

3. according to a kind of knowledge map construction method based on multi-source entity fusion described in claim 1, it is characterized in that, described step 2) comprises:

2.1) Define 6 domains contained in a page, including title T, abstract A, information box I, directory C, user label G and link L, and use a 6-tuple to represent a page:

a={T,A,I,C,G,L}

The information box is represented as a key-value pair, so I={P,V}, wherein P represents an attribute, and V represents an attribute value;

For two pages belonging to the same candidate set, if they describe an entity, their text overlap rate will be relatively large, so the following seven features are defined, as follows:

1) Summary features

{f f}_{a a} (({a a}_{i i},, {a a}_{j j})) = = \frac{| | {S S}_{w w} (({a a}_{i i} . . A A)) \cap \cap {S S}_{w w} (({a a}_{j j} . . A A)) | |}{| | {S S}_{w w} (({a a}_{i i} . . A A)) \cup \cup {S S}_{w w} (({a a}_{j j} . . A A)) | |}

2) Information box attribute characteristics

{f f}_{p p} (({a a}_{i i},, {a a}_{j j})) = = \frac{| | {S S}_{w w} (({a a}_{i i} . . I I . . P P)) \cap \cap {S S}_{w w} (({a a}_{j j} . . I I . . P P)) | |}{| | {S S}_{w w} (({a a}_{i i} . . I I . . P P)) \cup \cup {S S}_{w w} (({a a}_{j j} . . I I . . P P)) | |}

3) Information box attribute value feature

{f f}_{v v} (({a a}_{i i},, {a a}_{j j})) = = \frac{| | {S S}_{w w} (({a a}_{i i} . . I I . . V V)) \cap \cap {S S}_{w w} (({a a}_{j j} . . I I . . V V)) | |}{| | {S S}_{w w} (({a a}_{i i} . . I I . . V V)) \cup \cup {S S}_{w w} (({a a}_{j j} . . I I . . V V)) | |}

4) Directory Features

{f f}_{C C} (({a a}_{i i},, {a a}_{j j})) = = \frac{| | {S S}_{w w} (({a a}_{i i} . . C C)) \cap \cap {S S}_{w w} (({a a}_{j j} . . C C)) | |}{| | {S S}_{w w} (({a a}_{i i} . . C C)) \cup \cup {S S}_{w w} (({a a}_{j j} . . C C)) | |}

5) User label features

{f f}_{g g} (({a a}_{i i},, {a a}_{j j})) = = \frac{| | {S S}_{w w} (({a a}_{i i} . . G G)) \cap \cap {S S}_{w w} (({a a}_{j j} . . G G)) | |}{| | {S S}_{w w} (({a a}_{i i} . . G G)) \cup \cup {S S}_{w w} (({a a}_{j j} . . G G)) | |}

6) Link feature

f_l(a_i, a_j) = \frac{|S_w(a_i.L) \cap S_w(a_j.L)|}{|S_w(a_i.L) \cup S_w(a_j.L)|}

7) Global feature, where S denotes the string concatenation of the 6-tuple {T, A, I, C, G, L}

f_{all}(a_i, a_j) = \frac{|S_w(a_i.S) \cap S_w(a_j.S)|}{|S_w(a_i.S) \cup S_w(a_j.S)|}

S_w(X) denotes the set of words obtained by segmenting the character string X.
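The seven features above are all Jaccard overlaps of word sets, so they can be sketched with one helper. The following is a minimal illustration, not the claimed implementation: the `Page` class, the field names `P` and `V` (standing for I.P and I.V), and the whitespace-based `tokenize` are assumptions made for the example. A real system would use a Chinese word segmenter for S_w, and the value returned for two empty fields is also an assumed convention.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One encyclopedia page as the 6-tuple {T, A, I, C, G, L} (hypothetical layout)."""
    T: str = ""  # title
    A: str = ""  # abstract
    P: str = ""  # information box attributes (I.P)
    V: str = ""  # information box attribute values (I.V)
    C: str = ""  # directory
    G: str = ""  # user labels
    L: str = ""  # links

    @property
    def S(self) -> str:
        # Concatenation of all fields, used by the global feature f_all.
        return " ".join([self.T, self.A, self.P, self.V, self.C, self.G, self.L])


def tokenize(text: str) -> set:
    # Stand-in for S_w: a whitespace split replaces real word segmentation.
    return set(text.split())


def jaccard(x: str, y: str) -> float:
    sx, sy = tokenize(x), tokenize(y)
    union = sx | sy
    # Assumption: two empty fields yield 0.0 instead of an undefined 0/0.
    return len(sx & sy) / len(union) if union else 0.0


def features(ai: Page, aj: Page) -> list:
    # Order of the result: f_a, f_p, f_v, f_C, f_g, f_l, f_all
    return [jaccard(getattr(ai, k), getattr(aj, k))
            for k in ("A", "P", "V", "C", "G", "L", "S")]
```

Calling `features(p1, p2)` on two pages of the same candidate set returns the 7-dimensional vector that step 2.2) feeds to the classifier.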

2.2) Using the 7 features obtained in step 2.1) as the classifier input, train a binary classifier with the RandomForest algorithm in the Weka package, and then use this binary classifier to predict the similarity between two pages.

4. The knowledge map construction method based on multi-source entity fusion according to claim 1, characterized in that said step 3) specifically comprises the following steps:

3.1) Construct the weight graph of the candidate set from the similarities between pages calculated in step 2), where the weight of the edge between two nodes is their similarity. The original problem is thereby transformed into an edge selection problem. Use y_ij to indicate whether there is an edge between two nodes a_i and a_j: y_ij = 1 if there is an edge, and y_ij = 0 otherwise.

Further penalty terms and constraints are added to build a mixed linear programming model:

Penalty item 1:

If there is an edge between a_i and a_j, and there is an edge between a_i and a_k, then there should also be an edge between a_j and a_k; otherwise a penalty term φ is added, multiplied by the coefficient u as an adjustment parameter. φ is therefore subject to the following constraints:

y_{ij} + y_{ik} \leq 1 + y_{jk} + \varphi_{jk}, \quad \forall a_i, a_j, a_k \in P_t

φ_jk ≥ 0
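As a worked instance of this constraint, consider a triple in which a_i is linked to both a_j and a_k but the edge between a_j and a_k is missing. Substituting these values shows that the penalty must then be at least 1, whereas it can stay at 0 whenever the triangle is closed:

```latex
y_{ij} = 1,\; y_{ik} = 1,\; y_{jk} = 0
\;\Longrightarrow\;
1 + 1 \leq 1 + 0 + \varphi_{jk}
\;\Longrightarrow\;
\varphi_{jk} \geq 1
```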

Penalty item 2:

The higher the similarity between a_i and a_j, the greater the probability of an edge between them. If there is an edge between two pages a_i and a_j with very low similarity, the penalty term is large; the greater the similarity between a_i and a_j, the smaller the penalty term. Therefore, ψ_ij is used to denote the penalty term and λ the adjustment parameter, and the penalty term is constrained by the following formula:

\lambda |y_{ij} - \mathrm{sim}(a_i, a_j)| \leq \psi_{ij}, \quad \forall a_i, a_j \in P_t

ψ_ij ≥ 0

sim(a_i, a_j) is the weight between a_i and a_j;
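The absolute value in this constraint is not itself linear. Assuming λ ≥ 0 (an assumption not stated in the text), it can be rewritten as an equivalent pair of linear inequalities, which is the form a linear programming solver would typically accept:

```latex
\lambda \left( y_{ij} - \mathrm{sim}(a_i, a_j) \right) \leq \psi_{ij},
\qquad
\lambda \left( \mathrm{sim}(a_i, a_j) - y_{ij} \right) \leq \psi_{ij}
```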

Penalty item 3:

For a_i and a_j appearing in the same disambiguation page set M_n, y_ij equal to 1 indicates a matching error, so a penalty term ζ_ij is needed to constrain a_i and a_j to have no edge between them. This constraint is expressed by the following formula:

y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n, \; n = 1, 2, \ldots, N

ζ_ij ≥ 0

N is the number of disambiguation page sets;

In addition, a threshold τ is set for the similarity, and only pages a_i and a_j whose similarity is greater than the threshold τ can have an edge.

Combining the above penalty items and thresholds, the objective function is obtained as follows:

\begin{matrix} \text{maximize} \sum_{a_i, a_j \in P_t} \left( y_{ij} \cdot \mathrm{sim}(a_i, a_j) - u \cdot \varphi_{ij} - \psi_{ij} \right) \\ - \sum_{n=1}^{N} \sum_{a_i, a_j \in M_n} \zeta_{ij} \end{matrix}

s.t. y_ij ∈ {0,1}, φ_ij, ψ_ij, ζ_ij ≥ 0

y_{ij} + y_{ik} \leq 1 + y_{jk} + \varphi_{jk}, \quad \forall a_i, a_j, a_k \in P_t

\lambda |y_{ij} - \mathrm{sim}(a_i, a_j)| \leq \psi_{ij}, \quad \forall a_i, a_j \in P_t

\mathrm{sim}(a_i, a_j) > y_{ij} \cdot \tau, \quad \forall a_i, a_j \in P_t

y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n, \; n = 1, 2, \ldots, N

Solve for the maximum value of the objective function, thereby obtaining the edge variables y_ij corresponding to that maximum.
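For a very small candidate set, the optimization can be illustrated by exhaustive search instead of a mixed linear programming solver. The sketch below is only an illustration under stated assumptions: the function name `solve_edges` and its parameters are hypothetical, sums run over unordered pairs, each penalty variable is set to the smallest value its constraint allows, and the strict inequality y_ij < ζ_ij is read non-strictly so that ζ_ij = y_ij.

```python
from collections import Counter
from itertools import combinations, product


def solve_edges(n, sim, disamb_sets, u=1.0, lam=1.0, tau=0.5):
    """Exhaustively maximize the objective over all edge selections y.

    n: number of pages in the candidate set P_t
    sim: dict mapping a pair (i, j) with i < j to sim(a_i, a_j)
    disamb_sets: list of sets of page indices (the sets M_n)
    """
    pairs = list(combinations(range(n), 2))
    # How many disambiguation page sets contain each pair.
    conflict = Counter(p for m in disamb_sets
                       for p in combinations(sorted(m), 2))
    # Threshold constraint: only pairs with sim > tau may receive an edge.
    allowed = [p for p in pairs if sim[p] > tau]

    def edge(y, i, j):
        return y.get((min(i, j), max(i, j)), 0)

    best_value, best_y = float("-inf"), {}
    for bits in product((0, 1), repeat=len(allowed)):
        y = dict(zip(allowed, bits))
        value = 0.0
        for (j, k) in pairs:
            y_jk = edge(y, j, k)
            # Smallest phi_jk with y_ij + y_ik <= 1 + y_jk + phi_jk for all i.
            phi = max([0] + [edge(y, i, j) + edge(y, i, k) - 1 - y_jk
                             for i in range(n) if i not in (j, k)])
            # Smallest psi_jk with lam * |y_jk - sim| <= psi_jk.
            psi = lam * abs(y_jk - sim[(j, k)])
            value += y_jk * sim[(j, k)] - u * phi - psi
            # Assumption: zeta_jk = y_jk (non-strict reading of y < zeta).
            value -= conflict[(j, k)] * y_jk
        if value > best_value:
            best_value, best_y = value, y
    return best_value, {p for p, v in best_y.items() if v == 1}
```

The search is exponential in the number of allowed pairs, so it is only practical for toy inputs; it serves to make the role of each penalty term concrete.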

3.2) Treat each connected component in the weight graph as an entity, thereby obtaining all pages that describe the same entity.
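Extracting the connected components from the selected edges can be sketched with a union-find structure. This is one common way to do it, not necessarily the one used in the invention; the function name and the integer page indices are assumptions for the example.

```python
def connected_components(n, edges):
    """Group page indices 0..n-1 into entities using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the two components joined by each selected edge (y_ij = 1).
    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())
```

Each returned list holds the indices of the pages that are merged into one entity; a page with no selected edge forms an entity on its own.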