
CN106250412A - Knowledge graph construction method based on multi-source entity fusion - Google Patents

Info

Publication number
CN106250412A
Authority
CN
China
Prior art keywords
pages
page
similarity
entity
synonyms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610583823.0A
Other languages
Chinese (zh)
Other versions
CN106250412B (en
Inventor
鲁伟明
戴豪
庄越挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201610583823.0A
Publication of CN106250412A
Application granted
Publication of CN106250412B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a knowledge graph construction method based on multi-source entity fusion. The method first crawls three major Chinese-language encyclopedias (Baidu Baike, Hudong Baike, and Wikipedia) and preprocesses the data, including title synonym extraction, disambiguation page extraction, candidate set extraction, and word segmentation of the text. Then, for the pages in the same candidate set, it computes features between each pair of pages, trains a classifier to compute the similarity between pages, and builds a weighted graph from the similarities. Finally, a mixed linear programming model constrains the relationships between vertices of the weighted graph; maximizing the objective function yields the connectivity between vertices, and each connected component is treated as one entity, which gives all pages describing the same entity. Introducing candidate sets greatly reduces the scale of the problem, and the mixed linear programming model improves the accuracy of entity fusion.

Description

Knowledge graph construction method based on multi-source entity fusion

Technical field

The invention relates to text similarity calculation methods, and in particular to a knowledge graph construction method based on multi-source entity fusion.

Background art

With the rapid development of the Internet, people obtain information and knowledge through increasingly diverse channels. However, massive amounts of data are scattered across every corner of the Internet, which is a major obstacle for users seeking knowledge. Building a unified and complete knowledge base is therefore an urgent task.

Many knowledge bases already exist. DBpedia, for example, is a notable Semantic Web application: it extracts structured data from Wikipedia entries to strengthen Wikipedia's search capabilities and links other data sets to Wikipedia. Freebase is a large collaborative knowledge base that integrates many resources on the Web. Like DBpedia, Freebase stores its entries as structured data: all content is formatted, and is stored and displayed as triples. The schema is fixed, and entries of the same type share the same attributes. As a result, data of the same type can be linked together easily, which makes information queries convenient. Freebase contains tens of millions of topics and thousands of types and properties. However, these knowledge bases are all in English, and there is not yet a large and complete knowledge base for Chinese.

Traditional entity matching algorithms for knowledge bases are mainly based on matching pairs of entities, and formalize the task as a classification problem. However, most of these algorithms depend heavily on the quality of the data schema. Web data is not presented as uniform triples, and data from different sources differs considerably in how it is expressed, so this approach is poorly suited to the present problem.

Other matching algorithms also take the structural information of pages into account as features. For example, when matching entities between the Chinese and English Wikipedias, a considerable share of pages already have cross-language links, so this information can serve as prior knowledge. In the present case, however, there are no links at all between the data sources, so structural features of pages cannot be used.

The Jaccard coefficient can be used to compute features between two sets. It is mainly used to measure the similarity between individuals described by symbolic or Boolean attributes. Because such attributes are symbolic or Boolean, the magnitude of a difference cannot be measured; the only available result is whether two values are the same. The Jaccard coefficient therefore only considers whether the features shared by the individuals agree. When computing the Jaccard similarity coefficient of X and Y, only the number of positions n at which X_n and Y_n are the same is counted.

Many algorithms can be applied to compute feature similarity. The simplest options are to compute the Euclidean distance or the cosine distance directly. Alternatively, a classifier can be trained on the features and then used to compute similarity. Random forest is a well-performing classifier that can be used for this purpose. It trains multiple decision trees on the samples and predicts with them, and its output class is the mode of the classes output by the individual trees. Random forests have several advantages: for example, they can maintain high accuracy when some features are missing, and they are relatively resistant to overfitting.

Summary of the invention

To integrate encyclopedic knowledge from multiple sources and build a unified knowledge base, the invention provides a knowledge graph construction method based on multi-source entity fusion. Encyclopedias from different sources usually contain multiple pages describing the same entity. Multi-source entity fusion finds these pages in the massive data and maps them to the same entity.

The technical solution adopted by the invention is a knowledge graph construction method based on multi-source entity fusion, comprising the following steps:

1) Preprocess the encyclopedia pages: extract synonyms of encyclopedia titles; extract disambiguation pages; use the transitivity of synonymy to build synonym groups, all of which together form the set of synonym groups; build a candidate set from the pages corresponding to each synonym group in that set; and segment the text of the encyclopedia pages with a word segmentation tool.

2) Using the segmentation results of step 1), compute features between each pair of pages in the same candidate set. A classifier is trained so that each feature dimension receives a different weight, and the classifier is then used to compute the similarity between pages.

3) Build the weighted graph of the candidate set from the page similarities computed in step 2). Define the objective function of a mixed linear programming model and compute its maximum to obtain the connectivity between vertices. Treat each connected component of the weighted graph as one entity, which gives all pages describing the same entity.

Further, step 1) includes:

1.1) Extract synonyms of encyclopedia titles in the following two ways:

a) Template matching: match specific templates against the beginning of each page and the first sentence of its abstract. A successful match yields a synonym pair. The templates are defined manually and cover most of the patterns in which synonym pairs occur.

b) Link redirection: follow a hyperlink in a page to another page. If the title of the target page differs from the text of the hyperlink, the two terms are considered synonyms.

1.2) Extract disambiguation pages: the k-th encyclopedia is denoted $\varepsilon_k = \{a_1, \dots, a_n\}$, where the maximum value of k is 3, $a_i$ denotes a page, and n denotes the total number of pages. From all pages listed on a disambiguation page, a disambiguation page set M can be extracted; no two pages in M can represent the same entity.

$$M = \{a_i \in \varepsilon_k \mid a_i \in M \neq a_j \in M\}$$

1.3) Extract candidate sets: synonymy is transitive, so if A and B are synonyms and A and C are synonyms, then B and C are also synonyms. This yields synonym groups $S_t$; all synonym groups $S_t$ together form the set of synonym groups, and any two elements within a synonym group are synonyms of each other.

Given $S_t$, find the pages from all encyclopedia sources whose titles belong to $S_t$; these pages form the candidate set $P_t$.

$$P_t = \{a \in \varepsilon_{1,\dots,K} \mid a.Title \in S_t\}$$

K is the total number of encyclopedias; a.Title is the title of page a.
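The grouping in step 1.3) can be sketched with a union-find structure. This is only an illustration under stated assumptions: the function names, the `(source, title)` page representation, and the sample titles are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def build_synonym_groups(pairs):
    """Merge synonym pairs into groups S_t using the transitivity of synonymy."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for title in list(parent):
        groups[find(title)].add(title)
    return list(groups.values())


def build_candidate_sets(groups, pages):
    """Collect, for each group S_t, the pages whose title is in S_t (the set P_t).

    `pages` is an iterable of hypothetical (source, title) tuples.
    """
    title_to_group = {t: i for i, g in enumerate(groups) for t in g}
    candidates = defaultdict(list)
    for source, title in pages:
        if title in title_to_group:
            candidates[title_to_group[title]].append((source, title))
    return [candidates[i] for i in range(len(groups))]


# Made-up sample data for illustration only.
pairs = [("番茄", "西红柿"), ("番茄", "洋柿子"), ("土豆", "马铃薯")]
groups = build_synonym_groups(pairs)
pages = [("baidu", "番茄"), ("hudong", "西红柿"), ("wiki", "马铃薯"), ("baidu", "苹果")]
candidate_sets = build_candidate_sets(groups, pages)
```

Pages whose title belongs to no synonym group (here the page titled 苹果) are simply left out of this sketch; how the patent handles such pages is not stated in this section.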

1.4) Segment the text of the encyclopedia pages: segment five fields of each page (the abstract, the infobox keys and values, the links, the table of contents, and the user tags), then remove stop words and words shorter than 2 characters.
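The filtering part of step 1.4) might look like the following minimal sketch. It assumes the text has already been split into tokens by some word segmentation tool (the patent does not name one here); the stop-word list and the sample tokens are made up.

```python
def filter_tokens(tokens, stop_words):
    """Drop stop words and tokens shorter than two characters."""
    return [t for t in tokens if len(t) >= 2 and t not in stop_words]


STOP_WORDS = {"一种", "以及"}  # hypothetical stop-word list
tokens = ["番茄", "是", "一种", "茄科", "植物"]  # assumed segmenter output
kept = filter_tokens(tokens, STOP_WORDS)
```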

Further, step 2) includes:

2.1) Define the six fields of a page: title T, abstract A, infobox I, table of contents C, user tags G, and links L. A page is represented as a 6-tuple:

$$a = \{T, A, I, C, G, L\}$$

The infobox consists of key-value pairs, so $I = \{P, V\}$, where P denotes the attributes and V denotes the attribute values.

If two pages in the same candidate set describe the same entity, their text overlap will be relatively high. The following seven features are therefore defined:

1) Abstract feature

$$f_a(a_i, a_j) = \frac{|S_w(a_i.A) \cap S_w(a_j.A)|}{|S_w(a_i.A) \cup S_w(a_j.A)|}$$

2) Infobox attribute feature

$$f_p(a_i, a_j) = \frac{|S_w(a_i.I.P) \cap S_w(a_j.I.P)|}{|S_w(a_i.I.P) \cup S_w(a_j.I.P)|}$$

3) Infobox attribute value feature

$$f_v(a_i, a_j) = \frac{|S_w(a_i.I.V) \cap S_w(a_j.I.V)|}{|S_w(a_i.I.V) \cup S_w(a_j.I.V)|}$$

4) Table of contents feature

$$f_C(a_i, a_j) = \frac{|S_w(a_i.C) \cap S_w(a_j.C)|}{|S_w(a_i.C) \cup S_w(a_j.C)|}$$

5) User tag feature

$$f_g(a_i, a_j) = \frac{|S_w(a_i.G) \cap S_w(a_j.G)|}{|S_w(a_i.G) \cup S_w(a_j.G)|}$$

6) Link feature

$$f_l(a_i, a_j) = \frac{|S_w(a_i.L) \cap S_w(a_j.L)|}{|S_w(a_i.L) \cup S_w(a_j.L)|}$$

7) Global feature, where S denotes the string concatenation of the 6-tuple {T, A, I, C, G, L}

$$f_{all}(a_i, a_j) = \frac{|S_w(a_i.S) \cap S_w(a_j.S)|}{|S_w(a_i.S) \cup S_w(a_j.S)|}$$

$S_w(X)$ denotes the set of words obtained by segmenting the string X.
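The seven Jaccard features can be sketched as follows. The page representation (a dict of already-segmented token lists keyed by field) and the sample pages are assumptions for illustration. Two simplifications are made: an empty union yields 0.0, and the global feature uses the union of the per-field tokens instead of re-segmenting the concatenated string S.

```python
def jaccard(x, y):
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y|; returns 0.0 for two empty sets (assumption)."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


# Feature name -> page field. P and V are the infobox attributes and values.
FIELDS = {"f_a": "A", "f_p": "P", "f_v": "V", "f_C": "C", "f_g": "G", "f_l": "L"}


def pair_features(page_i, page_j):
    """Compute the seven features for a pair of pages from the same candidate set."""
    feats = {name: jaccard(page_i[key], page_j[key]) for name, key in FIELDS.items()}
    all_keys = ("T", *FIELDS.values())
    all_i = [t for key in all_keys for t in page_i[key]]
    all_j = [t for key in all_keys for t in page_j[key]]
    feats["f_all"] = jaccard(all_i, all_j)  # simplified global feature
    return feats


# Made-up pages for illustration only.
p1 = {"T": ["番茄"], "A": ["番茄", "茄科", "植物"], "P": ["学名", "科"],
      "V": ["茄科"], "C": ["形态", "栽培"], "G": ["蔬菜"], "L": ["茄科"]}
p2 = {"T": ["西红柿"], "A": ["西红柿", "茄科", "植物", "果实"], "P": ["学名"],
      "V": ["茄科"], "C": ["栽培"], "G": ["蔬菜", "水果"], "L": []}
features = pair_features(p1, p2)
```

The resulting seven-dimensional vector is what step 2.2) would feed to the classifier.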

2.2) Use the seven features obtained in step 2.1) as the classifier input, train a binary classifier with the RandomForest algorithm from the Weka package, and then use this binary classifier to predict the similarity between two pages.

Further, step 3) specifically includes the following steps:

3.1) Build the weighted graph of the candidate set from the page similarities computed in step 2); the weight of the edge between two nodes is their similarity. The original problem thus becomes a problem of deciding which edges to keep. Let $y_{ij}$ indicate whether there is an edge between two nodes:

$$y_{ij} = \begin{cases} 1, & \text{if there is an edge between } a_i \text{ and } a_j \\ 0, & \text{otherwise} \end{cases}$$

Penalty terms and constraints are then added to build the mixed linear programming model:

Penalty term 1:

If there is an edge between $a_i$ and $a_j$ and an edge between $a_i$ and $a_k$, then there should also be an edge between $a_j$ and $a_k$; otherwise a penalty term $\varphi$ is added, multiplied by a coefficient u as a tuning parameter. The constraint on $\varphi$ is:

$$y_{ij} + y_{ik} \le 1 + y_{jk} + \varphi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$$

$$\varphi_{jk} \ge 0$$

Penalty term 2:

The higher the similarity between $a_i$ and $a_j$, the more likely there is an edge between them. For two pages $a_i$ and $a_j$ with very low similarity, the penalty for having an edge between them is large; if their similarity is high, the penalty is small. With $\psi_{ij}$ denoting the penalty term and $\lambda$ the tuning parameter, the penalty term is constrained by:

$$\lambda \, |y_{ij} - sim(a_i, a_j)| \le \psi_{ij}, \quad \forall a_i, a_j \in P_t$$

$$\psi_{ij} \ge 0$$

$sim(a_i, a_j)$ is the weight between $a_i$ and $a_j$.

Penalty term 3:

For $a_i$ and $a_j$ that appear in the same disambiguation page set M, $y_{ij} = 1$ indicates a matching error, so a penalty term $\zeta_{ij}$ is used to discourage an edge between $a_i$ and $a_j$. The constraint is:

$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n, \; n = 1, 2, \dots, N$$

$$\zeta_{ij} \ge 0$$

N is the number of disambiguation page sets.

In addition, a threshold $\tau$ is set on the similarity: only pages $a_i$ and $a_j$ whose similarity exceeds $\tau$ can have an edge between them.

Combining the penalty terms and the threshold gives the following objective function:

$$\text{maximize} \sum_{a_i, a_j \in P_t} \left( y_{ij} \cdot sim(a_i, a_j) - u \cdot \varphi_{ij} - \psi_{ij} \right) - \sum_{n=1}^{N} \sum_{a_i, a_j \in M_n} \zeta_{ij}$$

$$\text{s.t.} \quad y_{ij} \in \{0, 1\}, \quad \varphi_{ij}, \psi_{ij}, \zeta_{ij} \ge 0$$

$$y_{ij} + y_{ik} \le 1 + y_{jk} + \varphi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$$

$$\lambda \, |y_{ij} - sim(a_i, a_j)| \le \psi_{ij}, \quad \forall a_i, a_j \in P_t$$

$$sim(a_i, a_j) > y_{ij} \cdot \tau, \quad \forall a_i, a_j \in P_t$$

$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n, \; n = 1, 2, \dots, N$$

Solve for the maximum of the objective function to obtain the corresponding edge variables $y_{ij}$.
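To make the objective concrete, the sketch below maximizes it by exhaustive enumeration over the binary edge variables of a tiny candidate set. It is not a mixed linear programming solver and would not scale; it only illustrates how the terms interact. Assumptions: each slack variable takes its smallest feasible value (which is optimal when u ≥ 0, since slacks are only penalized), the strict constraint y < ζ is relaxed to ζ ≥ y, and the values of u, λ, τ and the similarities are made up.

```python
from itertools import combinations, product


def solve_edges(n_pages, sim, disamb_sets, u=1.0, lam=0.5, tau=0.5):
    """Exhaustively maximize the objective over y_ij in {0, 1} (tiny inputs only).

    sim maps (i, j) with i < j to a similarity; disamb_sets is a list of sets
    of page indices that appear together on a disambiguation page.
    """
    pairs = list(combinations(range(n_pages), 2))
    # Threshold constraint: y_ij may be 1 only if sim(a_i, a_j) > tau.
    allowed = [p for p in pairs if sim[p] > tau]

    def edge(y, a, b):
        return y.get((min(a, b), max(a, b)), 0)

    best_value, best_y = float("-inf"), {}
    for bits in product((0, 1), repeat=len(allowed)):
        y = dict(zip(allowed, bits))
        value = 0.0
        for j, k in pairs:
            y_jk = edge(y, j, k)
            # Smallest phi_jk with y_ij + y_ik <= 1 + y_jk + phi_jk for all i.
            phi = max([0] + [edge(y, i, j) + edge(y, i, k) - 1 - y_jk
                             for i in range(n_pages) if i not in (j, k)])
            psi = lam * abs(y_jk - sim[(j, k)])  # smallest feasible psi_jk
            # One zeta term per disambiguation set containing both pages.
            zeta = y_jk * sum(1 for m in disamb_sets if j in m and k in m)
            value += y_jk * sim[(j, k)] - u * phi - psi - zeta
        if value > best_value:
            best_value, best_y = value, y
    return {p for p, bit in best_y.items() if bit}, best_value


# Made-up similarities for four pages; pages 0 and 3 share a disambiguation page.
sim = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.85,
       (0, 3): 0.6, (1, 3): 0.2, (2, 3): 0.3}
edges, value = solve_edges(4, sim, [{0, 3}])
```

In this example the edge between pages 0 and 3 passes the threshold but is rejected, because it incurs both the disambiguation penalty and transitivity penalties against pages 1 and 2. A real implementation would hand the same model to a mixed integer programming solver.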

3.2) Treat each connected component of the weighted graph as one entity, which gives all pages describing that entity.
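Step 3.2) amounts to finding connected components over the selected edges. A minimal breadth-first sketch, with hypothetical integer page indices and a made-up edge set:

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# Pages 0, 1 and 2 are linked by selected edges; page 3 stays on its own.
entities = connected_components(range(4), {(0, 1), (0, 2), (1, 2)})
```

Each returned list corresponds to one entity; an isolated page forms an entity by itself.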

Compared with the prior art, the method of the invention has the following beneficial effects:

1. The method uses title synonyms to obtain title candidate sets, derives page candidate sets from them, and computes page similarity only within a page candidate set. This greatly reduces the scale of the problem and simplifies the subsequent steps of the algorithm.

2. Based on the page structure, the method extracts the Jaccard coefficients of seven text features and uses the random forest algorithm to compute the similarity between pages, which reflects how similar the pages are fairly accurately.

3. The method models the similarity between pages on a graph and uses a mixed linear programming model to determine the relationships between vertices, that is, between pages. These relationships define an undirected graph, from which all pages describing one entity can be obtained fairly accurately.

Brief description of the drawings

Fig. 1 is the overall flowchart of the invention;

Fig. 2 is the flowchart of step 2);

Fig. 3 is the flowchart of step 3);

Fig. 4 is the flowchart of step 4).

Detailed description

The invention is described in further detail below with reference to the drawings and specific embodiments.

As shown in Figs. 1 to 4, the steps of the knowledge graph construction method based on multi-source entity fusion are as follows:

1) Preprocess the encyclopedia pages: extract synonyms of encyclopedia titles; extract disambiguation pages; use the transitivity of synonymy to build synonym groups, all of which together form the set of synonym groups; build a candidate set from the pages corresponding to each synonym group in that set; and segment the text of the encyclopedia pages with a word segmentation tool.

2) Using the segmentation results of step 1), compute features between each pair of pages in the same candidate set. A classifier is trained so that each feature dimension receives a different weight, and the classifier is then used to compute the similarity between pages.

3) Build the weighted graph of the candidate set from the page similarities computed in step 2). Define the objective function of a mixed linear programming model and compute its maximum to obtain the connectivity between vertices. Treat each connected component of the weighted graph as one entity, which gives all pages describing the same entity.

Step 1) is as follows:

1.1) Extract synonyms of encyclopedia titles in the following two ways:

a) Template matching: match specific templates against the beginning of each page and the first sentence of its abstract. A successful match yields a synonym pair. The templates are defined manually and cover most of the patterns in which synonym pairs occur. For example, on a page that has synonyms, strings such as "A又名B" (A is also known as B), "A别称B" (A is alternatively called B), or "A是B的同义词" (A is a synonym of B) usually appear at the beginning of the page or in the first sentence of the abstract, so some synonym pairs can be obtained by regular expression matching.

b) Link redirection: follow a hyperlink in a page to another page. If the title of the target page differs from the text of the hyperlink, the two terms are considered synonyms.
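The template matching in a) can be sketched with Python's `re` module. The three patterns mirror the example strings given above; how the patent delimits A and B is not specified, so this sketch assumes each term is a run of word characters (`\w` matches CJK characters in Python 3), which is a simplification that would over-capture when B is followed directly by more text without punctuation.

```python
import re

# Hand-written templates, modeled on the patterns named in the text.
TEMPLATES = [
    re.compile(r"(?P<a>\w+?)(?:又名|别称)(?P<b>\w+)"),
    re.compile(r"(?P<a>\w+?)是(?P<b>\w+?)的同义词"),
]


def extract_synonym_pair(first_sentence):
    """Return (A, B) if any template matches the sentence, else None."""
    for pattern in TEMPLATES:
        match = pattern.search(first_sentence)
        if match:
            return match.group("a"), match.group("b")
    return None


pair_1 = extract_synonym_pair("番茄又名西红柿,是茄科植物。")
pair_2 = extract_synonym_pair("马铃薯是土豆的同义词")
pair_3 = extract_synonym_pair("苹果是一种水果")
```

Sentences that match no template, such as the third one, yield no pair; such titles can still gain synonyms through link redirection.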

1.2) Extract disambiguation pages: the k-th encyclopedia is denoted $\varepsilon_k = \{a_1, \dots, a_n\}$, where the maximum value of k is 3, $a_i$ denotes a page, and n denotes the total number of pages. From all pages listed on a disambiguation page, a disambiguation page set M can be extracted; no two pages in M can represent the same entity.

$$M = \{a_i \in \varepsilon_k \mid a_i \in M \neq a_j \in M\}$$

1.3) Extract candidate sets: synonymy is transitive, so if A and B are synonyms and A and C are synonyms, then B and C are also synonyms. This yields synonym groups $S_t$; all synonym groups $S_t$ together form the set of synonym groups, and any two elements within a synonym group are synonyms of each other.

Given $S_t$, find the pages from all encyclopedia sources whose titles belong to $S_t$; these pages form the candidate set $P_t$.

$$P_t = \{a \in \varepsilon_{1,\dots,K} \mid a.Title \in S_t\}$$

K is the total number of encyclopedias; a.Title is the title of page a.

1.4) Segment the text of the encyclopedia pages: segment five fields of each page (the abstract, the infobox keys and values, the links, the table of contents, and the user tags), then remove stop words and words shorter than 2 characters.

Step 2) includes:

2.1) Define the six fields of a page: title T, abstract A, infobox I, table of contents C, user tags G, and links L. A page is represented as a 6-tuple:

$$a = \{T, A, I, C, G, L\}$$

The infobox consists of key-value pairs, so $I = \{P, V\}$, where P denotes the attributes and V denotes the attribute values.

If two pages in the same candidate set describe the same entity, their text overlap will be relatively high. The following seven features are therefore defined:

1) Abstract feature

$$f_a(a_i, a_j) = \frac{|S_w(a_i.A) \cap S_w(a_j.A)|}{|S_w(a_i.A) \cup S_w(a_j.A)|}$$

2) Infobox attribute feature

$$f_p(a_i, a_j) = \frac{|S_w(a_i.I.P) \cap S_w(a_j.I.P)|}{|S_w(a_i.I.P) \cup S_w(a_j.I.P)|}$$

3) Infobox attribute value feature

$$f_v(a_i, a_j) = \frac{|S_w(a_i.I.V) \cap S_w(a_j.I.V)|}{|S_w(a_i.I.V) \cup S_w(a_j.I.V)|}$$

4) Table of contents feature

$$f_C(a_i, a_j) = \frac{|S_w(a_i.C) \cap S_w(a_j.C)|}{|S_w(a_i.C) \cup S_w(a_j.C)|}$$

5) User tag feature

$$f_g(a_i, a_j) = \frac{|S_w(a_i.G) \cap S_w(a_j.G)|}{|S_w(a_i.G) \cup S_w(a_j.G)|}$$

6) Link feature

$$f_l(a_i, a_j) = \frac{|S_w(a_i.L) \cap S_w(a_j.L)|}{|S_w(a_i.L) \cup S_w(a_j.L)|}$$

7) Global feature, where S denotes the string concatenation of the 6-tuple {T, A, I, C, G, L}

$$f_{all}(a_i, a_j) = \frac{|S_w(a_i.S) \cap S_w(a_j.S)|}{|S_w(a_i.S) \cup S_w(a_j.S)|}$$

$S_w(X)$ denotes the set of words obtained by segmenting the string X.

2.2) Use the seven features obtained in step 2.1) as the classifier input, train a binary classifier with the RandomForest algorithm from the Weka package, and then use this binary classifier to predict the similarity between two pages.

所述的步骤3)具体包括以下步骤:Described step 3) specifically comprises the following steps:

3.1)根据步骤2)计算得到的页面之间的相似度构建该候选集的权重图,两个结点之间的权重边用相似度表示。由此,将原问题转换成边的取舍问题。用yij表示两个结点之间是否有边:3.1) Construct the weight map of the candidate set according to the similarity between the pages calculated in step 2), and the weighted edge between two nodes is represented by the similarity. As a result, the original problem is transformed into a problem of side selection. Use y ij to indicate whether there is an edge between two nodes:

同时加入其他惩罚项和约束条件来构建混合线性规划模型:Also add other penalties and constraints to build a mixed linear programming model:

惩罚项1:Penalty item 1:

如果ai与aj有边,且ai与ak有边,那么aj与ak之间也应该有边,否则加入惩罚项φ,同时乘上系数u作为调整参数。因此对于φ,有下面的约束:If there is an edge between a i and a j , and there is an edge between a i and a k , then there should be an edge between a j and a k , otherwise a penalty term φ is added, and the coefficient u is multiplied as an adjustment parameter. So for φ, there are the following constraints:

ythe y ii jj ++ ythe y ii kk &le;&le; 11 ++ ythe y jj kk ++ &phi;&phi; jj kk ,, &ForAll;&ForAll; aa ii ,, aa jj ,, aa kk &Element;&Element; PP tt

φjk≥0φ jk ≥ 0

惩罚项2:Penalty item 2:

如果ai与aj之间的相似度越高,那么他们之间有边的概率越大。对于两个相似度很小的ai与aj,如果他们之间有边,则惩罚项较大,如果ai与aj的相似度较大,那么惩罚项较小。因此,用ψij表示惩罚项,用λ表示调整参数,该惩罚项用下式约束:If the similarity between a i and a j is higher, the probability of an edge between them is greater. For two a i and a j with very small similarity, if there is an edge between them, the penalty term will be larger, and if the similarity between a i and a j is greater, then the penalty term will be smaller. Therefore, use ψ ij to represent the penalty term, and λ to represent the adjustment parameter, and the penalty term is constrained by the following formula:

&lambda;&lambda; || ythe y ii jj -- sthe s ii mm (( aa ii ,, aa jj )) || &le;&le; &psi;&psi; ii jj ,, &ForAll;&ForAll; aa ii ,, aa jj &Element;&Element; PP tt

ψij≥0ψ ij ≥ 0

sim(ai,aj)为ai和aj之间的权重;sim(a i ,a j ) is the weight between a i and a j ;

Penalty term 3:

For $a_i$ and $a_j$ that appear in the same disambiguation page set $M$, $y_{ij} = 1$ indicates a matching error, so a penalty term $\zeta_{ij}$ is used to enforce that there is no edge between $a_i$ and $a_j$. The constraint is expressed as:

$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n, \; n = 1, 2, \dots, N$$

$$\zeta_{ij} \ge 0$$

where $N$ is the number of disambiguation page sets.

In addition, a threshold $\tau$ is set on the similarity: only pages $a_i$ and $a_j$ whose similarity exceeds $\tau$ may share an edge.

Combining the penalty terms and the threshold above gives the following objective function:

$$\text{maximize} \quad \sum_{a_i, a_j \in P_t} \left( y_{ij} \cdot sim(a_i, a_j) - u \cdot \varphi_{ij} - \psi_{ij} \right) - \sum_{n=1}^{N} \sum_{a_i, a_j \in M_n} \zeta_{ij}$$

$$\text{s.t.} \quad y_{ij} \in \{0, 1\}, \quad \varphi_{ij}, \psi_{ij}, \zeta_{ij} \ge 0$$

$$y_{ij} + y_{ik} \le 1 + y_{jk} + \varphi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$$

$$\lambda \, \lvert y_{ij} - sim(a_i, a_j) \rvert \le \psi_{ij}, \quad \forall a_i, a_j \in P_t$$

$$sim(a_i, a_j) > y_{ij} \cdot \tau, \quad \forall a_i, a_j \in P_t$$

$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n, \; n = 1, 2, \dots, N$$

Solving for the maximum of this objective function yields the corresponding edge variables $y_{ij}$.
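The edge-selection model can be explored on a toy candidate set without any solver. The sketch below is not the patent's implementation: it replaces the mixed linear program with brute-force enumeration of all 0/1 edge assignments, which is only feasible for a handful of pages. The function name `select_edges`, the default parameter values, and the example similarities are hypothetical. It also assumes that pairs are unordered, that each slack variable sits at its smallest feasible value, and that the strict inequality on $\zeta_{ij}$ is relaxed to $\zeta_{ij} = y_{ij}$.

```python
from itertools import combinations, product


def select_edges(sim, disamb_sets=(), u=1.0, lam=1.0, tau=0.5):
    """Brute-force the edge-selection objective for a tiny candidate set.

    sim maps frozenset({page_i, page_j}) to a similarity in [0, 1] and must
    contain every unordered pair of pages. disamb_sets lists sets of pages
    that come from the same disambiguation page.
    """
    pages = sorted({p for pair in sim for p in pair})
    pairs = [frozenset(c) for c in combinations(pages, 2)]
    forbidden = {frozenset(c) for m in disamb_sets
                 for c in combinations(sorted(m), 2)}
    best_value, best_edges = float("-inf"), set()
    for bits in product((0, 1), repeat=len(pairs)):
        y = dict(zip(pairs, bits))
        # Threshold: an edge is only allowed when the similarity exceeds tau
        # (only the y = 1 case of the threshold constraint is checked here).
        if any(y[p] == 1 and sim[p] <= tau for p in pairs):
            continue
        # Reward y * sim minus penalty 2, with psi = lam * |y - sim|.
        value = sum(y[p] * sim[p] - lam * abs(y[p] - sim[p]) for p in pairs)
        # Penalty 1: phi_jk is the smallest slack that satisfies
        # y_ij + y_ik <= 1 + y_jk + phi_jk for every third page i.
        for j, k in combinations(pages, 2):
            worst = max((y[frozenset((i, j))] + y[frozenset((i, k))]
                         - 1 - y[frozenset((j, k))]
                         for i in pages if i not in (j, k)), default=0)
            value -= u * max(0, worst)
        # Penalty 3: one unit per edge inside a disambiguation set.
        value -= sum(y[p] for p in pairs if p in forbidden)
        if value > best_value:
            best_value, best_edges = value, {p for p in pairs if y[p]}
    return best_edges, best_value


# Hypothetical similarities for three pages a, b, c.
toy_sim = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.8,
    frozenset(("b", "c")): 0.2,
}
edges, value = select_edges(toy_sim, u=1.0)
print(sorted(sorted(e) for e in edges), round(value, 3))
```

In this toy case, raising `u` makes the transitivity violation (a linked to both b and c while b and c stay unlinked) more expensive, so fewer edges survive. For realistic candidate sets an actual MILP solver would be needed, since enumeration grows exponentially with the number of pairs.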

3.2) Treat each connected component of the weighted graph as one entity, thereby obtaining all pages that describe that entity.
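Grouping pages into entities by connected component can be sketched with a small union-find structure. This is an illustrative sketch only: the function name `connected_components` and the page identifiers are hypothetical, and the selected edges are assumed to be given as pairs of page identifiers.

```python
def connected_components(pages, edges):
    """Group pages into entities: one entity per connected component."""
    parent = {p: p for p in pages}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for p in pages:
        groups.setdefault(find(p), set()).add(p)
    # Sorted output keeps the result deterministic.
    return sorted(sorted(g) for g in groups.values())


# Hypothetical page identifiers of the form "source:title".
pages = ["baidu:A", "hudong:A", "baidu:B", "hudong:C"]
selected_edges = [("baidu:A", "hudong:A")]
entities = connected_components(pages, selected_edges)
print(entities)
```

Pages without any selected edge end up as single-page entities, which matches the idea that every connected component, including an isolated node, is one entity.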

Example

The following example describes the implementation steps of the present invention in detail:

(1) The data set used in the example comes from Baidu Baike and Hudong Baike; Baidu Baike contributes 10,143,321 pages and Hudong Baike contributes 6,618,544 pages.

(2) For all pages in (1), analyze the section structure of each page, extract the title, abstract, table of contents, categories, links, infobox, and similar information, and store this information in a Lucene index. Every field other than the title may be empty.

(3) From all pages in (1), extract title synonyms. The extraction methods are mainly template matching and link redirection. The extracted synonym pairs are then merged into title synonym sets. Matching these title synonym sets against the page titles in (1) yields the candidate-set pages.

(4) Within the candidate-set pages obtained in (3), extract pairwise features between pages and use these features as input to train a random forest classifier. This step requires a manually labeled training set.

(5) Based on the similarity matrix obtained in step (4), build the mixed linear programming model. The model yields the relationship between each pair of vertices: 1 means the two vertices share an edge, and 0 means they do not. An undirected graph is built from these vertices and edges. Each connected component of the undirected graph is then extracted, and the pages in one connected component represent one entity.
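Step (3) of the example, merging synonym pairs into synonym sets and then matching titles to form candidate sets, can be sketched as follows. This is a minimal sketch under assumptions: the function names, the `(source, title)` page representation, and the sample titles are all hypothetical, and synonym pairs are assumed to be already extracted.

```python
from collections import defaultdict


def synonym_groups(pairs):
    """Merge synonym pairs into groups using the transitivity of synonymy."""
    adjacent = defaultdict(set)
    for a, b in pairs:
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen, groups = set(), []
    for start in adjacent:
        if start in seen:
            continue
        # Depth-first walk collects every title reachable from start.
        stack, group = [start], set()
        while stack:
            title = stack.pop()
            if title in group:
                continue
            group.add(title)
            stack.extend(adjacent[title] - group)
        seen |= group
        groups.append(group)
    return groups


def candidate_sets(groups, pages):
    """For each synonym group, collect the pages whose title is in the group."""
    return [[page for page in pages if page[1] in group] for group in groups]


# Hypothetical synonym pairs and (source, title) pages.
pairs = [("USA", "United States"), ("United States", "America")]
pages = [("baidu", "USA"), ("hudong", "America"), ("baidu", "Canada")]
groups = synonym_groups(pairs)
candidates = candidate_sets(groups, pages)
print(groups, candidates)
```

Here "USA" and "America" fall into the same group only through their shared synonym "United States", so the two pages from different sources land in one candidate set while the unrelated page is left out.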

Results of running this example:

For the similarity calculation, five methods were compared, and the random forest classifier performed best. Using four evaluation metrics (Precision, Recall, F1, and Accuracy), the method used in the present invention (SCM) was compared with other methods, namely greedy matching (GA), hierarchical clustering (AC), minimum spanning tree clustering (MSTC), and co-clustering (CC). The results are shown in the following table:

| Method | Precision | Recall | F1 | Accuracy |
|--------|-----------|--------|-------|----------|
| GA | 78.3% | 76.1% | 77.2% | 91.6% |
| AC | 73.0% | 79.0% | 75.9% | 91.5% |
| MSTC | 63.4% | 80.5% | 71% | 88.8% |
| CC | 62.4% | 65.5% | 63.9% | 87.4% |
| SCM | 75.8% | 82.5% | 79.0% | 92.5% |

The comparison in the table above shows that this method outperforms the other methods on both F1 and Accuracy. The method therefore has good practical value and application prospects for entity matching.
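As a quick consistency check on the table above, F1 can be recomputed from the reported Precision and Recall using the standard harmonic-mean definition. The numbers below are copied from the table; the helper name `f1_score` is hypothetical, and nothing here reproduces the experiment itself.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table, in percent.
reported = {
    "GA": (78.3, 76.1, 77.2),
    "AC": (73.0, 79.0, 75.9),
    "MSTC": (63.4, 80.5, 71.0),
    "CC": (62.4, 65.5, 63.9),
    "SCM": (75.8, 82.5, 79.0),
}

for method, (p, r, f1) in reported.items():
    print(f"{method}: recomputed F1 = {f1_score(p, r):.1f}%, reported = {f1}%")
```

The recomputed values agree with the reported F1 column to within rounding, and SCM has the highest F1 of the five methods.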

Claims (4)

1. A knowledge graph construction method based on multi-source entity fusion, characterized by comprising the following steps:

1) Preprocess the encyclopedia pages: extract synonyms of encyclopedia titles, extract disambiguation pages, and use the transitivity of synonymy to build synonym groups; all synonym groups form a synonym group set. Build a candidate set from the pages corresponding to each synonym group in the synonym group set, and segment the text of the encyclopedia pages with a word segmentation tool.

2) Using the segmentation results of step 1), compute features between every pair of pages in the same candidate set, train a classifier to assign a different weight to each feature dimension, and use this classifier to compute the similarity between pages.

3) Construct a weighted graph for the candidate set from the page similarities computed in step 2), define the objective function of a mixed linear programming model, and compute the maximum of the objective function to obtain the connectivity between vertices. Treat each connected component of the weighted graph as one entity, thereby obtaining all pages that describe the same entity.

2. The knowledge graph construction method based on multi-source entity fusion according to claim 1, characterized in that step 1) comprises:

1.1) Extract synonyms of encyclopedia titles in the following two ways:

a) Template matching: use specific templates to match the beginning of each page and the first sentence of its abstract; a successful match yields a synonym pair. The templates are defined manually and cover most patterns in which synonym pairs occur.

b) Link redirection: follow a hyperlink in a page to another page; if the title of that page differs from the text of the hyperlink, the two terms are considered synonyms.

1.2) Extract disambiguation pages: the $k$-th encyclopedia is denoted $\varepsilon_k = \{a_1, a_2, \dots, a_n\}$, with $k$ at most 3, where $a_i$ denotes a page and $n$ the total number of pages. From all pages that appear on a disambiguation page, a disambiguation page set $M$ can be extracted; no two pages in $M$ can represent the same entity.

$$M = \{ a_i \in \varepsilon_k \mid a_i \in M \ne a_j \in M \}$$

1.3) Extract candidate sets: by the transitivity of synonymy, if A and B are synonyms and A and C are synonyms, then B and C are also synonyms. This yields synonym groups $S_t$; all synonym groups $S_t$ form the synonym group set, and any two elements within one synonym group are synonyms of each other. Given $S_t$, find the pages in all encyclopedia sources whose titles belong to $S_t$; these pages form the candidate set $P_t$.

$$P_t = \{ a \in \varepsilon_{1,\dots,K} \mid a.Title \in S_t \}$$

where $K$ is the total number of encyclopedias and $a.Title$ is the title of page $a$.

1.4) Segment the text of the encyclopedia pages: segment five fields of each page, namely the abstract, infobox (keys and values), links, table of contents, and user tags, and remove stop words and words shorter than 2 characters.

3. The knowledge graph construction method based on multi-source entity fusion according to claim 1, characterized in that step 2) comprises:

2.1) Define the six fields of a page, namely title $T$, abstract $A$, infobox $I$, table of contents $C$, user tags $G$, and links $L$, and represent a page as a 6-tuple:

$$a = \{T, A, I, C, G, L\}$$

The infobox is represented as key-value pairs, so $I = \{P, V\}$, where $P$ denotes the attributes and $V$ the attribute values.

For two pages in the same candidate set, if they describe the same entity, their text overlap will be relatively high. The following seven features are therefore defined:

1) Abstract feature

$$f_a(a_i, a_j) = \frac{\lvert S_w(a_i.A) \cap S_w(a_j.A) \rvert}{\lvert S_w(a_i.A) \cup S_w(a_j.A) \rvert}$$

2) Infobox attribute feature

$$f_p(a_i, a_j) = \frac{\lvert S_w(a_i.I.P) \cap S_w(a_j.I.P) \rvert}{\lvert S_w(a_i.I.P) \cup S_w(a_j.I.P) \rvert}$$

3) Infobox attribute value feature

$$f_v(a_i, a_j) = \frac{\lvert S_w(a_i.I.V) \cap S_w(a_j.I.V) \rvert}{\lvert S_w(a_i.I.V) \cup S_w(a_j.I.V) \rvert}$$

4) Table of contents feature

$$f_C(a_i, a_j) = \frac{\lvert S_w(a_i.C) \cap S_w(a_j.C) \rvert}{\lvert S_w(a_i.C) \cup S_w(a_j.C) \rvert}$$

5) User tag feature

$$f_g(a_i, a_j) = \frac{\lvert S_w(a_i.G) \cap S_w(a_j.G) \rvert}{\lvert S_w(a_i.G) \cup S_w(a_j.G) \rvert}$$

6) Link feature

$$f_l(a_i, a_j) = \frac{\lvert S_w(a_i.L) \cap S_w(a_j.L) \rvert}{\lvert S_w(a_i.L) \cup S_w(a_j.L) \rvert}$$

7) Global feature, where $S$ denotes the string concatenation of the 6-tuple $\{T, A, I, C, G, L\}$

$$f_{all}(a_i, a_j) = \frac{\lvert S_w(a_i.S) \cap S_w(a_j.S) \rvert}{\lvert S_w(a_i.S) \cup S_w(a_j.S) \rvert}$$

$S_w(X)$ denotes the set of tokens obtained by segmenting the string $X$.

2.2) Use the seven features obtained in step 2.1) as classifier input, train a binary classifier with the RandomForest algorithm from the Weka package, and then use this binary classifier to predict the similarity between two pages.

4. The knowledge graph construction method based on multi-source entity fusion according to claim 1, characterized in that step 3) specifically comprises the following steps:

3.1) Construct a weighted graph for the candidate set from the page similarities computed in step 2); the weight of the edge between two nodes is their similarity. The original problem thereby becomes one of deciding which edges to keep. Let $y_{ij}$ indicate whether there is an edge between two nodes. Further penalty terms and constraints are then added to build a mixed linear programming model:

Penalty term 1:

If $a_i$ and $a_j$ share an edge and $a_i$ and $a_k$ share an edge, then $a_j$ and $a_k$ should also share an edge; otherwise a penalty term $\varphi$ is incurred, scaled by a coefficient $u$ that serves as a tuning parameter. $\varphi$ is therefore subject to the following constraints:

$$y_{ij} + y_{ik} \le 1 + y_{jk} + \varphi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$$

$$\varphi_{jk} \ge 0$$

Penalty term 2:

The higher the similarity between $a_i$ and $a_j$, the more likely they are to share an edge. For two pages with very low similarity, the penalty is large if they share an edge; if their similarity is high, the penalty is small. Let $\psi_{ij}$ denote the penalty term and $\lambda$ a tuning parameter; the penalty term is constrained by:

$$\lambda \, \lvert y_{ij} - sim(a_i, a_j) \rvert \le \psi_{ij}, \quad \forall a_i, a_j \in P_t$$

$$\psi_{ij} \ge 0$$

where $sim(a_i, a_j)$ is the weight between $a_i$ and $a_j$.

Penalty term 3:

For $a_i$ and $a_j$ that appear in the same disambiguation page set $M$, $y_{ij} = 1$ indicates a matching error, so a penalty term $\zeta_{ij}$ is used to enforce that there is no edge between $a_i$ and $a_j$. The constraint is expressed as:

$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n, \; n = 1, 2, \dots, N$$

$$\zeta_{ij} \ge 0$$

where $N$ is the number of disambiguation page sets.

In addition, a threshold $\tau$ is set on the similarity: only pages $a_i$ and $a_j$ whose similarity exceeds $\tau$ may share an edge.

Combining the penalty terms and the threshold above gives the following objective function:

$$\text{maximize} \quad \sum_{a_i, a_j \in P_t} \left( y_{ij} \cdot sim(a_i, a_j) - u \cdot \varphi_{ij} - \psi_{ij} \right) - \sum_{n=1}^{N} \sum_{a_i, a_j \in M_n} \zeta_{ij}$$

$$\text{s.t.} \quad y_{ij} \in \{0, 1\}, \quad \varphi_{ij}, \psi_{ij}, \zeta_{ij} \ge 0$$

$$y_{ij} + y_{ik} \le 1 + y_{jk} + \varphi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$$

$$\lambda \, \lvert y_{ij} - sim(a_i, a_j) \rvert \le \psi_{ij}, \quad \forall a_i, a_j \in P_t$$

$$sim(a_i, a_j) > y_{ij} \cdot \tau, \quad \forall a_i, a_j \in P_t$$

$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n, \; n = 1, 2, \dots, N$$

Solving for the maximum of this objective function yields the corresponding edge variables $y_{ij}$.

3.2) Treat each connected component of the weighted graph as one entity, thereby obtaining all pages that describe that entity.
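The seven pairwise features in claim 3 all take the form of a Jaccard overlap between token sets. The sketch below shows one way to compute such a feature vector, assuming each page is a dictionary of already-segmented token lists. The field names, the function names, the choice to return 0.0 for two empty fields, and the approximation of the concatenated string $S$ by the union of all field tokens are assumptions for illustration; the patent does not specify them.

```python
# Hypothetical field names standing in for A, I.P, I.V, C, G, L.
FIELDS = ("abstract", "infobox_keys", "infobox_values", "toc", "tags", "links")


def jaccard(tokens_a, tokens_b):
    """|A intersect B| / |A union B|; 0.0 when both sets are empty (assumption)."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def all_tokens(page):
    """Approximate S_w(a.S): tokens of the title plus all other fields."""
    tokens = set(page.get("title", ()))
    for field in FIELDS:
        tokens |= set(page.get(field, ()))
    return tokens


def pair_features(page_i, page_j):
    """Return the 7-dimensional feature vector for a pair of pages."""
    features = [jaccard(page_i.get(f, ()), page_j.get(f, ())) for f in FIELDS]
    features.append(jaccard(all_tokens(page_i), all_tokens(page_j)))
    return features


# Hypothetical, already-segmented pages.
page_1 = {"title": ["potato"], "abstract": ["root", "vegetable", "starch"],
          "tags": ["food"]}
page_2 = {"title": ["potato"], "abstract": ["root", "vegetable", "crop"],
          "tags": ["food", "plant"]}
features = pair_features(page_1, page_2)
print(features)
```

A vector like this would then be passed to a trained binary classifier, whose predicted score serves as the similarity between the two pages.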
CN201610583823.0A 2016-07-22 2016-07-22 Knowledge graph construction method based on multi-source entity fusion Active CN106250412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610583823.0A CN106250412B (en) 2016-07-22 2016-07-22 Knowledge graph construction method based on multi-source entity fusion

Publications (2)

Publication Number Publication Date
CN106250412A true CN106250412A (en) 2016-12-21
CN106250412B CN106250412B (en) 2019-04-23

Family

ID=57604424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610583823.0A Active CN106250412B (en) 2016-07-22 2016-07-22 Knowledge graph construction method based on multi-source entity fusion

Country Status (1)

Country Link
CN (1) CN106250412B (en)

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777331A (en) * 2017-01-11 2017-05-31 北京航空航天大学 Knowledge mapping generation method and device
CN106844658A (en) * 2017-01-23 2017-06-13 中山大学 A kind of Chinese text knowledge mapping method for auto constructing and system
CN106909643A (en) * 2017-02-20 2017-06-30 同济大学 The social media big data motif discovery method of knowledge based collection of illustrative plates
CN107038257A (en) * 2017-05-10 2017-08-11 浙江大学 A kind of city Internet of Things data analytical framework of knowledge based collection of illustrative plates
CN107220386A (en) * 2017-06-29 2017-09-29 北京百度网讯科技有限公司 Information-pushing method and device
CN107423820A (en) * 2016-05-24 2017-12-01 清华大学 The knowledge mapping of binding entity stratigraphic classification represents learning method
CN108182295A (en) * 2018-02-09 2018-06-19 重庆誉存大数据科技有限公司 A kind of Company Knowledge collection of illustrative plates attribute extraction method and system
CN108399180A (en) * 2017-02-08 2018-08-14 腾讯科技(深圳)有限公司 A kind of knowledge mapping construction method, device and server
CN108694177A (en) * 2017-04-06 2018-10-23 北大方正集团有限公司 Knowledge mapping construction method and system
CN108777635A (en) * 2018-05-24 2018-11-09 梧州井儿铺贸易有限公司 A kind of Enterprise Equipment Management System
CN109033129A (en) * 2018-06-04 2018-12-18 桂林电子科技大学 Multi-source Information Fusion knowledge mapping based on adaptive weighting indicates learning method
CN109284394A (en) * 2018-09-12 2019-01-29 青岛大学 A method for building enterprise knowledge graph from the perspective of multi-source data integration
CN109522547A (en) * 2018-10-23 2019-03-26 浙江大学 Chinese synonym iteration abstracting method based on pattern learning
CN109657069A (en) * 2018-12-11 2019-04-19 北京百度网讯科技有限公司 The generation method and its device of knowledge mapping
CN109857872A (en) * 2019-02-18 2019-06-07 浪潮软件集团有限公司 The information recommendation method and device of knowledge based map
CN109902144A (en) * 2019-01-11 2019-06-18 杭州电子科技大学 An Entity Alignment Method Based on Improved WMD Algorithm
CN110209839A (en) * 2019-06-18 2019-09-06 卓尔智联(武汉)研究院有限公司 Agricultural knowledge map construction device, method and computer readable storage medium
CN110245198A (en) * 2019-06-18 2019-09-17 北京百度网讯科技有限公司 Multi-source ticketing data management method and system, server and computer readable medium
CN110377747A (en) * 2019-06-10 2019-10-25 河海大学 A kind of knowledge base fusion method towards encyclopaedia website
CN110427612A (en) * 2019-07-02 2019-11-08 平安科技(深圳)有限公司 Based on multilingual entity disambiguation method, device, equipment and storage medium
CN111708891A (en) * 2019-03-01 2020-09-25 九阳股份有限公司 Food material entity linking method and device among multi-source food material data
CN111813962A (en) * 2020-09-07 2020-10-23 北京富通东方科技有限公司 Entity similarity calculation method for knowledge graph fusion
CN111881290A (en) * 2020-06-17 2020-11-03 国家电网有限公司 A multi-source grid entity fusion method for distribution network based on weighted semantic similarity
CN112115328A (en) * 2020-08-24 2020-12-22 苏宁金融科技(南京)有限公司 Page flow map construction method and device and computer readable storage medium
CN112163094A (en) * 2020-08-25 2021-01-01 中国科学院计算机网络信息中心 Scientific and technological resource convergence and continuous service method and device
CN112328812A (en) * 2021-01-05 2021-02-05 成都数联铭品科技有限公司 Domain knowledge extraction method and system based on self-adjusting parameters and electronic equipment
CN113139050A (en) * 2021-05-10 2021-07-20 桂林电子科技大学 Text abstract generation method based on named entity identification additional label and priori knowledge
CN113157861A (en) * 2021-04-12 2021-07-23 山东新一代信息产业技术研究院有限公司 Entity alignment method fusing Wikipedia
CN113326686A (en) * 2020-02-28 2021-08-31 株式会社斯库林集团 Similarity calculation device, recording medium, and similarity calculation method
CN113392220A (en) * 2020-10-23 2021-09-14 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN114153839A (en) * 2021-10-29 2022-03-08 杭州未名信科科技有限公司 Integration method, device, equipment and storage medium of multi-source heterogeneous data
CN114880468A (en) * 2022-04-21 2022-08-09 淮阴工学院 Building specification examination method and system based on BilSTM and knowledge graph
US11487832B2 (en) * 2018-09-27 2022-11-01 Google Llc Analyzing web pages to facilitate automatic navigation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049569A (en) * 2012-12-31 2013-04-17 武汉传神信息技术有限公司 Text similarity matching method on basis of vector space model
CN103729343A (en) * 2013-10-10 2014-04-16 上海交通大学 Semantic ambiguity eliminating method based on encyclopedia link co-occurrence
CN105787105A (en) * 2016-03-21 2016-07-20 浙江大学 Iterative-model-based establishment method of Chinese encyclopedic knowledge graph classification system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
楼仁杰: "基于中文百科的知识图谱分类体系构建研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
王龙甫: "基于中文百科的概念知识库构建", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423820A (en) * 2016-05-24 2017-12-01 清华大学 The knowledge mapping of binding entity stratigraphic classification represents learning method
CN106777331A (en) * 2017-01-11 2017-05-31 北京航空航天大学 Knowledge mapping generation method and device
CN106844658A (en) * 2017-01-23 2017-06-13 中山大学 A kind of Chinese text knowledge mapping method for auto constructing and system
CN106844658B (en) * 2017-01-23 2019-12-13 中山大学 A method and system for automatically constructing a Chinese text knowledge graph
CN108399180B (en) * 2017-02-08 2021-11-26 腾讯科技(深圳)有限公司 Knowledge graph construction method and device and server
CN108399180A (en) * 2017-02-08 2018-08-14 腾讯科技(深圳)有限公司 A kind of knowledge mapping construction method, device and server
CN106909643B (en) * 2017-02-20 2020-08-14 同济大学 Knowledge graph-based social media big data topic discovery method
CN106909643A (en) * 2017-02-20 2017-06-30 同济大学 The social media big data motif discovery method of knowledge based collection of illustrative plates
CN108694177A (en) * 2017-04-06 2018-10-23 北大方正集团有限公司 Knowledge mapping construction method and system
CN107038257A (en) * 2017-05-10 2017-08-11 浙江大学 A kind of city Internet of Things data analytical framework of knowledge based collection of illustrative plates
CN107220386A (en) * 2017-06-29 2017-09-29 北京百度网讯科技有限公司 Information-pushing method and device
CN107220386B (en) * 2017-06-29 2020-10-02 北京百度网讯科技有限公司 Information pushing method and device
CN108182295A (en) * 2018-02-09 2018-06-19 重庆誉存大数据科技有限公司 A kind of Company Knowledge collection of illustrative plates attribute extraction method and system
CN108182295B (en) * 2018-02-09 2021-09-10 重庆电信系统集成有限公司 Enterprise knowledge graph attribute extraction method and system
CN108777635A (en) * 2018-05-24 2018-11-09 梧州井儿铺贸易有限公司 A kind of Enterprise Equipment Management System
CN109033129B (en) * 2018-06-04 2021-08-03 桂林电子科技大学 Multi-source information fusion knowledge graph representation learning method based on adaptive weights
CN109033129A (en) * 2018-06-04 2018-12-18 桂林电子科技大学 Multi-source Information Fusion knowledge mapping based on adaptive weighting indicates learning method
CN109284394A (en) * 2018-09-12 2019-01-29 青岛大学 A method for building enterprise knowledge graph from the perspective of multi-source data integration
US11487832B2 (en) * 2018-09-27 2022-11-01 Google Llc Analyzing web pages to facilitate automatic navigation
US11971936B2 (en) 2018-09-27 2024-04-30 Google Llc Analyzing web pages to facilitate automatic navigation
CN109522547A (en) * 2018-10-23 2019-03-26 浙江大学 Chinese synonym iteration abstracting method based on pattern learning
CN109657069A (en) * 2018-12-11 2019-04-19 北京百度网讯科技有限公司 The generation method and its device of knowledge mapping
CN109902144A (en) * 2019-01-11 2019-06-18 杭州电子科技大学 An Entity Alignment Method Based on Improved WMD Algorithm
CN109902144B (en) * 2019-01-11 2020-01-31 杭州电子科技大学 An Entity Alignment Method Based on Improved WMD Algorithm
CN109857872A (en) * 2019-02-18 2019-06-07 浪潮软件集团有限公司 The information recommendation method and device of knowledge based map
CN111708891A (en) * 2019-03-01 2020-09-25 九阳股份有限公司 Food material entity linking method and device among multi-source food material data
CN111708891B (en) * 2019-03-01 2023-12-08 九阳股份有限公司 Food material entity linking method and device between multi-source food material data
CN110377747A (en) * 2019-06-10 2019-10-25 河海大学 A kind of knowledge base fusion method towards encyclopaedia website
CN110377747B (en) * 2019-06-10 2021-12-07 河海大学 Knowledge base fusion method for encyclopedic website
CN110209839A (en) * 2019-06-18 2019-09-06 卓尔智联(武汉)研究院有限公司 Agricultural knowledge map construction device, method and computer readable storage medium
CN110245198A (en) * 2019-06-18 2019-09-17 北京百度网讯科技有限公司 Multi-source ticketing data management method and system, server and computer readable medium
CN110427612A (en) * 2019-07-02 2019-11-08 平安科技(深圳)有限公司 Based on multilingual entity disambiguation method, device, equipment and storage medium
CN113326686B (en) * 2020-02-28 2024-05-10 株式会社斯库林集团 Similarity calculation device, recording medium, and similarity calculation method
CN113326686A (en) * 2020-02-28 2021-08-31 株式会社斯库林集团 Similarity calculation device, recording medium, and similarity calculation method
CN111881290A (en) * 2020-06-17 2020-11-03 国家电网有限公司 A multi-source grid entity fusion method for distribution network based on weighted semantic similarity
CN112115328A (en) * 2020-08-24 2020-12-22 苏宁金融科技(南京)有限公司 Page flow map construction method and device and computer readable storage medium
CN112115328B (en) * 2020-08-24 2022-08-19 苏宁金融科技(南京)有限公司 Page flow map construction method and device and computer readable storage medium
CN112163094A (en) * 2020-08-25 2021-01-01 中国科学院计算机网络信息中心 Scientific and technological resource convergence and continuous service method and device
CN111813962A (en) * 2020-09-07 2020-10-23 北京富通东方科技有限公司 Entity similarity calculation method for knowledge graph fusion
CN111813962B (en) * 2020-09-07 2020-12-18 北京富通东方科技有限公司 Entity similarity calculation method for knowledge graph fusion
CN113392220B (en) * 2020-10-23 2024-03-26 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN113392220A (en) * 2020-10-23 2021-09-14 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN112328812A (en) * 2021-01-05 2021-02-05 成都数联铭品科技有限公司 Domain knowledge extraction method and system based on self-adjusting parameters and electronic equipment
CN113157861B (en) * 2021-04-12 2022-05-24 山东浪潮科学研究院有限公司 Entity alignment method fusing Wikipedia
CN113157861A (en) * 2021-04-12 2021-07-23 山东新一代信息产业技术研究院有限公司 Entity alignment method fusing Wikipedia
CN113139050A (en) * 2021-05-10 2021-07-20 桂林电子科技大学 Text abstract generation method based on named entity identification additional label and priori knowledge
CN114153839A (en) * 2021-10-29 2022-03-08 杭州未名信科科技有限公司 Integration method, device, equipment and storage medium of multi-source heterogeneous data
CN114153839B (en) * 2021-10-29 2024-09-20 杭州未名信科科技有限公司 Multi-source heterogeneous data integration method, device, equipment and storage medium
CN114880468A (en) * 2022-04-21 2022-08-09 淮阴工学院 Building specification examination method and system based on BilSTM and knowledge graph

Also Published As

Publication number Publication date
CN106250412B (en) 2019-04-23

Similar Documents

Publication Publication Date Title
CN106250412A (en) The knowledge mapping construction method merged based on many source entities
Pham et al. Semantic labeling: a domain-independent approach
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
CN111680173A (en) A CMR Model for Unified Retrieval of Cross-Media Information
US20170140044A1 (en) Systems and methods for semantic concept definition and semantic concept relationship synthesis utilizing existing domain definitions
CN103903164B (en) Semi-supervised aspect extraction method and its system based on realm information
CN108595708A (en) A kind of exception information file classification method of knowledge based collection of illustrative plates
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
Liang et al. GLTM: a global and local word embedding-based topic model for short texts
CN114048305B (en) Class case recommendation method of administrative punishment document based on graph convolution neural network
Sarkhel et al. Visual segmentation for information extraction from heterogeneous visually rich documents
CN103559199B (en) Method for abstracting web page information and device
CN115329101A (en) A method and device for constructing a standard knowledge graph of the power Internet of things
CN106055675A (en) Relation extracting method based on convolution neural network and distance supervision
CN116127090A (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN106599054A (en) Method and system for title classification and push
Yuan-Jie et al. Web service classification based on automatic semantic annotation and ensemble learning
CN106844349A (en) Comment spam recognition methods based on coorinated training
CN108319583A (en) Method and system for extracting knowledge from Chinese language material library
CN106156023A (en) The methods, devices and systems of semantic matches
CN106096609A (en) A kind of merchandise query keyword automatic generation method based on OCR
CN107391565A (en) A kind of across language hierarchy taxonomic hierarchies matching process based on topic model
CN112925901A (en) Evaluation resource recommendation method for assisting online questionnaire evaluation and application thereof
CN114997288A (en) Design resource association method
CN111553160A (en) A method and system for obtaining answers to questions in the legal field

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20161221

Assignee: TONGDUN HOLDINGS Co.,Ltd.

Assignor: ZHEJIANG University

Contract record no.: X2021990000612

Denomination of invention: Construction method of knowledge map based on multi-source entity fusion

Granted publication date: 20190423

License type: Common License

Record date: 20211012

EE01 Entry into force of recordation of patent licensing contract