CN108052683A

CN108052683A - A kind of knowledge mapping based on cosine measurement rule represents learning method

Info

Publication number: CN108052683A
Application number: CN201810058745.1A
Authority: CN
Inventors: 常亮; 饶官军; 古天龙; 罗义琴; 祝曼丽; 徐周波
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2018-01-22
Filing date: 2018-01-22
Publication date: 2018-05-18
Anticipated expiration: 2038-01-22
Also published as: CN108052683B

Abstract

The invention discloses a learning method for knowledge graph representation based on cosine metric rules. Firstly, entities and relationships in the knowledge graph are randomly embedded into two vector spaces; secondly, by using the statistical rules of candidate entities, the triplet sets and Candidate entity vector set; again use the cosine similarity to construct the scoring function of the target vector and the candidate entity, and evaluate the candidate entity; finally, use the loss function to uniformly train all the candidate entity vectors and target vectors related to each other, and use stochastic gradient descent The algorithm minimizes a loss function. When the optimization goal is achieved, the best representation of each entity vector and relationship vector in the knowledge graph can be obtained, so as to better represent the connection between the entity and the relationship, and can be well applied to large-scale knowledge graph complementation. All of them.

Description

A Knowledge Graph Representation Learning Method Based on Cosine Metric Rule

技术领域technical field

本发明涉及知识图谱技术领域，具体涉及一种基于余弦度量规则的知识图谱表示学习方法。The invention relates to the technical field of knowledge graphs, in particular to a knowledge graph representation learning method based on cosine metric rules.

背景技术Background technique

随着大数据时代的到来，数据的红利使人工智能技术得到了前所未有飞速发展。以知识图谱为主要代表的知识工程以及表示学习为主要代表的机器学习等相关领域得到了长足进步。一方面，随着表示学习对大数据红利的消耗殆尽，使得表示学习模型效果趋于瓶颈。另一方面，伴随这大量的知识图谱不断涌现，而这些蕴含这大量的人类先验知识宝库却仍未被表示学习有效的利用。融合知识图谱与表示学习成为了进一步提高表示学习模型效果的重要思路之一。以知识图谱为代表的符号主义、以表示学习为代表的联结主义，日益脱离原先各自独立发展的轨道，走上协同并进的新道路。With the advent of the era of big data, the dividend of data has made artificial intelligence technology develop at an unprecedented speed. Great progress has been made in related fields such as knowledge engineering represented by knowledge graph and machine learning represented by representation learning. On the one hand, as the dividends of big data are exhausted by representation learning, the effect of representation learning models tends to become a bottleneck. On the other hand, with the continuous emergence of a large number of knowledge graphs, these treasures of human prior knowledge have not been effectively utilized by representation learning. The fusion of knowledge graph and representation learning has become one of the important ideas to further improve the effect of representation learning model. The symbolism represented by the knowledge map and the connectionism represented by the representation learning are increasingly deviating from the original independent development track and embarking on a new path of synergy.

知识图谱本质上是一种语义网络，表达了各类实体、概念及其之间的语义关系。相对于传统知识表示形式(诸如本体、传统语义网络)，知识图谱具有实体/概念覆盖率高、语义关系多样、结构友好(通常表示为RDF格式)以及质量较高等优势，从而使得知识图谱日益成为大数据时代和人工智能时代最为主要的知识表示方式。The knowledge graph is essentially a semantic network, which expresses various entities, concepts and the semantic relationship between them. Compared with traditional knowledge representation forms (such as ontology and traditional semantic network), knowledge graph has the advantages of high entity/concept coverage, diverse semantic relations, friendly structure (usually expressed as RDF format) and high quality, which makes knowledge graph increasingly become a It is the most important knowledge representation method in the era of big data and artificial intelligence.

三元组是知识图谱的一种通用的表示形式，三元组的基本形式包括(实体1，关系，实体2)和(概念，属性，属性值)等，实体是知识图谱中最基本的元素，不同实体之间存在不同的关系。概念主要包括集合、对象类型、事物种类，如地理、人物等；属性是指对象具有的属性特征、特性，如国籍出生日期；属性值则是指属性所对应的值，如中国，1993-01-12等。通常使用(head,relation,tail)(简写为(h,r,t))来表示三元组，其中r表示头实体h和尾实体t之间的关系。如巴黎是法国首都这一知识，在知识图谱中可以使用(巴黎，是......首都，法国)这一三元组表示。Triple is a general representation of knowledge graph. The basic form of triple includes (entity 1, relationship, entity 2) and (concept, attribute, attribute value), etc. Entity is the most basic element in knowledge graph. , there are different relationships between different entities. Concepts mainly include collections, object types, and types of things, such as geography, people, etc.; attributes refer to the attribute characteristics and characteristics of objects, such as nationality and date of birth; attribute values refer to the values corresponding to attributes, such as China, 1993-01 -12 etc. Usually (head, relation, tail) (abbreviated as (h, r, t)) is used to represent a triple, where r represents the relationship between the head entity h and the tail entity t. For example, the knowledge that Paris is the capital of France can be represented by the triplet (Paris, is...the capital, France) in the knowledge graph.

知识图谱的表示学习旨在将知识图谱中的实体和关系嵌入到低维向量空间，将其表示为稠密低维实值向量。其关键是合理定义知识图谱中关于事实(三元组(h,r,t))的损失函数f_r(h,t)与三元组的两个实体h、t的向量化表示。通常情况下，当事实(h,r,t)成立时，期望最小化f_r(h,t)。考虑整个知识图谱的事实，则可通过最小化损失函数来学得实体与关系的向量表示。不同的表示学习可以使用不同的原则和方法定义相应的损失函数。当前以TransE模型为代表的翻译模型以其突出的性能与简单的模型参数受到了广泛的关注。然而，现有的翻译模型可以有效的处理1-1的简单关系，但对于1-N、N-1、N-N的复杂关系仍受到限制。这就导致现有知识图谱表示学习方法无法很好地应用于大规模的知识图谱。Representation learning for knowledge graphs aims to embed entities and relationships in knowledge graphs into low-dimensional vector spaces and represent them as dense low-dimensional real-valued vectors. The key is to reasonably define the vectorized representation of the loss function f _r (h, t) of the fact (triple (h, r, t)) in the knowledge graph and the two entities h, t of the triple. In general, it is desirable to minimize f _r (h,t) when the fact (h,r,t) holds. Considering the facts of the entire knowledge graph, the vector representation of entities and relationships can be learned by minimizing the loss function. Different representation learning can use different principles and methods to define corresponding loss functions. The current translation model represented by the TransE model has received extensive attention for its outstanding performance and simple model parameters. However, existing translation models can effectively deal with simple relations of 1-1, but are still limited for complex relations of 1-N, N-1, NN. This leads to the fact that existing knowledge graph representation learning methods cannot be well applied to large-scale knowledge graphs.

发明内容Contents of the invention

本发明所要解决的是现有知识图谱表示学习方法无法很好地处理1-N、N-1、N-N的复杂关系的问题，提供一种基于余弦度量规则的知识图谱表示学习方法。What the present invention aims to solve is the problem that the existing knowledge graph representation learning method cannot handle complex relationships of 1-N, N-1, and N-N well, and provides a knowledge graph representation learning method based on the cosine metric rule.

为解决上述问题，本发明是通过以下技术方案实现的：In order to solve the above problems, the present invention is achieved through the following technical solutions:

一种基于余弦度量规则的知识图谱表示学习方法，包括步骤如下：A knowledge map representation learning method based on the cosine metric rule, comprising the following steps:

步骤1、利用向量随机生成方法将知识图谱中的实体集和关系集分别嵌入到实体向量空间和关系向量空间，得到实体向量和关系向量；Step 1. Use the vector random generation method to embed the entity set and relation set in the knowledge map into the entity vector space and the relation vector space respectively to obtain the entity vector and relation vector;

步骤2、利用候选实体统计规则，获得随机选择三元组的候选实体向量集，并根据候选实体向量集随机生成错误实体向量集；Step 2, using the statistical rules of candidate entities to obtain a set of candidate entity vectors for randomly selected triples, and randomly generate a set of wrong entity vectors according to the set of candidate entity vectors;

步骤3、利用余弦相似度构造目标向量与候选实体向量之间的评分函数，同时规范其函数值的取值范围；Step 3, using the cosine similarity to construct the scoring function between the target vector and the candidate entity vector, and at the same time standardize the value range of its function value;

步骤4、利用评分函数构建基于边界的区分候选实体向量集与错误实体向量集的损失函数，然后根据损失函数使得候选实体向量集对目标向量进行统一约束；Step 4. Use the scoring function to construct a boundary-based loss function for distinguishing the candidate entity vector set from the wrong entity vector set, and then make the candidate entity vector set uniformly constrain the target vector according to the loss function;

步骤5、利用优化算法最优化损失函数值，从而使得候选实体向量集的评分函数值接近于1，错误实体向量集的评分函数值接近于0，以学得实体与关系的最佳向量表示，达到优化目标。Step 5. Use the optimization algorithm to optimize the loss function value, so that the scoring function value of the candidate entity vector set is close to 1, and the scoring function value of the wrong entity vector set is close to 0, so as to learn the best vector representation of entities and relationships. reach the optimization goal.

上述步骤2的具体子步骤如下：The specific sub-steps of the above step 2 are as follows:

步骤2.1、从知识图谱的三元组集中随机选择一个三元组；Step 2.1, randomly select a triplet from the triplet set in the knowledge map;

步骤2.2、在三元组集中找到同时匹配所选择的三元组的头实体向量和关系向量的所有尾实体向量，并将所找出的尾实体向量形成候选尾实体向量集；同时，在三元组集中找到同时匹配所选择的三元组的尾实体向量和关系向量的所有头实体向量，并将所找出的头实体向量形成候选头实体向量集；Step 2.2, find all tail entity vectors that match the head entity vector and relation vector of the selected triple in the triple set, and form the candidate tail entity vector set with the found tail entity vectors; meanwhile, in the three Find all head entity vectors that match the tail entity vector and relation vector of the selected triple in the tuple set, and form the candidate head entity vector set with the found head entity vectors;

步骤2.3、对候选尾实体向量集进行随机替换操作，生成错误尾实体向量集；同时，对候选头实体向量集进行随机替换操作，生成错误头实体向量集。Step 2.3, performing random replacement operation on the candidate tail entity vector set to generate the error tail entity vector set; at the same time, performing random replacement operation on the candidate head entity vector set to generate the error head entity vector set.

上述步骤3中所构造的评分函数如下：The scoring function constructed in the above step 3 is as follows:

候选尾实体向量的评分函数f_t(g_t,t)为：The scoring function f _t (g _t ,t) of the candidate tail entity vector is:

候选头实体向量的评分函数f_h(g_h,h)为：The scoring function f _h (g _h ,h) of the candidate head entity vector is:

错误尾实体向量的评分函数f_t′(g_t,t′)为：The scoring function f _t ′(g _t ,t′) of the wrong tail entity vector is:

错误头实体向量的评分函数f′_h(g_h,h′)为：The scoring function f′ _h (g _h ,h′) of the wrong head entity vector is:

上述各式中，α是评分函数值范围规范参数；g_t是尾实体的目标向量，g_t＝h₀+r₀；g_h是头实体的目标向量，g_h＝t₀-r₀；h₀是所选择的三元组的头实体向量，t₀是所选择的三元组的尾实体向量，r₀是所选择的三元组的关系向量；t是候选尾实体向量，h是候选头实体向量，t′是错误尾实体向量，h′是错误头实体向量。Among the above formulas, α is the scoring function value range specification parameter; g _t is the target vector of the tail entity, g _t =h ₀ +r ₀ ; g _h is the target vector of the head entity, g _h =t ₀ -r ₀ ; h ₀ is the head entity vector of the selected triple, t ₀ is the tail entity vector of the selected triple, r ₀ is the relation vector of the selected triple; t is the candidate tail entity vector, h is Candidate head entity vector, t' is the error tail entity vector, h' is the error head entity vector.

上述评分函数值范围规范参数α∈[0,1]。The above scoring function value range specification parameter α∈[0,1].

上述步骤4中所构建的损失函数L为：The loss function L constructed in step 4 above is:

式中，γ为设定的边界值；f_t′(g_t,t′)表示错误尾实体向量的评分函数，f_t(g_t,t)表示候选尾实体向量的评分函数，f′_h(g_h,h′)表示错误头实体向量的评分函数，f_h(g_h,h)表示候选头实体向量的评分函数；g_t是尾实体的目标向量，g_t＝h₀+r₀；g_h是头实体的目标向量，g_h＝t₀-r₀；h₀是所选择的三元组的头实体向量，t₀是所选择的三元组的尾实体向量，r₀是所选择的三元组的关系向量；t为候选尾实体向量，t_c为候选尾实体向量集，t′为错误尾实体向量，t′_c为错误尾实体向量集，h为候选头实体向量，h_c为候选头实体向量集，h′为错误头实体向量，h′_c为错误头实体向量集。In the formula, γ is the set boundary value; f _t ′(g _t ,t′) represents the scoring function of the wrong tail entity vector, f _t (g _t ,t) represents the scoring function of the candidate tail entity vector, f′ _h (g _h , h′) represents the scoring function of the wrong head entity vector, f _h (g _h , h) represents the scoring function of the candidate head entity vector; g _t is the target vector of the tail entity, g _t = h ₀ + r ₀ ; g _h is the target vector of the head entity, g _h =t ₀ -r ₀ ; h ₀ is the head entity vector of the selected triple, t ₀ is the tail entity vector of the selected triple, r ₀ is The relationship vector of the selected triplet; t is the candidate tail entity vector, t _c is the candidate tail entity vector set, t′ is the wrong tail entity vector, t′ _c is the wrong tail entity vector set, h is the candidate head entity vector , h _c is the candidate header entity vector set, h′ is the error header entity vector, h′ _c is the error header entity vector set.

上述步骤5中所述优化算法为随机梯度下降算法。The optimization algorithm described in step 5 above is a stochastic gradient descent algorithm.

与现有技术相比，本发明具有如下特点：Compared with prior art, the present invention has following characteristics:

第一，提出了一种候选实体统计规则，统计获得相关关系的候选实体向量集；First, a candidate entity statistics rule is proposed to obtain the candidate entity vector set of correlation relationship;

第二，通过引入两个向量之间的余弦相似度，计算目标向量与候选实体向量之间余弦值作为衡量两个个体之间的差异大小，相比欧式距离，不是简单计算两个向量之间的距离，余弦相似度更加注重两个向量在方向上的差异；这样解决了现有模型在处理1-N、N-1、N-N等复杂关系时的不足，丰富了实体与关系的表达能力，整体上提升了模型性能；Second, by introducing the cosine similarity between the two vectors, the cosine value between the target vector and the candidate entity vector is calculated as a measure of the difference between the two individuals. Compared with the Euclidean distance, it is not a simple calculation between the two vectors The cosine similarity pays more attention to the difference in the direction of the two vectors; this solves the shortcomings of the existing models when dealing with complex relationships such as 1-N, N-1, and N-N, and enriches the expressive ability of entities and relationships. Overall improved model performance;

第三，通过联合相关关系的所有候选实体与目标向量形成统一约束，提高候选实体向量与目标向量交互性。Thirdly, by uniting all candidate entities and target vectors of related relations to form a unified constraint, the interaction between candidate entity vectors and target vectors is improved.

附图说明Description of drawings

图1本发明一种基于余弦度量规则的知识图谱表示学习方法的流程图。Fig. 1 is a flow chart of a knowledge map representation learning method based on the cosine metric rule of the present invention.

图2知识图谱中实体与关系三元组的示例图。Figure 2 An example diagram of entity and relation triples in the knowledge graph.

图3本发明一种基于余弦度量规则的知识图谱表示学习方法的训练目标示例图，其中(a)为训练前，(b)为训练后。Fig. 3 is an example diagram of the training target of a knowledge map representation learning method based on the cosine metric rule of the present invention, wherein (a) is before training, and (b) is after training.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚明白，以下结合具体实例，并参照附图，对本发明进一步详细说明。In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in combination with specific examples and with reference to the accompanying drawings.

本发明公开了一种基于余弦度量规则的知识图谱表示学习方法，如图1所示，首先将知识图谱中的实体与关系随机嵌入到两个向量空间；其次利用候选实体统计规则，统计相关关系对应的三元组集与候选实体向量集，并根据候选实体向量集随机生成错误实体向量集；再次利用余弦相似度构造目标向量与候选实体的评分函数，对候选实体进行评价；最后利用损失函数将所有相关关系的候选实体向量与目标向量进行统一训练，并通过随机梯度下降算法最小化损失函数。当达到优化目标时，即可获得知识图谱中每个实体向量和关系向量的最佳表示，从而更好的表示实体与关系之间的联系，并能够很好的应用于大规模的知识图谱补全当中。本发明通过余弦相似度与统一训练，很好的克服了现有模型无法很好的处理1-N、N-1、N-N的复杂关系问题，具有较强的可行性与良好的实用性。The present invention discloses a knowledge map representation learning method based on the cosine metric rule. As shown in Figure 1, firstly, the entities and relations in the knowledge map are randomly embedded into two vector spaces; secondly, the statistical rules of the candidate entities are used to count the relevant relations The corresponding triplet set and the candidate entity vector set, and randomly generate the wrong entity vector set according to the candidate entity vector set; again use the cosine similarity to construct the scoring function between the target vector and the candidate entity, and evaluate the candidate entity; finally use the loss function The candidate entity vectors and target vectors of all relevant relations are uniformly trained, and the loss function is minimized by the stochastic gradient descent algorithm. When the optimization goal is achieved, the best representation of each entity vector and relationship vector in the knowledge graph can be obtained, so as to better represent the connection between the entity and the relationship, and can be well applied to large-scale knowledge graph complementation. All of them. Through the cosine similarity and unified training, the present invention well overcomes the problem that the existing models cannot deal with the complex relationship of 1-N, N-1, and N-N well, and has strong feasibility and good practicability.

本发明对知识图谱中的三元组结构化信息进行了考虑，并采用(head，relation，tail)的典型三元组形式进行知识表示，relation是用来连接实体head、tail，并刻画两个实体之间的关联性。图2是典型的知识图谱三元组结构的示例图。其中，圆圈表示为实体节点(如“唐**”、“伊**”、“蒂**”等)，两个实体之间的连边表示关系(如“国籍”、“总统”、“女儿”等)。另外可以看到实体“唐**”与“美国”之间有多种关系，并且存在“女儿”、“国籍”关系对应了多个实体对。The present invention considers the triplet structured information in the knowledge map, and uses the typical triplet form of (head, relation, tail) to represent the knowledge. The relation is used to connect the entity head and tail, and describe two Relationships between entities. Figure 2 is an example diagram of a typical knowledge graph triplet structure. Among them, the circle represents the entity node (such as "Tang**", "Yi**", "Ti**", etc.), and the connection between two entities represents the relationship (such as "nationality", "president", " daughter", etc.). In addition, it can be seen that there are multiple relationships between the entity "Tang **" and "United States", and there are multiple entity pairs corresponding to the "daughter" and "nationality" relationships.

一种基于余弦度量规则的知识图谱表示学习方法，以图2知识图谱作为训练集，图2中包括5个实体，4个关系，8个三元组。方法的具体实施包括如下步骤：A knowledge graph representation learning method based on the cosine metric rule, using the knowledge graph in Figure 2 as the training set, which includes 5 entities, 4 relationships, and 8 triples. The concrete implementation of method comprises the following steps:

步骤1、利用向量随机生成方法将知识图谱中的实体和关系分别嵌入到两个向量空间；Step 1. Use the vector random generation method to embed the entities and relationships in the knowledge map into two vector spaces respectively;

步骤11、将图2知识图谱的实体集{“唐**”，“玛*”“蒂**”，“伊**”，“美国”}表示为{e₁,e₂,e₃,e₄,e₅}与关系集{“妻子”，“女儿”，“总统”，“国籍”}表示为{r₁,r₂,r₃,r₄}。图2知识图谱中包括三元组集{(e₁,r₁,e₂)，(e₁,r₂,e₃)，(e₁,r₂,e₄)，(e₁,r₃,e₅)，(e₂,r₂,e₃)，(e₃,r₃,e₅)，(e₄,r₃,e₅)，(e₅,r₄,e₁)}；Step 11. Express the entity set {"Tang**", "Ma*", "Ti**", "伊**", "United States"} of the knowledge graph in Figure 2 as {e ₁ , e ₂ , e ₃ , e ₄ , e ₅ } and the relation set {"wife", "daughter", "president", "nationality"} are expressed as {r ₁ ,r ₂ ,r ₃ ,r ₄ }. The knowledge map in Figure 2 includes triple sets {(e ₁ ,r ₁ ,e ₂ ), (e ₁ ,r ₂ ,e ₃ ), (e ₁ ,r ₂ ,e ₄ ), (e ₁ ,r ₃ ,e ₅ ),(e ₂ ,r ₂ ,e ₃ ),(e ₃ ,r ₃ ,e ₅ ),(e ₄ ,r ₃ ,e ₅ ),(e ₅ ,r ₄ ,e ₁ )};

步骤12、将他们分别嵌入到实体向量空间与关系向量空间，得到实体向量{e₁,e₂,e₃,e₄,e₅}与关系向量{r₁,r₂,r₃,r₄}。Step 12. Embed them into the entity vector space and relation vector space respectively to obtain entity vectors {e ₁ , e ₂ , e ₃ , e ₄ , e ₅ } and relation vectors {r ₁ , r ₂ , r ₃ , r ₄ }.

步骤2、利用候选实体统计规则，获得相关关系的三元组集与候选实体向量集，同时随机生成错误实体向量集。Step 2, using the statistical rules of candidate entities to obtain the triplet set of related relations and the vector set of candidate entities, and at the same time randomly generate the set of wrong entity vectors.

候选实体统计规则如下：The statistical rules for candidate entities are as follows:

第一步，从知识图谱中随机抽取一个三元组(如，(h₀,r₀,t₀))；In the first step, a triplet is randomly extracted from the knowledge graph (eg, (h ₀ ,r ₀ ,t ₀ ));

步骤21、通过随机抽取获得一个三元组(如，(唐**，女儿，蒂**)即(e₁,r₂,e₃))；Step 21, obtain a triplet by random extraction (eg, (Tang **, daughter, Ti **) namely (e ₁ , r ₂ , e ₃ ));

第二步，分别匹配出训练集中包含(h₀,r₀,？)与(？,r₀,t₀)的三元组集(h₀,r₀,t_c)与(h_c,r₀,t₀)，其中是以g_t＝h₀+r₀为目标向量的所有候选尾实体集，包含|t_c|个不同的尾实体数量，是以g_h＝t₀-r₀目标向量的所有候选头实体集，包含|h_c|个不同的头实体数量；The second step is to match the triplet sets (h ₀ _, r ₀ _, t _c ) and ₍ _{h c} _, r ₀ ,t ₀ ), where is all candidate tail entity sets with g _t = h ₀ +r ₀ as the target vector, including |t _c | different numbers of tail entities, is all candidate head entity sets of g _h =t ₀ -r ₀ target vectors, including |h _c | different numbers of head entities;

步骤22、根据((e₁,r₂,？)匹配出三元组集{(e₁,r₂,e₃)，(e₁,r₂,e₄)}，同时获得候选尾实体集{e₃,e₄}；Step 22. According to ((e ₁ ,r ₂ ,?), match the triple set {(e ₁ ,r ₂ ,e ₃ ), (e ₁ ,r ₂ ,e ₄ )}, and obtain the candidate tail entity set at the same time {e ₃ ,e ₄ };

步骤23、根据(？,r₂,e₃)匹配出三元组集{(e₁,r₂,e₃)，(e₂,r₂,e₃)}，同时获得候选头实体集{e₁,e₂}；Step 23. According to (?, r ₂ , e ₃ ), match the triplet set {(e ₁ , r ₂ , e ₃ ), (e ₂ , r ₂ , e ₃ )}, and obtain the candidate head entity set { e ₁ ,e ₂ };

第三步，随机生成候选实体集对应的错误实体集。在错误实体集的生成过程中，会用生成错误尾实体与候选实体集中的候选实体进行比对，仅当不存在于候选实体集中的错误尾实体会归入错误实体集。The third step is to randomly generate the error entity set corresponding to the candidate entity set. In the process of generating the error entity set, the generated error tail entity will be compared with the candidate entity in the candidate entity set, and only the error tail entity that does not exist in the candidate entity set will be included in the error entity set.

步骤24、随机生成错误尾实体集{e₅,e₂}与错误头实体集{e₄,e₅}。Step 24: Randomly generate an error tail entity set {e ₅ , e ₂ } and an error head entity set {e ₄ , e ₅ }.

步骤3、利用余弦相似度构造目标向量与候选实体向量之间的评分函数，同时规范其函数值的取值范围，一般为[0,1]。Step 3. Using the cosine similarity to construct the scoring function between the target vector and the candidate entity vector, and at the same time standardize the value range of the function value, which is generally [0,1].

步骤31、利用余弦值公式，将向量a与向量b的余弦值cos＜a,b＞表示为：Step 31. Using the cosine value formula, express the cosine value cos<a, b> of vector a and vector b as:

步骤32、基于上述余弦公式，构造评分函数：Step 32. Construct a scoring function based on the above cosine formula:

根据目标向量与候选尾实体向量的余弦相似度的候选尾实体的评分函数f_t(g_t,t)构造如下：The scoring function f _t (g _t ,t) of the candidate tail entity according to the cosine similarity between the target vector and the candidate tail entity vector is constructed as follows:

根据目标向量与候选头实体向量的余弦相似度，构造候选头实体的评分函数f_h(g_h,h)如下：According to the cosine similarity between the target vector and the candidate head entity vector, construct the scoring function f _h (g _h ,h) of the candidate head entity as follows:

其中，α∈[0,1]是评分函数值范围规范参数，g_t是尾实体的目标向量，g_t＝h₀+r₀，g_h是头实体的目标向量，g_h＝t₀-r₀，t是候选尾实体向量，h是候选头实体向量。Among them, α∈[0,1] is the specification parameter of the scoring function value range, g _t is the target vector of the tail entity, g _t =h ₀ +r ₀ , g _h is the target vector of the head entity, g _h =t ₀ − r ₀ , t is the candidate tail entity vector, h is the candidate head entity vector.

根据目标向量与错误尾实体向量的余弦相似度，构造错误尾实体的评分函数f_t′(g_t,t′)如下：According to the cosine similarity between the target vector and the wrong tail entity vector, the scoring function f _t ′(g _t ,t′) of the wrong tail entity is constructed as follows:

根据目标向量与错误头实体向量的余弦相似度，构造错误头实体的评分函数f′_h(g_h,h′)如下：According to the cosine similarity between the target vector and the error head entity vector, the scoring function f′ _h (g _h ,h′) of the error head entity is constructed as follows:

其中，α∈[0,1]是评分函数值范围规范参数，g_t是尾实体的目标向量，g_t＝h₀+r₀，g_h是头实体的目标向量，g_h＝t₀-r₀，t′是错误尾实体向量，h′是错误头实体向量。Among them, α∈[0,1] is the specification parameter of the scoring function value range, g _t is the target vector of the tail entity, g _t =h ₀ +r ₀ , g _h is the target vector of the head entity, g _h =t ₀ − r ₀ , t' is the error tail entity vector, h' is the error head entity vector.

步骤41、基于边界的损失函数构造如下：Step 41, the boundary-based loss function is constructed as follows:

其中，[γ+f′-f]₊＝max(0,γ+f′-f)；γ为设定的边界值；f_t′(g_t,t′)表示错误尾实体向量的评分函数，f_t(g_t,t)表示候选尾实体向量的评分函数，f′_h(g_h,h′)表示错误头实体向量的评分函数，f_h(g_h,h)表示候选头实体向量的评分函数；g_t是尾实体的目标向量，g_t＝h₀+r₀；g_h是头实体的目标向量，g_h＝t₀-r₀；h₀是所选择的三元组的头实体向量，t₀是所选择的三元组的尾实体向量，r₀是所选择的三元组的关系向量；t为候选尾实体向量，t_c为候选尾实体向量集，t′为错误尾实体向量，t′_c为错误尾实体向量集，h为候选头实体向量，h_c为候选头实体向量集，h′为错误头实体向量，h′_c为错误头实体向量集。Among them, [γ+f′-f] ₊ ＝max(0,γ+f′-f); γ is the set boundary value; f _t ′(g _t ,t′) represents the scoring function of the wrong tail entity vector , f _t (g _t ,t) represents the scoring function of the candidate tail entity vector, f′ _h (g _h ,h′) represents the scoring function of the wrong head entity vector, f _h (g _h ,h) represents the candidate head entity vector scoring function; g _t is the target vector of the tail entity, g _t = h ₀ + r ₀ ; g _h is the target vector of the head entity, g _h = t ₀ -r ₀ ; h ₀ is the selected triplet head entity vector, t ₀ is the tail entity vector of the selected triple, r ₀ is the relationship vector of the selected triple; t is the candidate tail entity vector, t _c is the candidate tail entity vector set, and t′ is Error tail entity vector, t′ _c is the error tail entity vector set, h is the candidate head entity vector, h _c is the candidate head entity vector set, h′ is the error head entity vector, h′ _c is the error head entity vector set.

步骤42、将步骤22中获得的三元组集{(e₁,r₂,e₃)，(e₂,r₂,e₃)}与候选尾实体集{e₃,e₄}，分别通过步骤32中的评分函数计算得分；Step 42. Combine the triplet set {(e ₁ , r ₂ , e ₃ ), (e ₂ , r ₂ , e ₃ )} obtained in step 22 and the candidate tail entity set {e ₃ , e ₄ }, respectively Calculate the score by the scoring function in step 32;

步骤421、(e₁,r₂,e₃)得分为Step 421, (e ₁ , r ₂ , e ₃ ) score is

其对应的错误三元组(e₁,r₂,e₅)得分为Its corresponding error triplet (e ₁ ,r ₂ ,e ₅ ) score is

步骤422、(e₁,r₂,e₄)得分为Step 422, (e ₁ , r ₂ , e ₄ ) score is

其对应的错误三元组(e₁,r₂,e₂)得分为Its corresponding error triplet (e ₁ , r ₂ , e ₂ ) score is

步骤43、将步骤22中获得的三元组集{(e₁,r₂,e₃)，(e₂,r₂,e₃)}与候选头实体集{e₁,e₂}，分别通过步骤32中的评分函数计算得分；Step 43. Combine the triplet set {(e ₁ , r ₂ , e ₃ ), (e ₂ , r ₂ , e ₃ )} obtained in step 22 with the candidate head entity set {e ₁ , e ₂ }, respectively Calculate the score by the scoring function in step 32;

步骤431、(e₁,r₂,e₃)得分为Step 431, (e ₁ , r ₂ , e ₃ ) score is

其对应的错误三元组(e₄,r₂,e₃)得分为Its corresponding error triplet (e ₄ , r ₂ , e ₃ ) score is

步骤432、(e₂,r₂,e₃)得分为Step 432, (e ₂ , r ₂ , e ₃ ) score is

其对应的错误三元组(e₅,r₂,e₂)得分为The corresponding error triplet (e ₅ ,r ₂ ,e ₂ ) score is

步骤44、将步骤42与步骤43中计算的得分代入损失函数L中，得到损失函数值。通过基于边界的损失函数将相关关系的所有候选实体向量与目标向量形成统一约束。Step 44. Substituting the scores calculated in steps 42 and 43 into the loss function L to obtain the loss function value. All candidate entity vectors and target vectors of related relations are uniformly constrained by a boundary-based loss function.

步骤5、利用优化算法最优化损失函数值，从而使得候选实体向量集的评分函数值接近于1，错误实体向量集的评分函数值接近于0，以学的实体与关系的最佳向量表示，达到优化目标。Step 5, use the optimization algorithm to optimize the loss function value, so that the scoring function value of the candidate entity vector set is close to 1, and the scoring function value of the wrong entity vector set is close to 0, represented by the best vector of the learned entity and relationship, reach the optimization goal.

步骤51、优化算法将采用随机梯度下降算法最小化损失函数，获得所有候选实体向量的最佳得分，从而学的实体与关系的最佳向量表示，达到优化目标。Step 51, the optimization algorithm will use the stochastic gradient descent algorithm to minimize the loss function, and obtain the best scores of all candidate entity vectors, so as to learn the best vector representation of entities and relationships, and achieve the optimization goal.

本发明知识图谱表示学习方法所采用的翻译原则，参见图3，其基本思想是：先根据三元组(h,r,t)构造训练目标向量g＝h+r(或g＝t-r)，同时获得候选实体向量集，如图3中的{e₁,e₂,e₃,e₄}；其次利用两个向量间的余弦相似度计算得分；最后，利用随机梯度下降算法根据损失函数得分来改变候选实体向量，同时根据优选实体的整体改变方向来改变目标向量。本发明利用两个向量之间的余弦相似度，能够更好的计算目标向量与候选实体向量之间差异大小，相比欧式距离，不是简单计算两个向量之间的距离，余弦相似度更加注重两个向量在方向上的差异。这样解决了现有模型在处理1-N、N-1、N-N等复杂关系时的不足，丰富了实体与关系的表达能力，整体上提升了模型性能。The translation principle adopted by the knowledge map representation learning method of the present invention is shown in Fig. 3, and its basic idea is: first construct the training target vector g=h+r (or g=tr) according to the triplet (h, r, t), At the same time, obtain the set of candidate entity vectors, such as {e ₁ , e ₂ , e ₃ , e ₄ } in Figure 3; secondly, use the cosine similarity between the two vectors to calculate the score; finally, use the stochastic gradient descent algorithm to score according to the loss function to change the candidate entity vector, while changing the target vector according to the overall change direction of the preferred entity. The present invention uses the cosine similarity between two vectors to better calculate the difference between the target vector and the candidate entity vector. Compared with the Euclidean distance, it is not a simple calculation of the distance between two vectors, and the cosine similarity pays more attention to The difference in direction of the two vectors. This solves the shortcomings of existing models when dealing with complex relationships such as 1-N, N-1, and NN, enriches the expressive ability of entities and relationships, and improves the performance of the model as a whole.

本发明利用两个向量之间的余弦相似度，能够更好的计算目标向量与候选实体向量之间的相似性。采用实体向量与关系向量基于嵌入的模型，利用候选实体统计规则，获得相关关系的三元组集与候选实体向量集；并且引入目标向量与候选实体向量的余弦相似度，增强了模型对1-N、N-1、N-N等复杂关系的表达能力，同时构造了目标向量与候选实体向量独有的评分函数。最后构造了新的损失函数，通过随即梯度下降算法优化损失函数，当达到最佳优化目标时，就能够获得知识图谱中每一个最优的实体向量和关系向量，从而更好的将实体和关系进行表示，并保存实体和关系之间的联系，从而能够很好的应用于大规模的知识图谱补全当中。The present invention uses the cosine similarity between two vectors to better calculate the similarity between the target vector and the candidate entity vector. Using the embedding-based model of entity vectors and relationship vectors, using the statistical rules of candidate entities, the triple set of related relations and the set of candidate entity vectors are obtained; and the cosine similarity between the target vector and candidate entity vectors is introduced to enhance the model pair 1- The ability to express complex relationships such as N, N-1, and N-N, and construct a unique scoring function for target vectors and candidate entity vectors. Finally, a new loss function is constructed, and the loss function is optimized through the random gradient descent algorithm. When the optimal optimization goal is reached, each optimal entity vector and relationship vector in the knowledge map can be obtained, so as to better integrate entities and relationships. Represent and save the connection between entities and relationships, so that it can be well applied to large-scale knowledge map completion.

需要说明的是，尽管以上本发明所述的实施例是说明性的，但这并非是对本发明的限制，因此本发明并不局限于上述具体实施方式中。在不脱离本发明原理的情况下，凡是本领域技术人员在本发明的启示下获得的其它实施方式，均视为在本发明的保护之内。It should be noted that although the above-mentioned embodiments of the present invention are illustrative, they are not intended to limit the present invention, so the present invention is not limited to the above specific implementation manners. Without departing from the principles of the present invention, all other implementations obtained by those skilled in the art under the inspiration of the present invention are deemed to be within the protection of the present invention.

Claims

1. A knowledge map representation learning method based on the cosine metric rule, is characterized in that, comprises steps as follows:

Step 1. Use the vector random generation method to embed the entity set and relation set in the knowledge map into the entity vector space and the relation vector space respectively to obtain the entity vector and relation vector;

Step 2, using the statistical rules of candidate entities to obtain a set of candidate entity vectors for randomly selected triples, and randomly generate a set of wrong entity vectors according to the set of candidate entity vectors;

Step 3, using the cosine similarity to construct the scoring function between the target vector and the candidate entity vector, and at the same time standardize the value range of its function value;

Step 4. Use the scoring function to construct a boundary-based loss function for distinguishing the candidate entity vector set from the wrong entity vector set, and then make the candidate entity vector set uniformly constrain the target vector according to the loss function;

Step 5. Use the optimization algorithm to optimize the loss function value, so that the scoring function value of the candidate entity vector set is close to 1, and the scoring function value of the wrong entity vector set is close to 0, so as to learn the best vector representation of entities and relationships. reach the optimization goal.

2. a kind of knowledge map representation learning method based on cosine metric rule according to claim 1, it is characterized in that, the specific sub-steps of step 2 are as follows:

Step 2.1, randomly select a triplet from the triplet set in the knowledge graph;

Step 2.2, find all tail entity vectors that match the head entity vector and relation vector of the selected triple in the triple set, and form the candidate tail entity vector set with the found tail entity vectors; meanwhile, in the three Find all head entity vectors that match the tail entity vector and relation vector of the selected triple in the tuple set, and form the candidate head entity vector set with the found head entity vectors;

Step 2.3, performing random replacement operation on the candidate tail entity vector set to generate the error tail entity vector set; at the same time, performing random replacement operation on the candidate head entity vector set to generate the error head entity vector set.

3. A kind of knowledge map representation learning method based on cosine metric rule according to claim 1 or 2, it is characterized in that, the scoring function constructed in step 3 is as follows:

The scoring function f _t (g _t ,t) of the candidate tail entity vector is:

<mrow><msub><mi>f</mi><mi>t</mi></msub><mrow><mo>(</mo><msub><mi>g</mi><mi>t</mi></msub><mo>,</mo><mi>t</mi><mo>)</mo></mrow><mo>=</mo><mi>&alpha;</mi><mo>+</mo><mrow><mo>(</mo><mn>1</mn><mo>-</mo><mi>&alpha;</mi><mo>)</mo></mrow><mfrac><mrow><msub><mi>g</mi><mi>t</mi></msub><mo>&CenterDot;</mo><mi>t</mi></mrow><mrow><msqrt><msub><mi>g</mi><mi>t</mi></msub></msqrt><mo>*</mo><msqrt><mi>t</mi></msqrt></mrow></mfrac><mo>,</mo></mrow>

The scoring function f _h (g _h ,h) of the candidate head entity vector is:

<mrow><msub><mi>f</mi><mi>h</mi></msub><mrow><mo>(</mo><msub><mi>g</mi><mi>h</mi></msub><mo>,</mo><mi>h</mi><mo>)</mo></mrow><mo>=</mo><mi>&alpha;</mi><mo>+</mo><mrow><mo>(</mo><mn>1</mn><mo>-</mo><mi>&alpha;</mi><mo>)</mo></mrow><mfrac><mrow><msub><mi>g</mi><mi>h</mi></msub><mo>&CenterDot;</mo><mi>h</mi></mrow><mrow><msqrt><msub><mi>g</mi><mi>h</mi></msub></msqrt><mo>*</mo><msqrt><mi>h</mi></msqrt></mrow></mfrac><mo>,</mo></mrow>

The scoring function f _t ′(g _t ,t′) of the wrong tail entity vector is:

<mrow><msubsup><mi>f</mi><mi>t</mi><mo>&prime;</mo></msubsup><mrow><mo>(</mo><msub><mi>g</mi><mi>t</mi></msub><mo>,</mo><msup><mi>t</mi><mo>&prime;</mo></msup><mo>)</mo></mrow><mo>=</mo><mi>&alpha;</mi><mo>+</mo><mrow><mo>(</mo><mn>1</mn><mo>-</mo><mi>&alpha;</mi><mo>)</mo></mrow><mfrac><mrow><msub><mi>g</mi><mi>t</mi></msub><mo>&CenterDot;</mo><msup><mi>t</mi><mo>&prime;</mo></msup></mrow><mrow><msqrt><msub><mi>g</mi><mi>t</mi></msub></msqrt><mo>*</mo><msqrt><msup><mi>t</mi><mo>&prime;</mo></msup></msqrt></mrow></mfrac><mo>,</mo></mrow>

The scoring function f _h ′(g _h ,h′) of the error head entity vector is:

<mrow><msubsup><mi>f</mi><mi>h</mi><mo>&prime;</mo></msubsup><mrow><mo>(</mo><msub><mi>g</mi><mi>h</mi></msub><mo>,</mo><msup><mi>h</mi><mo>&prime;</mo></msup><mo>)</mo></mrow><mo>=</mo><mi>&alpha;</mi><mo>+</mo><mrow><mo>(</mo><mn>1</mn><mo>-</mo><mi>&alpha;</mi><mo>)</mo></mrow><mfrac><mrow><msub><mi>g</mi><mi>h</mi></msub><mo>&CenterDot;</mo><msup><mi>h</mi><mo>&prime;</mo></msup></mrow><mrow><msqrt><msub><mi>g</mi><mi>h</mi></msub></msqrt><mo>*</mo><msqrt><msup><mi>h</mi><mo>&prime;</mo></msup></msqrt></mrow></mfrac><mo>,</mo></mrow>

Among the above formulas, α is the scoring function value range specification parameter; g _t is the target vector of the tail entity, g _t =h ₀ +r ₀ ; g _h is the target vector of the head entity, g _h =t ₀ -r ₀ ; h ₀ is the head entity vector of the selected triple, t ₀ is the tail entity vector of the selected triple, r ₀ is the relation vector of the selected triple; t is the candidate tail entity vector, h is Candidate head entity vector, t' is the error tail entity vector, h' is the error head entity vector.

4. A knowledge map representation learning method based on the cosine metric rule according to claim 3, characterized in that the scoring function value range specification parameter α∈[0,1].

5. A kind of knowledge map representation learning method based on cosine metric rule according to claim 1, it is characterized in that, the loss function L constructed in step 4 is:

<mrow><mi>L</mi><mo>=</mo><munder><mo>&Sigma;</mo><mrow><mi>t</mi><mo>&Element;</mo><msub><mi>t</mi><mi>c</mi></msub><mo>,</mo><msup><mi>t</mi><mo>&prime;</mo></msup><mo>&Element;</mo><msubsup><mi>t</mi><mi>c</mi><mo>&prime;</mo></msubsup></mrow></munder><msub><mrow><mo>&lsqb;</mo><mi>&gamma;</mi><mo>+</mo><msubsup><mi>f</mi><mi>t</mi><mo>&prime;</mo></msubsup><mrow><mo>(</mo><msub><mi>g</mi><mi>t</mi></msub><mo>,</mo><msup><mi>t</mi><mo>&prime;</mo></msup><mo>)</mo></mrow><mo>-</mo><msub><mi>f</mi><mi>t</mi></msub><mrow><mo>(</mo><msub><mi>g</mi><mi>t</mi></msub><mo>,</mo><mi>t</mi><mo>)</mo></mrow><mo>&rsqb;</mo></mrow><mo>+</mo></msub><mo>+</mo><munder><mo>&Sigma;</mo><mrow><mi>h</mi><mo>&Element;</mo><msub><mi>h</mi><mi>c</mi></msub><mo>,</mo><msup><mi>h</mi><mo>&prime;</mo></msup><mo>&Element;</mo><msubsup><mi>h</mi><mi>c</mi><mo>&prime;</mo></msubsup></mrow></munder><msub><mrow><mo>&lsqb;</mo><mi>&gamma;</mi><mo>+</mo><msubsup><mi>f</mi><mi>h</mi><mo>&prime;</mo></msubsup><mrow><mo>(</mo><msub><mi>g</mi><mi>h</mi></msub><mo>,</mo><msup><mi>h</mi><mo>&prime;</mo></msup><mo>)</mo></mrow><mo>-</mo><msub><mi>f</mi><mi>h</mi></msub><mrow><mo>(</mo><msub><mi>g</mi><mi>h</mi></msub><mo>,</mo><mi>h</mi><mo>)</mo></mrow><mo>&rsqb;</mo></mrow><mo>+</mo></msub></mrow>

In the formula, γ is the set boundary value; f _t ′(g _t ,t′) represents the scoring function of the wrong tail entity vector, f _t (g _t ,t) represents the scoring function of the candidate tail entity vector, f _h ′ (g _h , h′) represents the scoring function of the wrong head entity vector, f _h (g _h , h) represents the scoring function of the candidate head entity vector; g _t is the target vector of the tail entity, g _t = h ₀ +r ₀ ; g _h is the target vector of the head entity, g _h =t ₀ -r ₀ ; h ₀ is the head entity vector of the selected triple, t ₀ is the tail entity vector of the selected triple, r ₀ is The relationship vector of the selected triplet; t is the candidate tail entity vector, t _c is the candidate tail entity vector set, t′ is the wrong tail entity vector, t′ _c is the wrong tail entity vector set, h is the candidate head entity vector , h _c is the candidate header entity vector set, h′ is the error header entity vector, h′ _c is the error header entity vector set.

6. A knowledge map representation learning method based on the cosine metric rule according to claim 1, wherein the optimization algorithm in step 5 is a stochastic gradient descent algorithm.