CN102750315A

CN102750315A - Rapid discovering method of conceptual relations based on sovereignty iterative search

Info

Publication number: CN102750315A
Application number: CN201210125040XA
Authority: CN
Inventors: 张辉; 陈勇; 胡红萍; 马永星
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2012-04-25
Filing date: 2012-04-25
Publication date: 2012-10-24
Anticipated expiration: 2032-04-25
Also published as: CN102750315B

Abstract

The invention discloses a method for quickly discovering concept relationships based on sovereign iterative search, which includes: using Boolean search to eliminate concept pairs that do not contain the same non-zero elements or only contain very few non-zero elements, and narrow down the candidate concept set in magnitude; Use the enumeration method to calculate the concept correlation in the vector space; obtain the most relevant concepts by sorting. The invention integrates the advantages of the method for discovering the conceptual relationship of the Boolean model and the method for discovering the enumerated relationship based on the vector space model, and provides a fast method with time efficiency close to the former and accuracy and recall close to the latter.

Description

A Fast Discovery Method of Conceptual Relations Based on Sovereign Iterative Search

技术领域 technical field

本发明涉及一种概念关系快速发现方法，尤其涉及一种基于主权迭代搜索的概念关系快速发现方法，属于语义网络技术领域。The invention relates to a method for quickly discovering concept relationships, in particular to a method for quickly discovering concept relationships based on sovereign iterative search, and belongs to the technical field of semantic networks.

背景技术 Background technique

在自然语言世界中，概念是对客观实体的抽象描述，是客观实体属性特征的集合。由于客观实体的相互作用，概念之间亦产生千丝万缕的关联，我们称之为概念关系。概念及概念关系共同构成了自然语言世界的基础，如果说自然语言世界是一个语义网络，那么概念就是语义的载体，而概念关系就是语义载体间的纽带。通过研究概念关系可以反射得出客观世界中实体关联的内容与性质，进而为人类的工作和生活服务。In the natural language world, a concept is an abstract description of an objective entity and a collection of attributes and characteristics of an objective entity. Due to the interaction of objective entities, concepts are also inextricably linked, which we call concept relations. Concepts and conceptual relations together form the basis of the natural language world. If the natural language world is a semantic network, then concepts are the carriers of semantics, and concept relations are the links between semantic carriers. Through the study of conceptual relations, the content and nature of entity associations in the objective world can be reflected, and then serve for human work and life.

在当前信息社会中，互联网无疑是数据的最大载体，以超链接关联的超文本信息日益增长，构成了信息网络世界，已经彻底改变了现代人类的工作和生活的方式。然而，亦是因为信息的爆炸式增长，管理和使用信息越来越复杂，越来越困难。为适应语义推理和智能化服务的需求，语义Web为代表的下一代信息互联网络试图在任何微小数据间构建连接，而概念关系正是构建语义网络的基础。因此，概念关系抽取技术是人类信息第二次变革的基础。In the current information society, the Internet is undoubtedly the largest carrier of data. The hypertext information associated with hyperlinks is increasing day by day, forming an information network world, which has completely changed the way modern humans work and live. However, also because of the explosive growth of information, managing and using information is becoming more and more complex and difficult. In order to meet the needs of semantic reasoning and intelligent services, the next-generation information Internet network represented by the Semantic Web tries to build connections between any tiny data, and the conceptual relationship is the basis for building a semantic network. Therefore, concept relationship extraction technology is the basis of the second revolution of human information.

搜索引擎和文本挖掘是平台门户系统的核心技术，而概念和文本的“语义相关度计算”又是搜索引擎和文本挖掘的关键基础。在纯粹的统计模型下，由于缺少知识智能的支持，只能以“相似度”代替“语义相关度”，作为搜索引擎和文本挖掘中解决一系列复杂技术问题的基础。Search engine and text mining are the core technologies of the platform portal system, and the "semantic correlation calculation" of concepts and texts is the key basis of search engine and text mining. Under the pure statistical model, due to the lack of intellectual intelligence support, "similarity" can only be used to replace "semantic relevance" as the basis for solving a series of complex technical problems in search engines and text mining.

但是，随着信息量爆炸增长、信息结构日益复杂化和互联网的社会化趋势，人们对搜索引擎和文本挖掘提出了愈发强烈的搜索智能和个性化服务的需求，此时，语义相关度计算的问题再次置于搜索引擎和文本挖掘技术面前，“相似度”计算无论在准确性还是物理意义上都已经无法满足这种需求，“语义相关度”计算似乎已经成为一个智能搜索时代必须解决的基础问题。However, with the explosive growth of information volume, the increasingly complex information structure and the socialization trend of the Internet, people put forward increasingly strong demands for search intelligence and personalized services for search engines and text mining. At this time, semantic correlation calculation The problem of "similarity" is once again placed in front of search engines and text mining technology. The calculation of "similarity" can no longer meet this demand in terms of accuracy and physical sense. The calculation of "semantic relevance" seems to have become an era that must be solved in the era of intelligent search. basic question.

目前，布尔模型和向量空间模型均在文本分类中得到广泛而有效的应用。布尔搜索可以快速发现概念关系，但是准确性和召回率均无法得到保证，而且用于构造布尔查询的特征元素数目亦难以确定，查询特征过少则会导致查询效率降低，结果数过多，查询特征过多则召回率降低。扩展的布尔搜索可以对查询匹配的特征元素赋予权重，可以一定程度上提高召回率，但是不能解决查询逻辑的构造问题。向量空间模型也存在一些缺点，主要表现为：向量空间的维数往往很高，导致计算量大，影响系统速度。此外，向量中特征权值的确定也是较难考量的一个部分。Currently, Boolean models and vector space models are widely and effectively used in text classification. Boolean search can quickly discover conceptual relationships, but the accuracy and recall rate cannot be guaranteed, and the number of feature elements used to construct Boolean queries is also difficult to determine. Too few query features will lead to low query efficiency and too many results. Too many features reduce the recall rate. The extended Boolean search can give weight to the feature elements matched by the query, which can improve the recall rate to a certain extent, but it cannot solve the problem of query logic construction. The vector space model also has some shortcomings, mainly as follows: the dimensionality of the vector space is often very high, resulting in a large amount of calculation and affecting the system speed. In addition, the determination of the feature weights in the vector is also a difficult part to consider.

发明内容 Contents of the invention

针对现有技术所存在的不足，本发明所要解决的技术问题在于提供一种概念关系快速发现方法。该方法既可保证查询效率，亦可保证准确性和召回率Aiming at the deficiencies in the prior art, the technical problem to be solved by the present invention is to provide a method for quickly discovering concept relationships. This method can not only guarantee the query efficiency, but also ensure the accuracy and recall rate

为实现上述的发明目的，本发明采用下述的技术方案：For realizing above-mentioned purpose of the invention, the present invention adopts following technical scheme:

一种基于主权迭代搜索的概念关系快速发现方法，包括：A rapid discovery method for concept relations based on sovereign iterative search, including:

使用布尔搜索消除不包含相同非零元素或者仅包含极少非零元素的概念对；Use Boolean search to eliminate concept pairs that do not contain the same non-zero elements or contain only very few non-zero elements;

使用枚举法计算向量空间下的概念相关度；Use the enumeration method to calculate the concept correlation in the vector space;

排序求得最相关概念。Sort to find the most relevant concepts.

更进一步地，所述布尔搜索包括：Furthermore, the Boolean search includes:

将概念的语义特征向量转化为布尔表达式，并构建特征向量正向索引和特征倒排索引，使用目标概念的语义特征构造逻辑查询，在逻辑表达式集合中搜索得到目标概念的相关概念集。The semantic feature vector of the concept is converted into a Boolean expression, and the feature vector forward index and feature inverted index are constructed. The semantic feature of the target concept is used to construct a logical query, and the related concept set of the target concept is searched in the logical expression set.

更进一步地所述使用枚举法计算向量空间下的概念相关度步骤包括：Further, the step of using the enumeration method to calculate the concept correlation degree under the vector space includes:

a.搜索器根据特征向量正向索引搜索特征向量；a. The searcher searches the eigenvector according to the forward index of the eigenvector;

b.根据搜索到的特征向量获取主权特征；b. Obtain the sovereign feature according to the searched feature vector;

c.搜索器以进行关联搜索，选出候选相关概念，并放入候选相关概念集；c. The searcher is used for association search, selects candidate related concepts, and puts them into the candidate related concept set;

d.候选概念选取策略得到候选概念；d. Candidate concept selection strategy to obtain candidate concepts;

e.候选概念集饱和控制策略搜索到关联特征倒排拉链后，选取候选概念；e. After the candidate concept set saturation control strategy searches for the associated feature inverted zipper, select the candidate concept;

f.判断候选概念集是否已达到饱和，若候选概念集未达到饱和则返回b，重复b～f，直到候选概念集达到饱和；f. Determine whether the candidate concept set has reached saturation, if the candidate concept set has not reached saturation, return to b, and repeat b to f until the candidate concept set is saturated;

f.若候选概念集已达到饱和迭代搜索停止，精确计算目标概念同候选相关概念的相关度。f. If the candidate concept set has reached saturation, the iterative search is stopped, and the correlation between the target concept and the candidate related concepts is accurately calculated.

更进一步地，所述候选概念选取策略包括：Further, the candidate concept selection strategy includes:

a.搜索器使用关联特征为查询键，在特征倒排索引中执行查询，得到候选概念拉链；a. The searcher uses the associated feature as the query key, executes the query in the feature inverted index, and obtains the candidate concept zipper;

b.将窗口为M的滑动窗口置于拉链首部，若元素不多于M，则全部选为候选概念；b. Place the sliding window whose window is M at the head of the zipper, if there are no more than M elements, select all of them as candidate concepts;

c.如果元素多于M，监督器监测监督条件，即将首尾权重减幅和幅度阈值进行对比；c. If there are more elements than M, the supervisor monitors the supervisory conditions, that is, compares the head and tail weight reduction with the amplitude threshold;

d.若首尾权重减幅小于或等于幅度阈值，滑动窗口移动一位并返c，重复c～d，直到首尾权重减幅大于幅度阈值；d. If the decrease of the first and last weights is less than or equal to the amplitude threshold, move the sliding window one bit and return to c, and repeat c to d until the decrease of the first and last weights is greater than the amplitude threshold;

e.若首尾权重减幅大于幅度阈值，则将窗口后的元素剪掉，剩余即选为候选概念。e. If the reduction of the head and tail weights is greater than the magnitude threshold, the elements behind the window are cut off, and the rest are selected as candidate concepts.

更进一步地，所述候选概念集饱和控制策略包括：Further, the candidate concept set saturation control strategy includes:

a.搜索候选概念，若候选概念已存在于候选相关概念集中，对关联特征增量计算“伪相关度”，若候选概念为新增概念使用关联特征计算“伪相关度”；a. Search for candidate concepts, if the candidate concept already exists in the set of candidate related concepts, calculate the "false correlation degree" for the associated feature incrementally, if the candidate concept is a newly added concept, use the associated feature to calculate the "pseudo-related degree";

b.重新调整候选概念拉链顺序；b. Re-adjust the zipper order of candidate concepts;

c.滑动窗口置于拉链首部，判断元素数和窗口的大小，若元素数不大于窗口大小，重新执行候选概念选取策略，若元素数不小于窗口大小，则检查首尾候选概念“伪相关度”减幅，与幅度阈值比较；c. The sliding window is placed at the head of the zipper, and the number of elements and the size of the window are judged. If the number of elements is not greater than the size of the window, re-execute the candidate concept selection strategy. If the number of elements is not less than the size of the window, check the "false correlation" of the first and last candidate concepts Amplitude reduction, compared with the amplitude threshold;

d.若“伪相关度”减幅小于幅度阈值，向拉链尾部滑动窗口进入步骤，若已到达拉链尾部，重新执行候选概念选取策略，否则返回c，重复c～d，直到“伪相关度”减幅不小于幅度阈值；d. If the decrease of "false correlation degree" is less than the magnitude threshold, enter the step of sliding the window to the end of the zipper, if it has reached the end of the zipper, re-execute the candidate concept selection strategy, otherwise return to c, repeat c~d, until the "false correlation degree" The decrease is not less than the magnitude threshold;

e.若“伪相关度”减幅不小于幅度阈值，候选相关概念集达到饱和，停止迭代搜索。e. If the decrease of the "false correlation degree" is not less than the magnitude threshold, the set of candidate related concepts is saturated, and the iterative search is stopped.

更进一步地，所述构建特征向量正向索引包括：Further, the construction of the feature vector forward index includes:

对所有概念进行编号，对所有概念的语义特征按照权重降序，并将特征向量归一化为单位向量；Number all concepts, sort the semantic features of all concepts in descending order of weight, and normalize the feature vectors to unit vectors;

以“概念编号-特征向量”的键值关系构造特征向量正向索引。Construct the feature vector forward index with the key-value relationship of "concept number - feature vector".

更进一步地，所述构建特征倒排索引包括：Further, the construction of the feature inverted index includes:

将每个单位特征向量看作一个文档，以向量特征为倒排词条，索引拉链中的概念编号按照特征权重大小降序排列，构造特征倒排索引。Each unit feature vector is regarded as a document, and the vector feature is used as an inverted entry. The concept numbers in the index zipper are arranged in descending order according to the weight of the feature, and the feature inverted index is constructed.

本发明基于布尔模型的概念关系发现方法和基于向量空间模型的枚举关系发现方法，将两者的优点进行融合，扬长避短，给出一种时间效率接近前者，而准确率和召回率接近后者的快速方法。The present invention combines the advantages of the concept relationship discovery method based on the Boolean model and the enumeration relationship discovery method based on the vector space model, maximizes the strengths and avoids the weaknesses, and provides a time efficiency close to the former, while the accuracy rate and recall rate are close to the latter the quick method.

附图说明 Description of drawings

下面结合附图和具体实施方式对本发明作进一步的详细说明。The present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

图1是特征向量单位归一化及索引结构示意图；Figure 1 is a schematic diagram of unit normalization of feature vectors and index structure;

图2是特征倒排构建及索引结构示意图；Figure 2 is a schematic diagram of feature inversion construction and index structure;

图3是单次迭代搜索的操作流程示意图；Fig. 3 is a schematic diagram of the operation flow of a single iterative search;

图4是基于主权迭代搜索的关系发现流程示意图；Figure 4 is a schematic diagram of the relationship discovery process based on sovereign iterative search;

图5是候选概念选取策略的执行流程示意图；Fig. 5 is a schematic diagram of the execution flow of a candidate concept selection strategy;

图6是候选概念集饱和控制策略的执行流程示意图；Fig. 6 is a schematic diagram of the execution flow of the candidate concept set saturation control strategy;

图7是召回率和时间效率的对比实验总体流程示意图；Figure 7 is a schematic diagram of the overall flow of the comparative experiment of recall rate and time efficiency;

图8是单次搜索概念选取数平均值示意图；Fig. 8 is a schematic diagram of a single search concept selection average value;

图9是饱和控制策略参数调控下的平均召回率示意图；Figure 9 is a schematic diagram of the average recall rate under the regulation of saturation control strategy parameters;

图10是饱和控制策略参数调控下的平均召回率示意图；Figure 10 is a schematic diagram of the average recall rate under the regulation of saturation control strategy parameters;

图11是时间效率和召回率对比示意图。Figure 11 is a schematic diagram of the comparison of time efficiency and recall rate.

具体实施方式 Detailed ways

概念关系发现是指发现语义关系密切的概念关系对，它是概念关系抽取的关键步骤。概念关系发现一般只搜寻和概念关系最为紧密的Top-M个概念。一般情况下，M为自然数且取值不大，在两位数的范围，M值过大就失去了相关概念的意义。Concept relationship discovery refers to the discovery of concept relationship pairs with close semantic relations, which is a key step in concept relationship extraction. Concept relationship discovery generally only searches for the Top-M concepts that are most closely related to concepts. In general, M is a natural number and the value is not large. In the range of two digits, if the value of M is too large, the meaning of related concepts will be lost.

对于一个目标概念，从300万个候选概念中搜寻Top-M(M＜100)个最相关概念，有效率仅为1/10000。如果使用枚举法搜寻最相关概念，则99.99％的计算资源都将被用在计算与非相关概念的相关度上，这是极不划算的。概念语义特征向量的平均元素个数仅为300左右，相对于300万维的概念语义空间来说，特征向量仅有1/10000非零元素，可谓极其稀疏。因此，绝大多数概念对之间并不包含相同的非零元素，即使包含相同的非零元素，大部分概念对仅包含极少数的相同非零元素。For a target concept, searching Top-M (M<100) most relevant concepts from 3 million candidate concepts has an effective rate of only 1/10000. If the enumeration method is used to search for the most relevant concepts, 99.99% of computing resources will be used to calculate the degree of correlation with non-related concepts, which is extremely uneconomical. The average number of elements of the conceptual semantic feature vector is only about 300. Compared with the 3 million-dimensional conceptual semantic space, the feature vector has only 1/10000 non-zero elements, which can be described as extremely sparse. Therefore, most concept pairs do not contain the same non-zero elements, even if they contain the same non-zero elements, most concept pairs only contain a very small number of the same non-zero elements.

本发明使用布尔搜索消除不包含相同非零元素或者仅包含极少非零元素的概念对，在量级上缩小候选概念集，然后，在概念集内使用枚举法计算向量空间下的概念相关度，而后排序求得Top-M最相关概念。则即可保证查询效率，亦可保证准确性和召回率。The present invention uses Boolean search to eliminate concept pairs that do not contain the same non-zero elements or only contain very few non-zero elements, narrows down the candidate concept set in magnitude, and then uses the enumeration method to calculate the concept correlation under the vector space in the concept set degree, and then sorted to obtain the Top-M most relevant concepts. Then the query efficiency can be guaranteed, and the accuracy and recall rate can also be guaranteed.

布尔模型就是采用布尔表达式对文本进行表示。布尔模型在传统的信息检索中有广泛的应用，它通过与用户给出的检索式进行逻辑比较来检索文本，是一种基于关键词的匹配。在标准的布尔模型中，文本如式(1)所示：The Boolean model is to use Boolean expressions to represent the text. The Boolean model is widely used in traditional information retrieval. It retrieves text through logical comparison with the retrieval formula given by the user, which is a keyword-based matching. In the standard Boolean model, the text is shown in equation (1):

d_i＝(w_i1，w_i2，......w_ik......w_in)，其中k＝1，2，...，n (1)d _i =(w _i1 , w _i2 ,... _wik ... _win ), where k=1, 2,..., n (1)

其中n为特征项的个数，w_ik为0或1，表示第k个特征项在文本i中是否出现。布尔模型易于实现，但在文本分类领域，它的准确率和召回率较差。Among them, n is the number of feature items, and w _ik is 0 or 1, indicating whether the kth feature item appears in text i. Boolean models are easy to implement, but in the field of text classification, their precision and recall are poor.

假如，概念共同包含某个或某几个特征是概念间存在紧密语义关系的充分条件，那么，概念关系发现即可转化为布尔模型下的搜索过程。将概念的语义特征向量转化为布尔表达式，并构建倒排索引，使用目标概念的语义特征构造逻辑查询，在逻辑表达式集合中搜索即可得到目标概念的相关概念集。If it is a sufficient condition for a close semantic relationship between concepts to contain one or several features, then the concept relationship discovery can be transformed into a search process under the Boolean model. Transform the semantic feature vector of the concept into a Boolean expression, and build an inverted index, use the semantic feature of the target concept to construct a logical query, and search in the set of logical expressions to get the related concept set of the target concept.

布尔模型将连续值离散化为0或1，仅在存在与否间做抉择，虽然一刀切的做法不甚精准，但是却有着极高的效率。但是，在高维的语义空间中，一个概念的语义性质是特征向量中所有的语义特征共同决定的，而且他们的决定权重大相径庭。因此，基于布尔模型的搜索方法亦不可直接用于海量概念的关系发现。The Boolean model discretizes continuous values into 0 or 1, and only makes a choice between existence or non-existence. Although the one-size-fits-all approach is not very accurate, it has extremely high efficiency. However, in a high-dimensional semantic space, the semantic properties of a concept are jointly determined by all the semantic features in the feature vector, and their decision weights are quite different. Therefore, the search method based on the Boolean model cannot be directly used for the relationship discovery of massive concepts.

相对布尔搜索，缩小候选概念集后再使用枚举法计算向量空间下的相关度，可以提高准确率。但是要保证很好的召回率，则需要尽可能的将相关概念包含在缩小后的候选概念集中，然而，为了保证效率，候选概念集合又不能太大。所以，如何尽量使用布尔逻辑缩小候选概念关系集合而又尽量少损失相关概念是本方法的核心关键问题。Compared with Boolean search, the accuracy can be improved by narrowing down the set of candidate concepts and then using the enumeration method to calculate the correlation in the vector space. But to ensure a good recall rate, it is necessary to include relevant concepts in the reduced candidate concept set as much as possible. However, in order to ensure efficiency, the candidate concept set cannot be too large. Therefore, how to use Boolean logic to narrow down the set of candidate concept relationships while minimizing the loss of related concepts is the core key issue of this method.

向量空间模型是由Gerard Salton和McGill于1969年提出的。因其概念简单，在信息检索(Information Retrieval，简称IR)中得到广泛应用。在文本自动分类中使用这种方法主要是从IR中引进过来的。向量空间模型的基本思想是：将每个词条作为特征空间坐标系的一维，将文本看作特征空间的一个向量，用两个向量之间的夹角来衡量两个文本之间的相似度。在向量空间模型中，每一个文本都被表示为由一组规范化正交词条向量所张成的向量空间的一个点，即形式化为n维空间中的向量，因此我们可以将文本抽象为式(2)所示：The vector space model was proposed by Gerard Salton and McGill in 1969. Because of its simple concept, it is widely used in Information Retrieval (IR). The use of this method in automatic text classification is mainly introduced from IR. The basic idea of the vector space model is: take each entry as one dimension of the feature space coordinate system, regard the text as a vector in the feature space, and use the angle between the two vectors to measure the similarity between the two texts Spend. In the vector space model, each text is represented as a point in the vector space spanned by a set of normalized orthogonal term vectors, that is, it is formalized as a vector in n-dimensional space, so we can abstract the text as Formula (2) shows:

V(d_i)＝((t₁，w_1j)，......(t_i，w_ij)，......(t_n，w_nj))，i＝1，2，......n (2)V(d _i )=((t ₁ ,w _1j ),...(t _i ,w _ij ),...(t _n ,w _nj )), i=1,2, ...... n (2)

其中，t_i是特征项i，w_ij是t_i在文本d_j中的权重。对于一个训练文本集合，我们就可以得到如下表1所示的一个向量空间W：Among them, t _i is feature item i, w _ij is the weight of t _i in text d _j . For a training text set, we can get a vector space W as shown in Table 1 below:

d₁ d ₁ d₂ d ₂ d_m _m t₁ t ₁ w₁₁ w ₁₁ w₁₂ w ₁₂ w_1m w _1m t₂ t ₂ t₂₁ t ₂₁ t₂₂ t ₂₂ t_2m t _2m t_n t _n t_n1 t _n1 t_n2 t _n2 t_nm t _nm

表1向量空间下的文本表示Table 1 Text representation under vector space

W通常是一个稀疏矩阵。训练文本和待分类文本在向量空间模型中都可用相同的方法表示出来。待分类文本的向量越接近训练文本向量，说明其与训练空间中的文本的相似度越大，越有可能和训练文本属于同一个类别。W is usually a sparse matrix. Both the training text and the text to be classified can be expressed in the same way in the vector space model. The closer the vector of the text to be classified is to the vector of the training text, the greater the similarity between it and the text in the training space, and the more likely it belongs to the same category as the training text.

向量空间模型是目前广泛使用的文本表示模型，具有如下优点：The vector space model is a widely used text representation model, which has the following advantages:

(1)文本内容被形式化到多维空间中的一个点，通过向量形式给出，将文本以量的形式定义到了实数域中，提高了自然语言文本的可计算性和可操作性；(1) The text content is formalized to a point in the multi-dimensional space, given in the form of a vector, and the text is defined in the real number field in the form of quantity, which improves the computability and operability of the natural language text;

(2)为词计算权值，通过调节词对应权值的大小来反映词与所在文本的相关程度，克服了传统布尔模型的缺陷；(2) Calculate the weight value for the word, and reflect the degree of relevance between the word and the text by adjusting the weight value corresponding to the word, which overcomes the defects of the traditional Boolean model;

(3)通过计算文本之间的相似度，使属性相似的文本尽量聚拢在一起，以提高匹配效率。(3) By calculating the similarity between the texts, the texts with similar attributes are gathered together as much as possible to improve the matching efficiency.

向量空间模型下通过向量余弦夹角的方法计算最相关概念，在本质上有别于布尔模型。在向量空间模型中，维度之间是离散的，但是在任一维度内部，又是连续的。向量空间模型正是用这种离散和连续结合的方式实现了特征间的共同作用，不绝对依赖某个特征，但是又不放弃任何一个特征。Under the vector space model, the most relevant concept is calculated by the method of vector cosine angle, which is different from the Boolean model in essence. In the vector space model, the dimensions are discrete, but within any dimension, they are continuous. The vector space model uses this combination of discrete and continuous methods to realize the interaction between features, without absolutely relying on a certain feature, but without giving up any feature.

对语料库中的每一个概念，使用表现语义构造法构建语义特征向量，用TF-ODF特征赋权，并基于中文语用进行调权，特征降维之后即可得到概念的解释语义特征向量表示。For each concept in the corpus, use the expressive semantics construction method to construct a semantic feature vector, use TF-ODF feature weighting, and adjust the weight based on Chinese pragmatics. After feature dimensionality reduction, the semantic feature vector representation of the concept can be obtained.

设向量平均长度为n，对特征进行排序后可以在O(n)的时间内计算余弦夹角，采用快速排序时间复杂度为O(nlogn)，故最终时间复杂度为O(nlogn)。Assuming that the average length of the vector is n, after sorting the features, the cosine angle can be calculated in O(n) time, and the time complexity of quick sorting is O(nlogn), so the final time complexity is O(nlogn).

基于解释关系构造概念语义特征向量后，对语义特征赋权，两两计算概念相关度后排序，就可获取任意概念的Top-M相关概念。假设开放百科语料库中的概念总数为N，两两计算时间复杂度为O(N²)。对任一概念使用最大堆搜寻保存当前最大的M个相关度值及对应概念，则计算完所有概念对的相关度后即可得到任一概念的Top-M最相关概念，最大堆每次调整时间消耗为logM。又已知概念相关度计算的平均复杂度为O(nlogn)，则可知使用两两计算概念相关度的方法Top-M相关概念发现的时间复杂度为O(N²nlognlogM)。其中，n为向量平均特征个数，降维后n约为300，而N为开放百科语料库概念总数，约为300万。粗略计算，运算能力在10亿/s的单台计算机需要约3年才可完成，这显然是不可接受的。After constructing the concept semantic feature vector based on the interpretation relationship, the semantic features are given weights, and the concept correlation is calculated in pairs and then sorted to obtain the Top-M related concepts of any concept. Assuming that the total number of concepts in the Open Encyclopedia corpus is N, the time complexity of pairwise calculation is O(N ² ). Use the maximum heap search for any concept to save the current maximum M correlation values and corresponding concepts, then after calculating the correlation of all concept pairs, you can get the Top-M most relevant concepts of any concept, and the maximum heap is adjusted each time The time consumption is logM. It is also known that the average complexity of concept correlation calculation is O(nlogn), then it can be seen that the time complexity of Top-M related concept discovery using the method of pairwise calculation of concept correlation is O(N ² nlognlogM). Among them, n is the average number of features of the vector, and n is about 300 after dimensionality reduction, and N is the total number of concepts in the Open Encyclopedia corpus, which is about 3 million. Roughly calculated, a single computer with a computing capacity of 1 billion/s would take about 3 years to complete, which is obviously unacceptable.

在向量空间模型中，每一维特征项所对应的权值称为特征权重，简称权重。主权特征指语义特征向量中起主导作用特征。统计表明，在80％特征向量中，权重之和的80％集中在前20％的特征中。迭代搜索是在搜索过程中不断调整并维护候选概念集，直到满足一定的条件终止迭代，确定候选概念集合。In the vector space model, the weight corresponding to each dimension feature item is called feature weight, or weight for short. Sovereign feature refers to the dominant feature in the semantic feature vector. Statistics show that in 80% of the feature vectors, 80% of the sum of the weights is concentrated in the top 20% of the features. Iterative search is to continuously adjust and maintain the candidate concept set during the search process until certain conditions are met to terminate the iteration and determine the candidate concept set.

基于主权迭代搜索的概念关系快速发现方法结合布尔模型和向量空间模型，取长补短，利用布尔模型的效率优势进行粗略计算，将候选相关概念限定在小数量级范围，而后再利用向量空间模型的准确性优势进行细致计算，最终得到Top-M最相关概念。The rapid discovery method of concept relationship based on sovereign iterative search combines the Boolean model and the vector space model, learns from each other, uses the efficiency advantage of the Boolean model to perform rough calculations, limits the candidate related concepts to a small order of magnitude, and then uses the accuracy advantage of the vector space model Carry out detailed calculations, and finally get the most relevant concepts of Top-M.

如图1所示，首先，我们对所有概念进行编号，代替概念词语进行存储以节省空间，之后对所有概念的语义特征按照权重降序，并将特征向量归一化为单位向量。单位向量的模为1，便于在计算向量的夹角余弦，也便于横向比较同一特征在不同向量中的权重。此外，以“概念编号-特征向量”的键值关系构造索引，方便随机快速的获取任一概念编号对应的特征向量。为了和后续构建的特征倒排索引相区别，我们称此处构建的索引为特征向量正向索引。As shown in Figure 1, first, we number all concepts and store them instead of concept words to save space, and then sort the semantic features of all concepts in descending order of weight, and normalize the feature vectors into unit vectors. The modulus of the unit vector is 1, which is convenient for calculating the cosine of the included angle of the vector, and also convenient for horizontally comparing the weights of the same feature in different vectors. In addition, the index is constructed with the key-value relationship of "concept number-feature vector", which is convenient to randomly and quickly obtain the feature vector corresponding to any concept number. In order to distinguish it from the feature inverted index constructed later, we call the index constructed here the feature vector forward index.

然后，将每个单位特征向量看作一个文档，以向量特征为倒排词条，构造倒排索引，索引拉链中的概念编号按照特征权重大小降序排列。如图2所示，这种结构可以方便快速地随机访问任一语义特征横向关联的概念集合，概念按权重大小排序，利于由强到弱的顺序访问关联概念。Then, each unit feature vector is regarded as a document, and the vector feature is used as an inverted entry to construct an inverted index. The concept numbers in the index zipper are arranged in descending order according to the feature weight. As shown in Figure 2, this structure can conveniently and quickly randomly access the set of concepts associated with any semantic feature horizontally, and the concepts are sorted by weight, which is conducive to accessing related concepts in order from strong to weak.

特征向量的正向索引和特征倒排索引是后续搜索迭代的结构基础和效率保障，索引构造完毕后即可按照搜索迭代方法进行概念关系发现。为方便方法描述，我们将待计算相关概念的概念称为目标概念。The forward index of the feature vector and the inverted index of the feature are the structural basis and efficiency guarantee of the subsequent search iteration. After the index is constructed, the concept relationship can be discovered according to the search iteration method. For the convenience of method description, we refer to the concept of the related concept to be calculated as the target concept.

搜索迭代方法包括“搜索器”和“候选相关概念集”两部分。候选相关概念集由候选相关概念组成，初始为空集。搜索器以目标概念的高权重特征为关联特征进行关联搜索，选出候选相关概念，并放入候选相关概念集。候选概念同目标概念具有相同的关联特征，使用这些动态变化的关联特征计算候选概念与目标概念的相关度，我们称为“伪相关度”。之所以称其为“伪相关度”，是因为其仅是目标概念与候选概念部分匹配特征计算得到。但由于关联特征是目标概念的主要权重特征，而候选概念又是从关联特征列表的首部取得，因此，“伪相关度”是“真实相关度”的主要部分，当关联特征趋近于目标概念的全部特征时，“伪相关度”等于“真实相关度”。因为特征向量均已被单位归一化，模为1，所以“伪相关度”的计算无需考虑向量的模，进而可实现增量计算，提高计算效率。迭代搜索流程框架如图3所示，图中将前两位的主权特征搜索合并在一起进行展示。图4是基于主权迭代搜索的关系发现流程示意图。The search iteration method includes two parts: "searcher" and "candidate related concept set". The candidate related concept set is composed of candidate related concepts, and is initially an empty set. The searcher uses the high-weight features of the target concept as the associated features to perform an associated search, selects candidate related concepts, and puts them into the candidate related concept set. Candidate concepts have the same associated features as the target concept, and these dynamically changing associated features are used to calculate the correlation between the candidate concept and the target concept, which we call "pseudo-correlation". The reason why it is called "pseudo-correlation" is that it is only calculated from the partial matching features between the target concept and the candidate concept. However, since the correlation feature is the main weight feature of the target concept, and the candidate concept is obtained from the head of the correlation feature list, the "false correlation" is the main part of the "true correlation". When the correlation feature is close to the target concept When all the features of , the "false correlation" is equal to the "true correlation". Because the eigenvectors have been normalized by the unit, the modulus is 1, so the calculation of the "false correlation degree" does not need to consider the modulus of the vector, and then the incremental calculation can be realized, and the calculation efficiency can be improved. The iterative search process framework is shown in Figure 3, where the first two sovereign feature searches are combined for display. Fig. 4 is a schematic diagram of the relationship discovery process based on sovereign iterative search.

迭代搜索的过程有两个关键策略步骤，一是搜索到关联特征倒排拉链后，选取多少候选概念，我们称为“候选概念选取策略”；二是候选相关概念集何时饱和，停止关联搜索，我们称为“候选概念集饱和控制策略”。带监督器的滑动窗口可以用过调控监督器一定程度上保证剪枝掉的元素较大概率的不会出现在目标结果集中。我们亦将带监督器的滑动窗口引入此处的两个策略步骤中，同特征降维一样，监督器的参数亦通过实验得出。There are two key strategic steps in the iterative search process. One is how many candidate concepts are selected after searching for the associated feature inverted zipper, which we call "candidate concept selection strategy"; the second is when the set of candidate related concepts is saturated and the associated search is stopped. , which we refer to as the "candidate concept set saturation control strategy". The sliding window with a supervisor can use the over-regulation supervisor to ensure that the pruned elements will not appear in the target result set with a high probability to a certain extent. We also introduce a sliding window with a supervisor into the two strategy steps here. Like feature dimensionality reduction, the parameters of the supervisor are also obtained through experiments.

带监督器的滑动窗口需要设定窗口大小winLen和幅度阈值δ。由于我们欲获取Top-M个相关概念，因此，设定候选概念选取策略的滑动窗口大小为winLen₁＝M，幅度阈值δ₁则通过实验得出。候选概念集饱和控制策略滑动窗口大小设为winLen₂，幅度阈值设为δ₂。A sliding window with a supervisor requires setting the window size winLen and the magnitude threshold δ. Since we want to obtain Top-M related concepts, the sliding window size of the candidate concept selection strategy is set as winLen ₁ =M, and the amplitude threshold δ ₁ is obtained through experiments. The sliding window size of the candidate concept set saturation control strategy is set to winLen ₂ , and the amplitude threshold is set to δ ₂ .

如图5所示，候选概念选取策略的执行步骤如下：As shown in Figure 5, the execution steps of the candidate concept selection strategy are as follows:

步骤401，搜索器使用关联特征为查询键，在特征倒排索引中执行查询，得到候选概念拉链，进入步骤402。Step 401 , the searcher uses the associated feature as a query key to perform a query in the feature inverted index to obtain a candidate concept zipper, and enters step 402 .

步骤402，将窗口为M的滑动窗口置于拉链首部，若元素不多于M，则全部选为候选概念，如果元素多于M则进入步骤404。Step 402, place the sliding window whose window is M at the head of the zipper, if there are no more than M elements, select all of them as candidate concepts, and if there are more elements than M, go to step 404.

步骤404，监督器监测监督条件，即将首尾权重减幅和幅度阈值δ₁进行对比进入步骤405。In step 404, the supervisor monitors the supervisory conditions, that is, compares the head-to-tail weight reduction with the amplitude threshold _δ1 and proceeds to step 405.

步骤405，若首尾权重减幅小于或等于幅度阈值，进入步骤407；若首尾权重减幅大于幅度阈值，进入步骤406。In step 405, if the decrease of the first and last weights is less than or equal to the amplitude threshold, go to step 407; if the decrease of the first and last weights is greater than the amplitude threshold, go to step 406.

步骤406，则将窗口后的元素剪掉，剩余即选为候选概念。In step 406, the elements behind the window are cut off, and the rest are selected as candidate concepts.

步骤407，滑动窗口移动一位，并返回到步骤404。Step 407, move the sliding window by one bit, and return to step 404.

如图6所示候选概念集饱和控制策略的执行步骤如下：As shown in Figure 6, the execution steps of the candidate concept set saturation control strategy are as follows:

步骤501，搜索候选概念，若候选概念已存在于候选相关概念集中，进入步骤502；若候选概念为新增概念进入步骤503。Step 501, search for candidate concepts, if the candidate concept already exists in the set of candidate related concepts, go to step 502; if the candidate concept is a newly added concept, go to step 503.

步骤502，对关联特征增量计算“伪相关度”，进入步骤504。Step 502, calculate the "false correlation degree" for the correlation feature increment, and go to step 504.

步骤503，搜索候选概念，使用关联特征计算“伪相关度”，进入步骤504。Step 503 , search for candidate concepts, calculate "pseudo-correlation" using associated features, and proceed to step 504 .

步骤504，根据“伪相关度”重新调整候选概念拉链顺序，进入步骤505。Step 504, according to the "pseudo-correlation", readjust the order of candidate concept zippers, and go to step 505.

步骤505，滑动窗口置于拉链首部进入步骤506。Step 505, place the sliding window at the head of the zipper and proceed to step 506.

步骤506，判断元素数和winLen₂的大小，若元素数不大于winLen₂，重新执行候选概念选取策略进入子流程4；若元素数不小于winLen₂，则进入步骤507。Step 506, judge the number of elements and the size of winLen ₂ , if the number of elements is not greater than winLen ₂ , re-execute the candidate concept selection strategy and enter sub-process 4; if the number of elements is not less than winLen ₂ , enter step 507.

步骤507，监督器检查首尾候选概念“伪相关度”减幅，与幅度阈值δ₂比较进入步骤508。In step 507, the supervisor checks the reduction of the first and last candidate concept "pseudo-correlation" and compares it with the amplitude threshold δ ₂ to enter step 508.

步骤508，若“伪相关度”减幅不小于δ₂，进入步骤509；若“伪相关度”减幅小于δ₂，则进入步骤510。In step 508, if the decrease in "false correlation degree" is not less than δ ₂ , go to step 509; if the decrease in "false correlation degree" is smaller than δ ₂ , go to step 510.

步骤509，候选相关概念集达到饱和，停止迭代搜索，候选概念集饱和控制策略流程结束。Step 509 , when the set of candidate related concepts is saturated, the iterative search is stopped, and the flow of the control strategy for the set of candidate concepts is saturated.

步骤510，向拉链尾部滑动窗口进入步骤511。Step 510, slide the window to the tail of the zipper and go to step 511.

步骤511，若已到达拉链尾部，重新执行候选概念选取策略，进入子流程4；否则回到步骤507。Step 511, if the end of the zipper has been reached, re-execute the candidate concept selection strategy and enter sub-process 4; otherwise, return to step 507.

候选概念集达到饱和后，迭代搜索停止，此时就确定了候选相关概集合，集合规模远小于开放百科的全体概念集合。进一步精确计算目标概念同候选相关概念的相关度，排序后即可得到Top-M最相关概念。基于主权迭代搜索的概念关系发现方法整体数据流程如图6所示，具体步骤如下：After the candidate concept set reaches saturation, the iterative search stops, and at this time, the candidate related concept set is determined, and the set size is much smaller than the entire concept set of Open Encyclopedia. Further accurately calculate the correlation between the target concept and the candidate related concepts, and after sorting, the Top-M most related concepts can be obtained. The overall data flow of the conceptual relationship discovery method based on sovereign iterative search is shown in Figure 6, and the specific steps are as follows:

步骤601，目标概念搜索开始，进入步骤603。In step 601, the target concept search starts, and the process goes to step 603.

步骤602，对所有概念进行编号，构建特征向量正向索引。Step 602, number all the concepts, and construct the forward index of the feature vector.

步骤603，搜索器根据特征向量正向索引搜索特征向量，进入步骤604。In step 603, the searcher searches for the feature vector according to the forward index of the feature vector, and proceeds to step 604.

步骤603，根据搜索到的特征向量获取主权特征，进入步骤606。Step 603, obtain the sovereign feature according to the searched feature vector, and go to step 606.

步骤605，以向量特征为倒排词条，构造特征倒排索引。Step 605, constructing a feature inverted index with vector features as inverted entries.

步骤606，搜索器以目标概念的高权重特征为关联特征进行关联搜索，选出候选相关概念，并放入候选相关概念集。Step 606, the searcher uses the high-weight features of the target concept as associated features to perform an associated search, selects candidate related concepts, and puts them into the candidate related concept set.

步骤4，候选概念选取策略子流程得到候选概念，进入步骤5。Step 4, the candidate concept selection strategy sub-process obtains candidate concepts, and proceeds to step 5.

步骤5，候选概念集饱和控制策略子流程搜索到关联特征倒排拉链后，选取候选概念进入步骤607。Step 5, after the sub-flow of the candidate concept set saturation control strategy searches for the associated feature inverted zipper, select a candidate concept and enter step 607.

步骤607，判断候选概念集是否已达到饱和，若候选概念集已达到饱和迭代搜索停止，进入步骤608；候选概念集未达到饱和则返回到步骤603。Step 607, judge whether the candidate concept set has reached saturation, if the candidate concept set has reached saturation, stop the iterative search, go to step 608; if the candidate concept set is not saturated, return to step 603.

步骤608，精确计算目标概念同候选相关概念的相关度，排序后即可得到Top-M最相关概念，进入步骤609。Step 608, accurately calculate the degree of correlation between the target concept and the candidate related concepts, and after sorting, the Top-M most related concepts can be obtained, and go to step 609.

步骤609，基于主权迭代搜索的概念关系发现方法流程结束。In step 609, the flow of the method for discovering concept relationships based on sovereign iterative search ends.

特征向量正向索引、向量特征倒排索引、搜索器、候选相关概念集、候选概念选取策略和候选概念集饱和控制策略组成了基于主权迭代搜索的关系发现系统。Feature vector forward index, vector feature inverted index, searcher, candidate related concept set, candidate concept selection strategy and candidate concept set saturation control strategy constitute a relation discovery system based on sovereign iterative search.

在此基础上，进一步对基于主权迭代搜索的进行验证，求取候选概念选择策略和候选概念集饱和控制策略的最优滑动窗口参数，并同“布尔模型下的关系发现”和“向量空间模型下穷举关系发现”在效率和召回率上进行对比。On this basis, the algorithm based on sovereign iterative search is further verified, and the optimal sliding window parameters of candidate concept selection strategy and candidate concept set saturation control strategy are obtained, and compared with "relationship discovery under Boolean model" and "vector space model The next exhaustive relationship discovery" is compared in efficiency and recall rate.

效率问题是概念关系发现需重点解决的问题，也是基于主权迭代搜索方法提出的目的，因此，时间效率是本实验的重要考察指标。计算一个概念的Top-M相关概念所需的时间记为Time_relM，我们使用平均时间作为衡量一个概念关系发现方法的时间效率指标。Efficiency is the key problem to be solved in concept relationship discovery, and it is also the purpose of the sovereign iterative search method. Therefore, time efficiency is an important indicator of this experiment. The time required to calculate the Top-M related concepts of a concept is recorded as Time _rel M, and we use the average time As a measure of the time efficiency of a concept relation discovery method.

概念关系发现的平均用时固然可以反应方法的时间效率，但是由于迭代搜索方法分为多个步骤，为更清晰的分析时间效率在各个步骤之间的分布以找出效率瓶颈，我们追加迭代搜索次数和饱和候选相关概念集大小作为两个辅助时间效率指标。每一次迭代搜索需要查询特征向量正向索引，饱和候选相关概念集的大小直接影响后续准确相关度的计算量，因此这两个指标能够很好的反映迭代搜索的效率变化。The average time spent on concept relationship discovery can certainly reflect the time efficiency of the method, but since the iterative search method is divided into multiple steps, in order to more clearly analyze the distribution of time efficiency among the steps to find out the efficiency bottleneck, we add the number of iterative searches and saturation candidate related concept set size as two auxiliary time efficiency metrics. Each iterative search needs to query the forward index of the feature vector, and the size of the saturated candidate related concept set directly affects the calculation amount of the subsequent accurate correlation. Therefore, these two indicators can well reflect the efficiency change of the iterative search.

我们将向量空间模型下穷举概念相关度计算得到的Top-M相关概念作为召回标准，计算关系发现的召回率，将穷举概念相关度计算得到的Top-M相关概念集合记为Φ_std，将待评估相关概念集合记为Φ_est，使用Card(Φ)表示可数集合包含的元素个数，则召回率计算公式如下：We use the Top-M related concepts calculated by the exhaustive concept correlation under the vector space model as the recall standard, calculate the recall rate of relationship discovery, and record the Top-M related concepts set obtained by the exhaustive concept correlation calculation as Φ _std , Denote the set of related concepts to be evaluated as Φ _est , and use Card(Φ) to represent the number of elements contained in the countable set. The formula for calculating the recall rate is as follows:

$recall recall (({Φ Φ}_{est est} | | {Φ Φ}_{std std})) = = \frac{card card (({Φ Φ}_{std std} \cap \cap {Φ Φ}_{est est}))}{card card (({Φ Φ}_{std std}))} * * 100100 % % - - - - - - ((33))$

尽管召回率不能反应相关概念在不同位序缺失的差异性，但是相关概念的质量评估更注重级差质量，因此，召回率能够直接反应基于搜索的快速关系发现方法为计算效率付出的结果质量代价。Although the recall rate cannot reflect the difference in the absence of related concepts in different positions, the quality evaluation of related concepts pays more attention to the quality of the difference. Therefore, the recall rate can directly reflect the result quality price paid by the search-based fast relationship discovery method for computational efficiency.

上述基于主权迭代搜索的关系发现方法有三个参数需要实验确定，候选概念选取策略幅度阈值δ₁，候选概念集饱和控制策略滑动窗口大小winLen₂和幅度阈值δ₂。其中δ₁∈(0，1)，winLen₂∈[M，+∞)，δ₂∈(0，1)。这三个变量同候选概念选取策略滑动窗口大小winLen₁＝M之间相互影响，关系复杂。同时实验三个变量会增加实验分组使实验结果而复杂难于分析。The above relationship discovery method based on sovereign iterative search has three parameters to be determined experimentally, candidate concept selection strategy amplitude threshold δ ₁ , candidate concept set saturation control strategy sliding window size winLen ₂ and amplitude threshold δ ₂ . where δ ₁ ∈ (0, 1), winLen ₂ ∈ [M, +∞), δ ₂ ∈ (0, 1). These three variables interact with the candidate concept selection strategy sliding window size winLen ₁ =M, and the relationship is complicated. Simultaneously experimenting with three variables will increase the experimental grouping and make the experimental results complicated and difficult to analyze.

候选概念选取策略幅度阈值δ₁的大小会影响单次搜索选取的概念数，进而在候选概念集饱和控制策略滑动窗口大小winLen₂和幅度阈值δ₂确定的情况下影响迭代搜索的次数。幅度阈值δ₁过大会使单次搜索选取的概念数过大，进而可能导致迭代搜索的次数下降而降低召回率；幅度阈值过小则会使单次搜索选取的概念数偏少，进而可能导致搜索迭代次数上升而使时间效率降低。由于候选概念选取策略的滑动窗口已经确定，经过反复试验的经验，单次搜索选取的概念数在[M，2M]之间为佳，因此，可先通过单次搜索选取概念数作为结果反馈缩小幅度阈值的范围。The magnitude threshold δ ₁ of the candidate concept selection strategy will affect the number of concepts selected in a single search, and then affect the number of iterative searches when the sliding window size winLen ₂ and the magnitude threshold δ ₂ of the candidate concept set saturation control strategy are determined. If the magnitude threshold δ ₁ is too large, the number of concepts selected in a single search will be too large, which may lead to a decrease in the number of iterative searches and reduce the recall rate; if the magnitude threshold is too small, the number of concepts selected in a single search will be too small, which may lead to The increase in the number of search iterations reduces the time efficiency. Since the sliding window of the candidate concept selection strategy has been determined, the number of concepts selected by a single search is preferably between [M, 2M] through trial and error experience. Therefore, the number of concepts selected by a single search can be used as the result feedback to narrow down The range for the amplitude threshold.

确定候选概念选取策略的参数后，可对候选相关概念集饱和控制策略的两个参数联合实验。先根据召回率的走势确定滑动窗口的大小，然后配以多组幅度阈值组合分析实验的召回率和时间效率，确定最优参数组合。After determining the parameters of the candidate concept selection strategy, the two parameters of the candidate related concept set saturation control strategy can be jointly tested. First determine the size of the sliding window according to the trend of the recall rate, and then analyze the recall rate and time efficiency of the experiment with multiple sets of amplitude threshold combinations to determine the optimal parameter combination.

参数确定后，基于主权迭代搜索的关系发现方法达到已知最优状态，进而同布尔模型下的搜索方法、向量空间模型下的枚举方法进行召回率和时间效率的对比实验，以验证方法的有效性。After the parameters are determined, the relationship discovery method based on sovereign iterative search reaches the known optimal state, and then compares the recall rate and time efficiency with the search method under the Boolean model and the enumeration method under the vector space model to verify the effectiveness of the method. effectiveness.

如图7所示，主权迭代搜索的关系发现方法同布尔模型下的搜索方法、向量空间模型下的枚举方法进行召回率和时间效率的对比实验大致分为三个阶段，候选概念选取策略幅度阈值的确定、候选相关概念集饱和控制参数的确定、关系发现方法效率和召回率对比。As shown in Figure 7, the comparison experiment of the relationship discovery method of sovereign iterative search with the search method under the Boolean model and the enumeration method under the vector space model is roughly divided into three stages in terms of recall rate and time efficiency. The determination of the threshold, the determination of the saturation control parameters of the candidate related concept set, and the comparison of the efficiency and recall rate of the relationship discovery methods.

(1)候选概念选取策略幅度阈值实验(1) Candidate concept selection strategy amplitude threshold experiment

设计2*9组实验，M取值为10和20，δ₁取值为5％、10％、15％、20％、25％、30％、35％、40％、45％。每组实验计算目标概念集中所有目标概念的单次搜索选取的平均概念数，每个概念选取前M位的特征进行关联搜索。结果如下表2所示：Design 2*9 groups of experiments, the values of M are 10 and 20, and the values of _δ1 are 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%. Each group of experiments calculates the average number of concepts selected by a single search of all target concepts in the target concept set, and selects the top M features of each concept for association search. The results are shown in Table 2 below:

表2单次搜索选取概念数平均值Table 2 The average number of concepts selected in a single search

如图8所示，单次搜索选取概念数平均值的2倍M值出现在幅度阈值25％和30％之间，由于参数之间是相互影响的，所以单参数的小范围浮动可以在其他参数中调整达到最优。通过本实验结果分析，我们选用更接近2倍M的25％作为候选概念选取策略的幅度阈值继续后续试验。As shown in Figure 8, the M value of 2 times the average value of the number of concepts selected in a single search appears between 25% and 30% of the amplitude threshold. Since the parameters affect each other, the small-scale fluctuation of a single parameter can be found in other The parameters are adjusted to the optimum. Through the analysis of the results of this experiment, we choose 25% which is closer to 2 times M as the magnitude threshold of the candidate concept selection strategy to continue the follow-up experiment.

(2)候选相关概念集饱和控制策略参数实验(2) Candidate related concept set saturation control strategy parameter experiment

饱和控制策略参数是本发明的重要实验内容，因为该策略对关系发现的召回率和时间效率有着决定性的影响。实际操作中，该实验经过多次反复迭代才得以完成，实验发现，待发现关系数M处于不同范围时，策略参数有着很大的差异。为了方便描述，现将实验划分为两部分，第一部分针对M≥10的情况进行实验描述并分析总结，第二部分对0＜M＜10的情况进行实验并分析。The saturation control strategy parameter is an important experimental content of the present invention, because this strategy has a decisive impact on the recall rate and time efficiency of relation discovery. In actual operation, the experiment was completed after several iterations. The experiment found that when the relationship number M is found to be in different ranges, the strategy parameters are very different. For the convenience of description, the experiment is now divided into two parts. The first part is for the experimental description and analysis of the situation of M≥10, and the second part is for the experiment and analysis of the situation of 0<M<10.

首先，阐述实验的第一部分。设计4*10*7组实验参数，其中M值为4组、滑动窗口为5组，幅度阈值为7组。幅度阈值的实验设定值为20％、30％、40％、50％、60％、70％和80％。滑动窗口大小以M值为初始值，首先翻倍增长直到接近或变为M2，然后按等量线性增长。M值和滑动窗口的取值组合如下表3所示：First, the first part of the experiment is explained. Design 4*10*7 groups of experimental parameters, in which the M value is 4 groups, the sliding window is 5 groups, and the amplitude threshold is 7 groups. The experimental settings for the amplitude threshold are 20%, 30%, 40%, 50%, 60%, 70% and 80%. The size of the sliding window takes M as the initial value, first doubles until it approaches or becomes M2, and then increases linearly by the same amount. The combination of the M value and the sliding window is shown in Table 3 below:

WinLen10 WinLen10 10 10 20 20 40 40 60 60 80 80 100 100 150 150 200 200 250 250 300 300

WinLen20 WinLen20 20 20 40 40 80 80 160 160 240 240 320 320 400 400 500 500 600 600 700 700 WinLen30 WinLen30 30 30 60 60 120 120 240 240 480 480 900 900 1000 1000 1100 1100 1200 1200 1300 1300 WinLen50 WinLen50 50 50 100 100 200 200 400 400 800 800 1600 1600 2500 2500 2600 2600 2700 2700 2800 2800

表3M值和滑动窗口实验取值Table 3 M value and sliding window experimental value

每组实验参数都可生成一个特定的饱和控制策略，使用每一组参数进行迭代搜索方式的关系发现实验，求取目标概念集的相关概念，记录候选概念集的饱和值。同时使用枚举法计算得到标准相关概念集合，然后根据召回率公式计算关系发现的召回率。Each set of experimental parameters can generate a specific saturation control strategy, use each set of parameters to conduct an iterative search method for relationship discovery experiments, obtain related concepts of the target concept set, and record the saturation value of the candidate concept set. At the same time, the enumeration method is used to calculate the set of standard related concepts, and then the recall rate of relationship discovery is calculated according to the recall rate formula.

我们求取记录的饱和相关概念集概念数平均值，在4组不同的M值下，以幅度阈值为横轴变量，以概念数值为纵轴，制作不同滑动窗口下的饱和集概念数随幅度阈值的变化曲线。由于概念数会随着滑动窗口的增大超出线性的速度增长，因此我们使用指数变化的纵轴，以方便观察不同滑动窗口下的曲线。We calculate the average value of the concept number of the recorded saturated related concept set. Under 4 different M values, the amplitude threshold is used as the horizontal axis variable, and the concept value is used as the vertical axis to make the concept number of the saturated set under different sliding windows. Threshold curve. Since the number of concepts will increase beyond a linear rate as the sliding window increases, we use an exponentially changing vertical axis to facilitate observation of the curves under different sliding windows.

图9示出了饱和相关集概念数平均值，如下图9包含的四幅小图所示，从左至右，从上至下分别为M取值10、20、30、50时的饱和集概念数曲线图，记为饱和集10，饱和集20，饱和集30和饱和集50。从这四幅图中可以发现，当滑动窗口较小时，饱和集概念数会对幅度阈值比较敏感，随着幅度阈值的增加而增大，但是随着滑动窗口的增加，敏感度会大大减弱，甚至达到一定窗口大小时，饱和集概念数基本不变，只是在幅度阈值较大时会有少量增长，但就增长的相对比例而言基本可以忽略。Figure 9 shows the average number of saturated related set concepts, as shown in the four small pictures contained in Figure 9 below, from left to right, and from top to bottom are the saturated set concepts when M is 10, 20, 30, and 50 The graphs are denoted as Saturation Set 10, Saturation Set 20, Saturation Set 30 and Saturation Set 50. From these four figures, it can be found that when the sliding window is small, the concept number of the saturated set is more sensitive to the magnitude threshold, which increases with the increase of the magnitude threshold, but with the increase of the sliding window, the sensitivity will be greatly weakened, even When a certain window size is reached, the number of concepts in the saturated set basically remains unchanged, except that there will be a small increase when the magnitude threshold is large, but the relative proportion of the increase can basically be ignored.

四幅图中都出现了较小滑动窗口饱和集超出较大滑动窗口的现象，此现象可如下进行解释：当滑动窗口很小时，幅度变化监督器只能监测相对较小的范围，当该范围小于概念相关度的局部区分能力，就会导致迭代搜索持续不止，最终候选相关概念集的概念数就有可能超过较大滑动窗口在同一幅度阈值下的水平。因此，滑动窗口要对着M值得增大而适量增加，避免出现迭代搜索持续不止的情况。The phenomenon that the smaller sliding window saturation set exceeds the larger sliding window appears in the four figures. This phenomenon can be explained as follows: when the sliding window is small, the amplitude change monitor can only monitor a relatively small range, and when the range is smaller than The local discrimination ability of concept correlation will lead to continuous iterative search, and the number of concepts in the final candidate related concept set may exceed the level of the larger sliding window under the same amplitude threshold. Therefore, the sliding window should be increased appropriately as the value of M increases, so as to avoid the situation where the iterative search continues.

接下来，对目标概念集的关系召回率计算平均值，在4组不同的M值下制作滑动窗口和浮动阈值不同组合情况下的平均召回率折线图。其中横坐标为滑动窗口大小，纵坐标为召回率，每个幅度阈值对应一条彩色曲线。如下四幅图从左至右，从上至下分别为M取值10、20、30、50时的召回率曲线图，记为召回率10，召回率20，召回率30和召回率50。Next, calculate the average value of the relational recall rate of the target concept set, and make a line chart of the average recall rate under different combinations of sliding windows and floating thresholds under 4 different M values. The abscissa is the size of the sliding window, the ordinate is the recall rate, and each magnitude threshold corresponds to a colored curve. The following four figures are from left to right and from top to bottom the recall rate curves when M values are 10, 20, 30, and 50, which are recorded as recall rate 10, recall rate 20, recall rate 30, and recall rate 50.

从图10所包含的四幅图中容易看出，随着滑动窗口的增加，各幅度阈值下的召回率变化趋势是相似的。当幅度阈值不变时，随着滑动窗口的增大，召回率也会随之增加，但增速由快至缓，直到升至(0.95，1)区间后，趋于平缓。当滑动窗口大小不变时，随着幅度阈值的增加，召回率会有所上升，但是并不能在滑动窗口较小时达到一个令人满意的值。It is easy to see from the four graphs included in Fig. 10 that as the sliding window increases, the changing trends of the recall rate at each amplitude threshold are similar. When the amplitude threshold is constant, the recall rate will increase with the increase of the sliding window, but the growth rate will be from fast to slow until it rises to the (0.95, 1) interval and then tends to be flat. When the size of the sliding window is constant, the recall rate increases with the increase of the magnitude threshold, but it cannot reach a satisfactory value when the sliding window is small.

总结分析召回率实验结果可以得到以下结论：Summarizing and analyzing the results of the recall experiment can draw the following conclusions:

1)当滑动窗口不变时，可通过增大幅度阈值的方法提高召回率，但是幅度阈值的调节能力有限，若滑动窗口太小，仅通过调节幅度阈值无法获取令人满意的召回率。1) When the sliding window remains unchanged, the recall rate can be improved by increasing the amplitude threshold, but the adjustment ability of the amplitude threshold is limited. If the sliding window is too small, a satisfactory recall rate cannot be obtained only by adjusting the amplitude threshold.

2)当幅度阈值不变时，可通过增大滑动窗口的方法提高召回率，但是滑动窗口值无需一味增大，当窗口增大到一定程度时召回率达到一定水平将不会再有明显的提升2) When the magnitude threshold remains unchanged, the recall rate can be increased by increasing the sliding window, but the sliding window value does not need to be increased blindly. When the window increases to a certain extent, the recall rate reaches a certain level and there will be no obvious difference. promote

3)当滑动窗口达到一定大小时，幅度阈值在实验中看不出明显的作用。当幅度阈值极尽趋于零时可能会有召回率的下降，但是此种情况不具有实际意义。3) When the sliding window reaches a certain size, the amplitude threshold has no obvious effect in the experiment. There may be a drop in recall as the magnitude threshold approaches zero as far as possible, but this situation is not practical.

4)召回率在超过95％后就不再由明显的变化，当滑动窗口大小达到一定数值时均可使召回率达到95％，而与幅度阈值的关系不甚明显。4) The recall rate does not change significantly after exceeding 95%. When the size of the sliding window reaches a certain value, the recall rate can reach 95%, but the relationship with the amplitude threshold is not obvious.

增加浮动阈值并不能保证召回率一定会上升，在300万的基数上增加滑动窗口大小会对计算效率产生很大影响，更何况换来的召回率并不能保证相关系数的增加。因此，我们在95％召回的条件下选择更小的滑动窗口。Increasing the floating threshold does not guarantee that the recall rate will increase. Increasing the size of the sliding window on a base of 3 million will have a great impact on computational efficiency, not to mention that the recall rate in exchange cannot guarantee an increase in the correlation coefficient. Therefore, we choose a smaller sliding window under the condition of 95% recall.

图中的深色线标出在各M值下，召回率快速增长区和稳定区的边界，实验证明不同的幅度阈值对边界的影响很小。因为一个确定的滑动窗口大小对候选相关概念饱和控制策略非常重要，所以我们希望通过分析拟合M值同边界的关系。记录M值与边界窗口的关系如下表4所示：The dark lines in the figure mark the boundaries of the fast-growing region and the stable region of the recall rate under each value of M. Experiments have proved that different amplitude thresholds have little influence on the boundaries. Because a certain sliding window size is very important for the candidate correlation concept saturation control strategy, we hope to analyze the relationship between the M value and the boundary by fitting. The relationship between the record M value and the boundary window is shown in Table 4 below:

M值 M value 10 10 20 20 30 30 50 50 边界滑动窗口 Boundary sliding window 96 96 385 385 1017 1017 2530 2530

表4M值和边界滑动窗口的对应关系Table 4 Correspondence between M value and boundary sliding window

我们选用了常用拟合方法——多项式拟合对M值和边界滑动窗口之间的关系数据进行处理，并对拟合结果进行简化，得出如下公式：We chose a common fitting method - polynomial fitting to process the relationship data between the M value and the boundary sliding window, and simplified the fitting results to obtain the following formula:

WinLen₂＝M² (4)WinLen ₂ = M ² (4)

虽然拟参考数据只有四对，但是公式对关系数据基本吻合，而且对本方法的工程需求而言，确定的关系本身意义大于关系的精准性意义。对于此滑动窗口值，可以满足关系发现的需要。Although there are only four pairs of reference data, the formula is basically consistent with relational data, and for the engineering requirements of this method, the meaning of the determined relationship itself is greater than the significance of the accuracy of the relationship. For this sliding window value, the need for relation discovery can be met.

为了对比布尔模型下搜索关系发现、向量空间模型下枚举关系发现和基于主权迭代搜索的关系发现等三种关系发现方法的时间效率和召回率，我们设计了3*10组实验，每组实验对目标概念集中的所有概念计算Top-M最相关概念，记录并计算平均召回率和用时，并制作曲线图。In order to compare the time efficiency and recall rate of three relation discovery methods, namely search relation discovery under the Boolean model, enumeration relation discovery under the vector space model, and relation discovery based on sovereign iterative search, we designed 3*10 groups of experiments, each group of experiments Calculate the Top-M most relevant concepts for all concepts in the target concept set, record and calculate the average recall rate and time, and make a graph.

如图11所示，其中左边的一张为三种关系发现方法的效率对比图，横轴为5～50的M取值，纵轴为执行关系发现的用时，由于三种方法效率差别较大，超出了线性范围，故纵轴使用以10为底的对数刻度，范围为0.0001秒到10000秒。As shown in Figure 11, the one on the left is a comparison chart of the efficiency of the three relationship discovery methods, the horizontal axis is the M value of 5 to 50, and the vertical axis is the execution time of relationship discovery, because the efficiency of the three methods is quite different , beyond the linear range, so the vertical axis uses a logarithmic scale with base 10, ranging from 0.0001 seconds to 10000 seconds.

右边的一张为三种关系方法的召回率对比图，同样横轴为5～50的M取值，纵轴为召回率百分比。The one on the right is a comparison chart of the recall rate of the three relational methods. The horizontal axis is also the M value of 5 to 50, and the vertical axis is the recall rate percentage.

从效率曲线图可以看出，三种方法的时间效率关系是：布尔搜索高于主权迭代搜索，而后者又高于向量对枚举方法。其中布尔搜索和主权迭代搜索方法效率差别在同一量级，而向量对枚举法效率低了5～6个量级，即平均用时超出数十万倍。随着M值得增加，三种方法平均用时均有增长。由于纵轴坐标为对数刻度，所以图中给出了y＝x的对比线。从三条效率曲线斜率同对比线的比较可以看出，枚举法用时基本随M值得增长成线性增长，而布尔搜索和主权特征迭代搜搜两种方法的增长率高于线性，M值较小时尤为明显。It can be seen from the efficiency curve that the time-efficiency relationship of the three methods is: the Boolean search is higher than the sovereign iterative search, and the latter is higher than the vector pair enumeration method. Among them, the efficiency difference between the Boolean search method and the sovereign iterative search method is in the same order of magnitude, while the efficiency of the vector pair enumeration method is 5-6 orders of magnitude lower, that is, the average time is hundreds of thousands of times higher. As the value of M increases, the average time spent by the three methods increases. Since the coordinates of the vertical axis are in logarithmic scale, the comparison line of y=x is given in the figure. From the comparison of the slopes of the three efficiency curves with the comparison line, it can be seen that the enumeration method basically grows linearly with the increase of M value, while the growth rate of the two methods of Boolean search and sovereign feature iterative search is higher than linear, and when the value of M is small Especially obvious.

从召回率曲线图可以看出，三种方法的召回率关系是：向量对枚举法高于主权迭代搜索，而后者又高于布尔搜索方法。由于向量对枚举是召回率的对比标准，因此其召回率为100％。布尔搜索方法虽然具有很高的效率，但是召回率不尽人意，总体平均水平低于50％，当M较大时甚至无法达到30％。相比之下，主权迭代搜索的平均召回率超出95％，而且对M值不敏感。It can be seen from the recall rate curve that the recall rate relationship of the three methods is: the vector pair enumeration method is higher than the sovereign iterative search, and the latter is higher than the Boolean search method. Since vector-to-enumeration is the benchmark of recall, it has a recall of 100%. Although the Boolean search method has high efficiency, the recall rate is unsatisfactory, the overall average level is lower than 50%, and it cannot even reach 30% when M is large. In contrast, Sovereign Iterative Search achieves an average recall exceeding 95% and is insensitive to the M value.

综合评价主权迭代搜索，效率虽然稍逊于同布尔搜索，但处于同一量级，召回率却超出布尔搜索很多，接近100％。Comprehensive evaluation of sovereign iterative search, although the efficiency is slightly lower than that of Boolean search, but it is in the same order of magnitude, and the recall rate is much higher than Boolean search, close to 100%.

上面对本发明所提供的基于主权迭代搜索的概念关系快速发现方法进行了详细的说明。对本领域的一般技术人员而言，在不背离本发明实质精神的前提下对它所做的任何显而易见的改动，都将构成对本发明专利权的侵犯，将承担相应的法律责任。The method for quickly discovering conceptual relations based on sovereign iterative search provided by the present invention has been described in detail above. For those skilled in the art, any obvious changes made to it without departing from the essence of the present invention will constitute an infringement of the patent right of the present invention and will bear corresponding legal responsibilities.

Claims

1. the quick discover method of the conceptual relation based on iterative search with sovereign right is characterized in that comprising the steps:

Use boolean search to eliminate not comprise identical nonzero element or only to comprise the notion of few nonzero element right; Wherein said boolean search comprises: the semantic feature vector of notion is converted into Boolean expression; And construction feature vector forward index and characteristic inverted index; Use the semantic feature constitutive logic inquiry of target concept, search obtains the related notion collection of target concept in the logical expression set;

Further use the conceptual dependency degree under the enumerative technique compute vector space, try to achieve related notion through ordering.

2. the quick discover method of conceptual relation as claimed in claim 1 is characterized in that:

Conceptual dependency degree step under the said use enumerative technique compute vector space comprises:

A. searcher is according to proper vector forward indexed search proper vector;

B. obtain characteristic with sovereign right according to the proper vector that searches;

C. searcher carries out association search, selects candidate's related notion, and puts into candidate's related notion collection;

D. use candidate's notion to choose strategy and obtain candidate's notion;

E. the saturated control strategy of candidate's concept set searches after linked character arranges slide fastener, chooses candidate's notion;

F. judge whether candidate's concept set reaches capacity,, repeat b～f, reach capacity up to candidate's concept set if candidate's concept set does not reach capacity and then returns b;

F. iterative search stops if candidate's concept set has reached capacity, and the accurate Calculation target concept is with the degree of correlation of candidate's related notion.

3. the quick discover method of conceptual relation as claimed in claim 2 is characterized in that:

Said candidate's notion is chosen strategy and is comprised:

3a. searcher uses linked character to be query key, in the characteristic inverted index, carries out inquiry, obtains candidate's notion slide fastener;

3b. with window is that the moving window of M places the slide fastener stem, if the no more than M of element then all elects candidate's notion as, said M is a natural number;

If 3c. element more than M, monitor monitoring surveillance requirements is about to head and the tail weight amount of decrease and amplitude threshold and compares;

3d. if head and the tail weight amount of decrease is less than or equal to amplitude threshold, moving window moves one and return 3c, repeats 3c～3d, up to head and the tail weight amount of decrease greater than amplitude threshold;

3e. if head and the tail weight amount of decrease is then cut the element behind the window greater than amplitude threshold, residue is promptly elected candidate's notion as.

4. the quick discover method of conceptual relation as claimed in claim 2 is characterized in that:

The saturated control strategy of said candidate's concept set comprises:

4a. the search for candidate notion is concentrated if candidate's notion has been present in candidate's related notion, to linked character incremental computations " spurious correlation degree ", calculates " spurious correlation degree " if candidate's notion is used linked character for newly-increased notion;

4b. readjust candidate's notion slide fastener order;

4c. moving window places the slide fastener stem, judges the size of number of elements and window, if number of elements is not more than window size; Again carry out candidate's notion and choose strategy; If number of elements is not less than window size, then inspection head and the tail candidate's notion " spurious correlation degree " amount of decrease compares with amplitude threshold;

4d. if " spurious correlation degree " amount of decrease less than amplitude threshold, gets into step to slide fastener afterbody moving window, if arrived the slide fastener afterbody; Again carry out candidate's notion and choose strategy; Otherwise return 4c, repeat 4c～4d, be not less than amplitude threshold up to " spurious correlation degree " amount of decrease;

4e. if " spurious correlation degree " amount of decrease is not less than amplitude threshold, candidate's related notion collection reaches capacity, and stops iterative search.

5. the quick discover method of conceptual relation as claimed in claim 1 is characterized in that:

Said construction feature vector forward index comprises:

All notions are numbered, the semantic feature of all notions according to the weight descending, and is normalized to vector of unit length with proper vector;

Key assignments with " notion numbering-proper vector " concerns structural attitude vector forward index.

6. the quick discover method of conceptual relation as claimed in claim 1 is characterized in that:

Said construction feature inverted index comprises:

Regard each unit character vector as a document, be characterized as with vector and arrange entry, the notion numbering in the index slide fastener is according to the descending sort of feature weight size, structural attitude inverted index.