CN102236654A

CN102236654A - Web Invalid Link Filtering Method Based on Content Correlation

Info

Publication number: CN102236654A
Application number: CN2010101559607A
Authority: CN
Inventors: 汪敏; 刘轩山
Original assignee: GUANGDONG UCAP INTERNET INFORMATION TECHNOLOGY CO LTD
Current assignee: GUANGDONG UCAP INTERNET INFORMATION TECHNOLOGY CO LTD
Priority date: 2010-04-26
Filing date: 2010-04-26
Publication date: 2011-11-09

Abstract

The present invention discloses a method for filtering invalid Web links based on content relevance. The method first uses text position information in a web page to remove irrelevant advertising and navigation links by statistical means. It then analyzes the relevance between the content of the web page and the content of each page its links point to, and removes invalid links whose content is irrelevant. The method removes invalid links effectively. When PageRank is computed on the purified link structure graph, the page ranking result improves considerably: the top-ranked pages are of higher quality and more high-value websites are brought in.

Description

Web Invalid Link Filtering Method Based on Content Correlation

Technical Field

The invention relates to a method for filtering invalid links (useless links) in Web pages, in particular to a method for filtering invalid links in Web pages based on content correlation analysis, and belongs to the technical field of Internet search.

Background Art

With the rapid development of the Internet, search engines for Internet information retrieval play an increasingly important role. The main task of a search engine is to find relevant web pages and return them to the user ranked by page importance. As the number of Web pages grows, page content becomes richer, and page links become more varied, search engines increasingly struggle to keep up. There are many reasons for this; an important one is the growing proliferation of invalid links in Web pages.

Analysis shows that the links in Web pages fall into the following four categories:

Manually created links: most of these links are created by a person who compares the content of two web pages and links them because they are related, and the page creator classifies them as "related links", so most such links carry strong recommendation value. Some of them, however, point to pages whose content is not related to the topic of the linking page and is only loosely connected to it on a single point.

Navigation links: these links are generated by the page creator with a dedicated module and are essentially identical across pages of the same site. Their main purpose is to let users move between the different sections of the site. They help users navigate, but they have nothing to do with recommending related pages.

Advertising links: these links are generated by dynamic functions in the web page, generally to serve the commercial interests of the website. They account for a large share of all links; on com-type web pages in particular they make up more than half. They contribute essentially nothing to recommending content-related pages.

Favored links: these are links where the linking page and the page it points to belong to the same site. Site owners add them to recommend new or heavily followed pages of their own site and to raise their click-through rate.

Fig. 1 is a news web page captured from Sina. The links in region ⑤ are manually created links (the first category above). The links in region ① are navigation links (the second category above), which point to the home pages of Sina's other channels. The links in regions ② and ⑥ are favored links (the fourth category above), generated to recommend the latest news of the day; their content shows that they point to pages unrelated to this page. The links in regions ④ and ⑦ are advertising links (the third category above), added by the website for economic gain.

Based on this analysis, the inventors refer to all link types that carry no recommendation value, namely first-category links without topical relevance together with all second-, third-, and fourth-category links, as "invalid links", and to first-category links with topical relevance as "valid links".

Link analysis is one of the most successful approaches to ranking Web pages. Many search engines, including Google and Yahoo!, have achieved great success by using it in combination with factors such as anchor text and term-frequency statistics. The success of link analysis depends largely on the validity of Web page links, or in other words on the soundness of the following assumption: when page A contains a link to page B, the author of page A considers the content of page B important, and in general the content of pages A and B concerns related topics. This content relevance assumption is the foundation on which link analysis rests.

In the early days of the Internet, the links in web pages largely satisfied the content relevance assumption, and propagating relevance between pages was meaningful. As Web technology developed and the number of pages grew, more and more pages came to be generated automatically by page generation tools, so many links lost their relevance meaning and the proportion of invalid links kept rising. At the same time, with the spread of search engines, many site administrators introduced large numbers of useless links to improve their search ranking, and many spam sites appeared. In addition, most commercial websites today have commercial gain as their ultimate goal, which has led to the introduction of large numbers of advertising links. For all these reasons, the content relevance and recommendation value of links on the Web are now seriously threatened. Without processing, the constructed link structure graph no longer correctly reflects the relationships between web pages, and the ranking results obtained from such a graph are no longer accurate or valid.

Summary of the Invention

The technical problem to be solved by the present invention is to provide a Web invalid link filtering method based on content correlation. The method constructs a more reasonable link structure graph before link analysis is performed, and thereby improves the effect of invalid link filtering.

To achieve the above object, the present invention adopts the following technical solution:

A Web invalid link filtering method based on content correlation, characterized by comprising the following steps:

(1) using the text position information in a web page to remove irrelevant advertising links and navigation links from the web page by statistical means;

(2) performing correlation analysis between the content of the web page and the content of the pages its links point to, and removing invalid links whose content is irrelevant.

In step (1), the HTML document is first converted into a DOM tree structure, and the smallest subtree containing the main content and the topic-related links is then located in the DOM tree to obtain the required link information.

For the DOM tree structure, block nodes are first used to divide the DOM tree into subtrees, and the link ratio of each subtree is computed and compared with a predetermined threshold. If the link ratio is below the threshold, the block is marked as the main block; the method then backtracks to the nearest parent block node containing that block, takes this parent block node as the target node, and outputs the links it contains as the basis for subsequent analysis.

In step (2), before the content correlation analysis of the web pages is performed, the text of the web pages is preprocessed and the content that represents each text is extracted for comparison.

The text preprocessing includes the following steps:

The text is first segmented into words; the word frequencies in the text are then counted and TF-IDF vectors are computed, forming a vector space model for the text collection. The feature vectors of the texts are used to compute the content similarity between texts, and the content similarity is used to remove links whose content is irrelevant.

The content similarity is determined by the degree of overlap of the terms contained in the feature vectors of the texts.

Alternatively, the content similarity is determined by the cosine of the angle between the feature vectors of the texts.

In step (2), the content correlation analysis comprises three layers: the first layer analyzes content correlation using the link's entry text (anchor text); the second layer uses the title of the web page; the third layer uses the body content of the web page. If all three layers conclude that the page topics are unrelated, the link is deleted from the link list of the parent page.

The invalid link filtering method provided by the invention is based on content correlation analysis. The filtered links reflect the relationships between web pages more faithfully and make the link relevance assumption more reasonable, which greatly improves the validity of link analysis results.

Brief Description of the Drawings

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a schematic diagram of a news web page captured from Sina;

Fig. 2 shows an example of a DOM tree converted from an HTML document;

Fig. 3 shows the data results after the second filtering step;

Fig. 4 shows the sites hosting the top 100 pages after each of three rankings.

Detailed Description

The Web invalid link filtering method proposed by the invention consists broadly of two parts. The first part uses the text position information in a web page to remove irrelevant links such as advertisements and navigation by statistical means. The second part builds on the first: it analyzes the correlation between the content of the web page and the content of the pages its links point to, and removes the links whose content is irrelevant. Each part is described in detail below.

I. Filtering based on text position

At present, most web pages are built from uniform templates, and in a typical page the topic-related links are placed by the page creator below the body text. This part of the filtering is built on that assumption. The HTML document is first converted into a DOM tree structure, and the smallest subtree containing the main content and the topic-related links is then located in the DOM tree. This yields the required link information and prepares for the subsequent content analysis and removal of topically irrelevant links.

DOM (Document Object Model) is a standard interface specification defined by the W3C, an application programming interface (API) for HTML and XML documents. After an HTML document is parsed it is converted into a DOM tree structure; each node of the DOM tree is an object, and the content of the HTML document is contained entirely in the nodes. Fig. 2 is an example of a DOM tree converted from an HTML document. In one specific embodiment of the invention, the CyberNeko HTML Parser is used to parse the HTML document and generate the DOM tree.

As shown in Fig. 1, an HTML page can be divided into different regions. In a web page, the main part (region ③ in Fig. 1) consists mostly of body text, while the other parts contain comparatively many links. Based on this observation, the invention defines the link ratio:

$\mathrm{LinkRatio}(b_i) = \frac{\mathrm{LinkCount}(b_i)}{\mathrm{ContentLength}(b_i)} \qquad (1)$

where LinkCount(b_i) is the number of links in the i-th block and ContentLength(b_i) is the length of the non-link content in the i-th block. A threshold th is set for the link ratio; a block whose link ratio is below this threshold is regarded as the main part of the web page.

Working on the DOM tree structure, the invention uses block nodes to divide the DOM tree into subtrees, computes the link ratio of each subtree, and compares it with the threshold. If the link ratio is below the threshold, the block is marked as the main block; the method then backtracks to the nearest parent block node containing that block, takes that node as the target node, and outputs the links it contains as the basis for subsequent analysis.

Because the choice of block nodes determines the granularity at which a page is partitioned, the invention, on the basis of experiments, preferentially selects table (div) and tr nodes as block nodes.
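The block-based procedure above can be sketched in Python as follows. This is only an illustrative sketch, not the embodiment's code: it uses the standard-library `html.parser` in place of the CyberNeko parser, and the names `Node`, `TreeBuilder`, `candidate_links`, the threshold value `th=0.05`, the restriction to innermost blocks, and the "longest text wins" tie-break are all assumptions introduced here.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"table", "div", "tr"}                        # block nodes named in the text
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}   # elements without children


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        self.link_count = 0   # <a href> elements in this subtree
        self.text_len = 0     # characters of non-link text in this subtree
        self.links = []       # href values found in this subtree

    def ancestors_and_self(self):
        node = self
        while node is not None:
            yield node
            node = node.parent


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree with per-subtree link statistics."""

    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("root")

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node
        href = dict(attrs).get("href") if tag == "a" else None
        if href:
            for n in node.ancestors_and_self():
                n.link_count += 1
                n.links.append(href)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for n in self.cur.ancestors_and_self():
            if n.tag == tag and n.parent is not None:
                self.cur = n.parent
                return

    def handle_data(self, data):
        length = len(data.strip())
        if length == 0 or any(n.tag == "a" for n in self.cur.ancestors_and_self()):
            return   # whitespace and anchor text are not non-link content
        for n in self.cur.ancestors_and_self():
            n.text_len += length


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def link_ratio(node):
    """Formula (1): links per character of non-link text in the block."""
    return node.link_count / node.text_len if node.text_len else float("inf")


def candidate_links(html, th=0.05):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Assumption: only innermost blocks (no nested block node) are candidates.
    leaves = [n for n in walk(builder.root) if n.tag in BLOCK_TAGS
              and not any(d.tag in BLOCK_TAGS for d in walk(n) if d is not n)]
    mains = [n for n in leaves if link_ratio(n) < th]
    if not mains:
        return []
    main = max(mains, key=lambda n: n.text_len)   # assumed tie-break
    target = main.parent                          # backtrack to the parent block
    while target.parent is not None and target.tag not in BLOCK_TAGS:
        target = target.parent
    return target.links


PAGE = """
<div id="page">
  <div id="nav"><a href="/news">News</a> <a href="/sports">Sports</a></div>
  <div id="content">
    <div id="body">Flood waters receded on Tuesday after a week of heavy rain.</div>
    <div id="related"><a href="/flood-aid">Flood relief efforts</a></div>
  </div>
</div>
"""
print(candidate_links(PAGE))
```

In this toy page the body block has a link ratio of 0, its nearest parent block is the `content` div, and only the link inside that div survives; the navigation links are dropped because they sit outside the target node.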

II. Removing topic-irrelevant links based on text content

Besides link structure, the text associated with a link provides a great deal of information for analyzing a web page: the link's entry text (anchor text), and the title and main content of the page the link points to. The content similarity between this information and the text of the original page can be used to judge whether the link has recommendation value. It is therefore necessary to use this information to filter the links again and remove those that are topically irrelevant.

Compared with other documents, web page text has only limited structure; where some structure exists, it concerns formatting rather than textual content, and the structure is inconsistent across content types. In addition, the text is written in natural language, so apart from full-text matching it is difficult for a computer to judge the content similarity of two texts. Therefore, before the content similarity of two (or more) web pages is analyzed, their text is preprocessed to extract the main content that represents each text, and the extracted content is then compared.

To represent text content, a well-defined text model is needed that yields a representation a computer can process. Text models generally fall into three classes: the Boolean model, the probabilistic model, and the vector space model (Vector Space Model). The vector space model is widely used; it is more precise than the Boolean model and does not need the learning process of the probabilistic model. The invention therefore uses the vector space model for text similarity analysis.

Text preprocessing of the main content of all web pages includes the following steps:

1. Text segmentation

In a text, the word is the smallest meaningful language unit that can be used independently. English words are naturally delimited by spaces, whereas Chinese uses the character as its basic writing unit and has no explicit marker between words. Chinese word segmentation is therefore the foundation of, and the key to, Chinese information processing.

The invention applies text segmentation to each piece of content in order to identify every word in it, so that the content of the text can be reflected by its bag-of-words representation and the feature vectors of the text collection can be formed for subsequent processing.

2. Counting word frequencies

The subsequent calculations require statistics over a large vocabulary, so the occurrence frequency of each word in the set of words segmented from each text is counted and saved.
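Steps 1 and 2 can be illustrated with a short Python sketch. The tokenizer below is only a stand-in and an assumption of this example: it splits ASCII words and treats each CJK character as its own token, whereas the method described here calls for a real Chinese word segmenter. The function names `tokenize` and `term_frequencies` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Stand-in segmenter (assumption): lowercase ASCII words, plus each
    # CJK character as a separate token. A real word segmenter would
    # group Chinese characters into words instead.
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


def term_frequencies(text):
    """Count how often each token occurs in one text (step 2)."""
    return Counter(tokenize(text))


tf = term_frequencies("Link analysis ranks pages; link filtering cleans links.")
print(tf.most_common(3))
```

The resulting `Counter` is the per-text word frequency table that the TF-IDF step builds on.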

3. Computing the TF-IDF (term frequency-inverse document frequency) vector

The vector space model uses the TF-IDF vector representation. A TF-IDF vector reflects the word space of the text collection, with each vector component corresponding to one word. TF-IDF is defined as:

$d(i) = \mathrm{TF\text{-}IDF}(i) = \mathrm{TF}(W_i, Doc) \cdot \mathrm{IDF}(W_i) = \mathrm{TF}(W_i, Doc) \cdot \log\frac{D}{\mathrm{DF}(W_i)} \qquad (2)$

where TF(W_i, Doc) is the frequency of word W_i in text Doc, D is the total number of texts, and DF(W_i) is the number of texts in the collection in which W_i appears at least once.
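Formula (2) can be computed along the following lines. This is a sketch under stated assumptions: TF is taken as the raw count of the word in the text and the logarithm is the natural logarithm, neither of which the text fixes; the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per text.

    Weight = TF(term, text) * log(D / DF(term)), following formula (2).
    Assumptions: raw counts for TF, natural logarithm.
    """
    total = len(docs)                 # D, the total number of texts
    df = Counter()                    # DF: number of texts containing each term
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(total / df[t]) for t in tf})
    return vectors


docs = [["link", "filter", "link"], ["link", "page"], ["page", "rank"]]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

With this definition a word that occurs in every text gets weight 0, since log(D/D) = 0, so it contributes nothing to the similarity calculations.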

4. Forming the vector space

After the feature subset of the texts has been extracted, the vector space model for the text collection can be built. In the vector space model, the text space is regarded as a vector space spanned by a set of orthogonal term vectors.

Each text is represented as a feature vector in this space:

$V(d) = (t_1, w_1(d); \ldots; t_i, w_i(d); \ldots; t_n, w_n(d)) \qquad (3)$

where t_i is the i-th word in text d and w_i(d) is the weight of t_i in text d.

w_i(d) is generally defined as a function of the frequency tf_i(d) with which t_i occurs in text d. The invention uses TF-IDF values for the feature vector, that is, the computed TF-IDF value of each word serves as w_i(d) in the vector.

Once the feature vectors of the texts are available, they can be used to compute the content similarity between two (or more) texts, and the content similarity can be used to remove links whose content is irrelevant.

There are many ways to compute content similarity from feature vectors; the invention mainly uses two. The first considers the degree of overlap of the terms contained in the two feature vectors. The text similarity is defined as:

$sim(d_i, d_j) = \frac{n_{\cap}(d_i, d_j)}{n_{\cup}(d_i, d_j)} \qquad (4)$

where sim(d_i, d_j) is the text similarity between texts d_i and d_j, n_∩(d_i, d_j) is the number of terms that the feature vectors V(d_i) and V(d_j) have in common, and n_∪(d_i, d_j) is the total number of distinct terms contained in V(d_i) and V(d_j).
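As a minimal sketch of formula (4), the overlap measure can be computed on the term sets of two feature vectors. The representation of a feature vector as a `{term: weight}` dictionary and the name `overlap_similarity` are assumptions of this example.

```python
def overlap_similarity(vec_i, vec_j):
    """Formula (4): shared terms divided by all distinct terms.

    vec_i and vec_j are {term: weight} dicts; only the terms matter here.
    Two empty vectors are given similarity 0.0 (assumption).
    """
    terms_i, terms_j = set(vec_i), set(vec_j)
    union = terms_i | terms_j
    return len(terms_i & terms_j) / len(union) if union else 0.0


parent = {"flood": 1.2, "rain": 0.4, "city": 0.3}
target = {"flood": 0.9, "aid": 0.5}
print(overlap_similarity(parent, target))   # 1 shared term out of 4 distinct terms
```

Because only term membership is used, the two vectors do not need the same dimension, which is what makes this measure usable for short texts.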

The second method considers the cosine of the angle between the two feature vectors. The text similarity is defined as:

$sim(d_i, d_j) = \frac{V(d_i) \cdot V(d_j)}{|V(d_i)| \, |V(d_j)|} = \frac{\sum_{m=1}^{n} w_{im} w_{jm}}{\sqrt{\left(\sum_{m=1}^{n} w_{im}^{2}\right) \left(\sum_{m=1}^{n} w_{jm}^{2}\right)}} \qquad (5)$

where sim(d_i, d_j) is the text similarity between texts d_i and d_j, V(d_i) is the feature vector of text d_i, and w_im is the TF-IDF value of word t_m in text d_i.
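Formula (5) can be sketched as follows on sparse `{term: weight}` vectors. Treating a term that is absent from a vector as having weight 0 gives both vectors the same dimension implicitly; returning 0.0 for an all-zero vector and the name `cosine_similarity` are assumptions of this example.

```python
import math


def cosine_similarity(vec_i, vec_j):
    """Formula (5) on sparse {term: weight} vectors; missing terms weigh 0."""
    dot = sum(w * vec_j.get(t, 0.0) for t, w in vec_i.items())
    norm_i = math.sqrt(sum(w * w for w in vec_i.values()))
    norm_j = math.sqrt(sum(w * w for w in vec_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0   # assumption: an empty or all-zero vector matches nothing
    return dot / (norm_i * norm_j)


print(cosine_similarity({"a": 3.0, "b": 4.0}, {"a": 3.0}))
```

Unlike the overlap measure, this one takes the TF-IDF weights into account, so two texts that share only low-weight terms score lower than two that share their dominant terms.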

When it comes to removing invalid links, content similarity analysis of web documents differs from that of ordinary documents, because web documents carry link information that can be exploited. In practice, the entry text of a link and the title of the corresponding web page can both be used for content similarity analysis. These texts are much shorter than the page body and, to a large extent, are concise summaries of it.

On this basis, the invention makes its judgment progressively, with three layers of analysis. If the texts examined at the current layer are topically related, the pages are considered content-relevant and no further calculation is performed; if they are found unrelated, the analysis continues with the next layer. Only when all three layers give an unrelated result are the pages considered content-irrelevant, and the link is removed from the list.

Layer 1: content correlation analysis based on the entry text (anchor text)

The easiest thing to obtain for a link in a web page is its entry text. The entry text of a link is a summary of the target page written by the creator of another page, so using the entry text in place of the target page's original content is worthwhile. Because link entry text is generally short, the computational cost of the text similarity is also low.

However, the entry text contains only a few words, so the vectors built from it differ in dimension and the cosine of the angle between two vectors cannot be computed; only the first calculation method can be used. In one calculation example, the inventors take only the ten words with the highest TF-IDF values in the parent page as the feature vector of the parent page, use the first method to obtain the text similarity between the page the link points to and the parent page, and compare it with a threshold. If the similarity is above the threshold, the two pages are considered content-relevant; if it is below the threshold, the calculation proceeds to the next layer.

Layer 2: content correlation analysis based on the title of the web page

Page creators generally give their page a short title, marked with the title tag in the page's HTML document. The title is usually a description of the page and typically contains some important keywords from the body, so it too can stand in for the body in content similarity analysis. However, the title is added by the page creator and reflects the creator's own judgment. Some page creators stuff the title with many irrelevant words to raise their search engine ranking and increase page views. For this reason the title analysis is placed after the entry-text analysis.

The relevance of the title is computed in the same way as for the link entry text.

Layer 3: content correlation analysis based on the body content of the web page

Because the body text of a web page is large, this layer of analysis is performed only when the first two layers both give a negative result. At this layer the feature vectors of the two body texts can be given the same dimension (where a term is missing from one vector, it can be added with its TF-IDF value set to 0), so the second calculation method can be used to obtain the content similarity between the two pages.

If all three layers find the two (or more) pages topically unrelated, the link is deleted from the link list of the parent page.
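The three-layer cascade can be sketched in Python as below. The thresholds `th_short` and `th_body` are placeholder values chosen for illustration, since no threshold values are given here; the function names and the dictionary layout of a link (`anchor`, `title`, `body`) are likewise assumptions. Layers 1 and 2 compare the parent page's ten highest-weighted terms with the entry-text or title terms using the overlap measure of formula (4); layer 3 applies the cosine measure of formula (5) to the two body vectors.

```python
import math


def overlap(terms_a, terms_b):
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def top_terms(vec, k=10):
    """The k terms with the highest TF-IDF weight in a {term: weight} vector."""
    return set(sorted(vec, key=vec.get, reverse=True)[:k])


def relevant_layer(parent_vec, link, th_short=0.1, th_body=0.2):
    """Return the layer (1, 2 or 3) that finds the link relevant, else None.

    th_short and th_body are illustrative placeholder thresholds.
    """
    parent_top = top_terms(parent_vec)
    if overlap(parent_top, set(link["anchor"])) > th_short:   # layer 1: entry text
        return 1
    if overlap(parent_top, set(link["title"])) > th_short:    # layer 2: title
        return 2
    if cosine(parent_vec, link["body"]) > th_body:             # layer 3: body text
        return 3
    return None


def filter_links(parent_vec, links):
    """Keep only links that at least one layer judges relevant."""
    return [l for l in links if relevant_layer(parent_vec, l) is not None]


parent_vec = {"flood": 2.0, "rain": 1.5, "river": 1.0}
links = [
    {"url": "/a", "anchor": ["flood", "relief"], "title": ["aid"], "body": {}},
    {"url": "/b", "anchor": ["click", "here"],
     "title": ["river", "levels", "today"], "body": {}},
    {"url": "/c", "anchor": ["more"], "title": ["home"],
     "body": {"rain": 1.0, "storm": 1.0}},
    {"url": "/d", "anchor": ["cheap", "phones"], "title": ["shop"],
     "body": {"phone": 2.0}},
]
print([l["url"] for l in filter_links(parent_vec, links)])
```

The ordering keeps the expensive body comparison for last: in the toy data above only `/c` reaches layer 3, and `/d` is removed because all three layers find it unrelated.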

A series of experimental data is presented below to show the effect of the content-correlation-based invalid link filtering method proposed by the invention, and a comparison of page PageRank values before and after link removal is used to show the improvement it brings to link analysis algorithms.

The experimental data set used in the invention was drawn at random from CWT200g. Starting from the 627,036 hosts providing Web services within China that were discovered by a web crawl in November 2005, the inventors eliminated duplicate sites and spam sites to obtain 88,303 sites. These sites were crawled to a depth of 3 per site, with no limit on the amount of data collected from a single site, which yielded the initial data set. The pages were then deduplicated to obtain a collection of distinct web pages. Sampling according to the site sizes reflected in this collection finally produced the CWT200g test set, with a size of 197 GB.

From CWT200g, 1,524,077 web pages from 1,421 websites were drawn at random. Incompletely saved pages were removed, and only pages of type html, htm, xml, jsp, or asp were kept. This left 1,427,001 web pages, which form the data set for the subsequent experiments.

Table 1 gives some statistics for the experimental data set. The out-degree of a page is the number of links in the page; internal links are the links in pages that point to pages within the experimental data set; and the number of pages with an adjusted out-degree of 0 is the number of pages that newly have an out-degree of 0 once the external links are removed from the data set.

参考变量 reference variable 数量(单位：个) Quantity (unit: piece) 网站个数 Number of websites 1421 1421 网页个数 Number of pages 1427001 1427001 网站平均网页数 Average number of pages on the site 1004.2 1004.2 链接数目 number of links 80457759 80457759 网页平均链接数 average number of links per page 56.38 56.38 出度为0的网页数 The number of pages with an out-degree of 0 65256 65256 内部链接 internal link 27312578 27312578 内部链接比例 Internal Link Ratio 33.95％ 33.95% 调整的出度为0的网 A net with an adjusted outdegree of 0 220122 220122 页数 Number of pages

Table 1
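The quantities in Table 1 can be illustrated with a minimal Python sketch. It assumes the link graph is held as a dictionary mapping each page in the data set to the list of link targets found in it. The function name `link_statistics`, the dictionary keys, and the toy graph are hypothetical and are not part of the original disclosure. The sketch counts every page whose out-degree is 0 once external links are dropped, which is one reading of the adjusted figure.

```python
from typing import Dict, List


def link_statistics(links: Dict[str, List[str]]) -> Dict[str, float]:
    """Compute Table 1 style statistics for a link graph.

    `links` maps each page in the data set to the targets of its links.
    A target that is not itself a key of `links` counts as external.
    """
    pages = set(links)
    total_links = sum(len(targets) for targets in links.values())
    # Keep only the links whose target lies inside the data set.
    internal = {p: [t for t in ts if t in pages] for p, ts in links.items()}
    internal_links = sum(len(ts) for ts in internal.values())
    return {
        "pages": len(pages),
        "links": total_links,
        "avg_links_per_page": total_links / len(pages) if pages else 0.0,
        "zero_out_degree": sum(1 for ts in links.values() if not ts),
        "internal_links": internal_links,
        "internal_ratio": internal_links / total_links if total_links else 0.0,
        # Pages with no outgoing link left once external links are removed.
        "adjusted_zero_out_degree": sum(1 for ts in internal.values() if not ts),
    }


# Toy graph (hypothetical): x1, x2 and x3 are pages outside the data set.
toy = {
    "a": ["b", "x1", "x2"],
    "b": ["a"],
    "c": ["x3"],
    "d": [],
}
stats = link_statistics(toy)
print(stats)
```

In the toy graph, page c has only an external link, so it joins page d as a page with out-degree 0 after adjustment. The same effect explains why the adjusted count in Table 1 is larger than the unadjusted one.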

The present invention uses the two-step operation described above to remove invalid links. Because the two steps handle different types of invalid links, the proportions of links they remove also differ considerably.

After the two filtering steps were carried out, the proportion of links removed by each step was counted for each type of web page. Figure 3 shows the results after the second filtering step. As Figure 3 shows, the first step removes a high proportion of links, which is consistent with the preceding analysis. Web pages in the com category contain a large number of advertising links, so the proportion of links removed by this step is highest for them. Education and government web pages would not normally be expected to contain many advertising links, yet Figure 3 shows that the removal proportion for these two categories is also high. Inspection of some pages in the data set shows that, in these two categories, most links are concentrated at the two sides of the page, and these are mainly preferential links such as "popular" or "latest". Judged by topical relevance, such links are also invalid links, so from the perspective of link analysis they should likewise be filtered out.

As Figure 3 also shows, after the two filtering steps the proportion of remaining valid links is essentially consistent with the proportion of valid links obtained by manual evaluation. In fact, the filtering conditions set in this embodiment are relatively loose, so the proportion of valid links obtained is slightly higher than that obtained by manual evaluation.

A search engine is essentially a recommendation system, and it should recommend as many high-quality websites as possible to users. A ranking algorithm should therefore avoid having a large number of top-ranked pages come from the same website, since this would seriously reduce the diversity of the recommendations. From this point of view, the improvement that the invalid link filtering method provided by the present invention brings to PageRank is analyzed further below.

Figure 4 compares the sites hosting the top 100 pages under the three rankings. Here, baseline denotes the ranking obtained without any processing, first denotes the ranking after only the first filtering step, and second denotes the ranking after both filtering steps. As Figure 4 shows, the top 100 PageRank pages after the first step come from 46 websites, whereas after the second filtering step they come from 69 websites. The number of sites represented in the top 100 thus increases substantially after the second filtering step, which indicates that this step uncovers more valuable websites and offers users more valuable choices.
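The diversity measure used in this comparison, the number of distinct sites among the top-ranked pages, can be sketched in Python as follows. The sketch assumes that each page is identified by its URL and that the URL's host name stands in for the site. The function name `distinct_sites_in_top`, the scores, and the example URLs are hypothetical and do not come from the experiments.

```python
from typing import Dict
from urllib.parse import urlsplit


def distinct_sites_in_top(scores: Dict[str, float], k: int = 100) -> int:
    """Count the distinct sites among the k highest-scoring pages.

    `scores` maps a page URL to its ranking score (for example PageRank).
    The host name of the URL is used as the site identifier (an assumption).
    """
    top_pages = sorted(scores, key=scores.get, reverse=True)[:k]
    return len({urlsplit(url).hostname for url in top_pages})


# Hypothetical scores before and after filtering, compared at k = 3.
baseline = {
    "http://a.example/1": 0.4,
    "http://a.example/2": 0.3,
    "http://a.example/3": 0.2,
    "http://b.example/1": 0.1,
}
filtered = {
    "http://a.example/1": 0.4,
    "http://b.example/1": 0.3,
    "http://c.example/1": 0.2,
    "http://a.example/2": 0.1,
}
sites_before = distinct_sites_in_top(baseline, k=3)
sites_after = distinct_sites_in_top(filtered, k=3)
print(sites_before, sites_after)
```

A higher count for the same k means that the top of the ranking is spread over more websites, which is the sense in which the figures 46 and 69 are compared above.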

The method for filtering invalid Web links based on content relevance provided by the present invention has been described in detail above. For those of ordinary skill in the art, any obvious modification made to it without departing from the essential spirit of the present invention will constitute an infringement of the patent right of the present invention and will incur the corresponding legal liability.

Claims

1. A Web invalid link filtering method based on content relevance, characterized by comprising the following steps:

(1) Use the text position information in the webpage to remove irrelevant advertising links and navigation links in the webpage through statistical methods;

(2) Carry out correlation analysis on the content of the web page and the content of the web page pointed to by the link, and remove invalid links with irrelevant content.

2. The Web invalid link filtering method as claimed in claim 1, characterized in that:

In the step (1), the HTML document is first converted into a DOM tree structure, and then the smallest subtree containing the main content and links related to the topic is searched in the DOM tree structure to obtain the required link information.

3. The Web invalid link filtering method as claimed in claim 2, characterized in that:

The CyberNeko HTML Parser is used to convert the HTML document into the DOM tree structure.

4. The Web invalid link filtering method as claimed in claim 2, characterized in that:

For the DOM tree structure, block nodes are first used to divide the DOM tree into subtrees, the link ratio in each subtree is calculated, and the ratio is compared with a predetermined threshold; if the ratio is less than the threshold, the block is set as the main body block; the method then backtracks to find the nearest parent block node that contains the block, takes that parent block node as the target node, and outputs the links in the parent block node as the basis for subsequent analysis.

5. The Web invalid link filtering method as claimed in claim 4, characterized in that:

When selecting block nodes, table (div) and tr nodes are preferred as block nodes.

6. The Web invalid link filtering method as claimed in claim 1, characterized in that:

In the step (2), before the content correlation analysis of the web pages is performed, the text of the web pages is preprocessed, and the content representing each text is extracted for comparison.

7. The Web invalid link filtering method as claimed in claim 6, characterized in that:

The process of text preprocessing includes the following steps:

First perform word segmentation on the text, then count the frequency of the words in the text, calculate the TF-IDF vector, and form a vector space model corresponding to the text set; use the feature vectors of the texts to calculate the content similarity between the texts, and use the content similarity to remove irrelevant links from the web pages.

8. The Web invalid link filtering method as claimed in claim 7, characterized in that:

The content similarity is determined by the degree of overlap of terms contained in the feature vectors of each text.

9. The Web invalid link filtering method as claimed in claim 7, characterized in that:

The content similarity is determined by the cosine of the angle between the feature vectors of the texts.

10. The Web invalid link filtering method as claimed in claim 1, characterized in that:

In the step (2), the content correlation analysis comprises three layers of operations: the first layer performs content correlation analysis according to the entry text; the second layer performs content correlation analysis according to the title of the web page; the third layer performs content correlation analysis according to the content of the web page body; if these three layers yield the conclusion that the topics of the web pages are unrelated, the link is deleted from the link list of the parent web page.
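The text preprocessing of claim 7 and the cosine measure of claim 9 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: whitespace splitting stands in for word segmentation, the smoothed IDF formula and the threshold of 0.1 are assumed values, and the names `tfidf_vectors`, `cosine`, and `filter_links` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    # Whitespace splitting is a stand-in for real word segmentation.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumption) keeps every weight positive.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_links(source_text: str, targets: Dict[str, str],
                 threshold: float = 0.1) -> List[str]:
    """Keep the links whose target text is similar enough to the source."""
    urls = list(targets)
    vectors = tfidf_vectors([source_text] + [targets[u] for u in urls])
    source_vec = vectors[0]
    return [url for url, vec in zip(urls, vectors[1:])
            if cosine(source_vec, vec) >= threshold]


# Hypothetical page texts; a real system would use segmented page content.
kept = filter_links(
    "link analysis improves search engine ranking",
    {
        "/pagerank": "pagerank is a link analysis ranking algorithm",
        "/ads": "cheap flights and hotel deals",
    },
)
print(kept)
```

The link to the advertising page shares no terms with the source page, so its cosine similarity is 0 and it is dropped, while the link to the topically related page is kept.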