CN102722499B - Search engine and implementation method thereof - Google Patents
- Publication number
- CN102722499B (CN201110079699.1A)
- Authority
- CN
- China
- Prior art keywords
- query
- original
- synonymous
- synonym
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a search engine configured to: receive an original query submitted by a user; analyze the original query to obtain an original word present in it and a synonym of that word, and replace the original word with the synonym to obtain a synonymous query; search with the original query and the synonymous query to obtain an original-query result page set and a synonymous-query result page set; calculate the degree of coincidence (overlap) between the web pages in the two result sets; and merge the two result sets according to a predetermined merging strategy corresponding to that degree of coincidence, generating a search result list. By evaluating the coincidence between the original and synonymous query results, the search engine estimates the probability that the synonymous query results have drifted in meaning. When that probability is high, it suppresses (demotes) the synonymous query results so that results that do not match the user's search intent do not appear near the top of the search result list, thereby ensuring a good user experience.
Description
Technical Field
The present invention relates to search engine technology, and in particular to a search engine that can expand a search query with synonyms and to a method for implementing it.
Background Art
The rapid development of the Internet has given people an entirely new medium for storing, processing, transmitting, and using information, and online information has quickly become one of the main channels through which people acquire knowledge. Information resources on this scale encompass almost all human knowledge, but they also confront users with the problem of how to exploit them fully. Search engines arose to meet this need: they help users find information on the Internet. Specifically, a search engine collects information from the Internet using specific computer programs according to certain strategies, organizes and processes that information, and then provides a search service that presents users with the information relevant to their searches.
The online search service provided by a search engine is usually keyword-based: the user enters a query expression in the search box, and the search engine runs the query and returns result pages containing those keywords. Because users differ in background knowledge and habits, they may use different keywords to search for the same thing. Natural language also contains many synonyms and near-synonyms, so searching only on the keywords the user supplies is not enough. Many search engines therefore offer query expansion, such as synonym expansion. After receiving the user's original query expression, the search engine segments it into words and checks whether the resulting set of terms contains potential synonym pairs. Specifically, it matches the segmented terms against a predetermined thesaurus to determine whether any term has a synonym. If so, it expands the query on the basis of the synonym, merges the expanded query results with the original query results, and returns them to the user. The user is thereby given more relevant search results.
However, the same word can carry different meanings in different semantic contexts, so its synonyms are synonymous or near-synonymous only in certain contexts and do not apply in others. In such cases the results obtained by synonym expansion may not be what the user wants and instead degrade the user experience. For example, suppose the user's original query is "鱼香肉丝怎么做" ("how to make fish-flavored shredded pork"). By segmenting the query and matching it against the thesaurus, the search engine obtains the potential synonym pair {"怎么做" ("how to make"), "菜谱" ("recipe")} and replaces "how to make" with "recipe" to run the expanded synonymous query and obtain the corresponding results. But if the original query is "怎么做床头柜" ("how to make a bedside table"), the user clearly wants to learn about building furniture. If the search engine still replaces "how to make" with "recipe", it returns results whose meaning has drifted away from what the user wants, and the user will doubt the accuracy of the search.
In view of this, existing search engines need to be improved to solve the above problem.
Summary of the Invention
An object of the present invention is to provide a search engine that estimates the probability that synonym-expanded query results have drifted in meaning and adjusts the ranking of the synonymous query results within the overall search results accordingly, so that drifted results do not appear near the top of the search results and the user has a good experience.
A further object of the present invention is to provide a method for implementing the above search engine.
To achieve one of the above objects, a method for implementing a search engine according to the present invention comprises the following steps:
receiving an original query submitted by a user;
analyzing the original query to obtain an original word present in the original query and a synonym of that original word, and replacing the original word in the original query with the synonym to obtain a synonymous query;
searching according to the original query and the synonymous query to obtain an original-query result page set and a synonymous-query result page set;
calculating the degree of coincidence between the web pages in the original query results and those in the synonymous query results;
merging the result page sets of the original query and the synonymous query according to a predetermined merging strategy corresponding to the degree of coincidence, and generating a search result list.
As a further improvement of the present invention, calculating the degree of coincidence includes calculating the number of web pages common to the original query results and the synonymous query results, |U1∩U2|.
As a further improvement of the present invention, calculating the degree of coincidence further includes determining the smaller of the number of web pages in the original query results and the number in the synonymous query results, Min(|U1|, |U2|); the degree of coincidence is I(U1, U2) = |U1∩U2| / Min(|U1|, |U2|).
As a further improvement of the present invention, calculating the degree of coincidence further includes calculating the total number of web pages in the original query results and the synonymous query results taken together, |U1∪U2| (the size of their union); the degree of coincidence is I(U1, U2) = |U1∩U2| / |U1∪U2|.
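The two coincidence formulas above can be sketched in a few lines of Python. This is only an illustration: the function names, the use of URL strings to identify pages, and the choice to return 0.0 when a result set is empty are assumptions, not details given in the patent.

```python
def coincidence_min(u1, u2):
    """I(U1, U2) = |U1 ∩ U2| / Min(|U1|, |U2|)."""
    smaller = min(len(u1), len(u2))
    if smaller == 0:
        return 0.0  # assumption: an empty result set counts as no overlap
    return len(u1 & u2) / smaller


def coincidence_union(u1, u2):
    """I(U1, U2) = |U1 ∩ U2| / |U1 ∪ U2|."""
    union = len(u1 | u2)
    if union == 0:
        return 0.0  # assumption: two empty result sets count as no overlap
    return len(u1 & u2) / union


# Hypothetical result sets, identified by URL.
original_pages = {"a.example", "b.example", "c.example", "d.example"}
synonym_pages = {"c.example", "d.example", "e.example"}
print(coincidence_min(original_pages, synonym_pages))    # 2 shared pages / 3
print(coincidence_union(original_pages, synonym_pages))  # 2 shared pages / 5
```

The first variant normalizes by the smaller result set, so a small synonymous result set that is fully contained in the original results still scores 1.0; the second variant penalizes any page that appears in only one of the sets.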
As a further improvement of the present invention, the merging strategy includes: when the value of the degree of coincidence lies below a predetermined threshold within a predetermined coincidence interval, the predetermined merging strategy is to suppress the synonymous query results when merging them, where the suppression includes:
lowering the relevance weights of the web pages in the synonymous query results; or
inserting the synonymous query results after a specific page of the search result list; or
moving the synonymous query results behind the original query results.
As a further improvement of the present invention, the merging strategy includes: when the value of the degree of coincidence lies above a predetermined threshold within a predetermined coincidence interval, merging the results of the original and synonymous queries according to the relevance weights of the web pages in the original query results and the synonymous query results.
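As a rough sketch of the merging strategy, the Python function below applies the first suppression option (lowering relevance weights) when the degree of coincidence is below a threshold, and otherwise merges purely by relevance weight. The threshold value 0.3, the penalty factor 0.5, and the rule of keeping the higher weight for a page found by both queries are illustrative assumptions; the patent only states that the threshold is predetermined.

```python
def merge_results(original, synonym, coincidence, threshold=0.3, penalty=0.5):
    """Merge two lists of (url, relevance_weight) into one ranked list.

    threshold and penalty are hypothetical values chosen for illustration.
    """
    if coincidence < threshold:
        # Low overlap suggests the synonym changed the meaning of the query,
        # so the synonymous results are demoted before merging.
        synonym = [(url, weight * penalty) for url, weight in synonym]

    best = {}
    for url, weight in list(original) + list(synonym):
        # Assumption: a page found by both queries keeps its higher weight.
        if url not in best or weight > best[url]:
            best[url] = weight
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


original = [("a.example", 0.9), ("b.example", 0.5)]
synonym = [("c.example", 0.8)]
print(merge_results(original, synonym, coincidence=0.1))  # c.example is demoted
print(merge_results(original, synonym, coincidence=0.9))  # merged by weight only
```

The other two suppression options in the text (placing synonymous results after a given result page, or after all original results) would replace the weight scaling with a positional rule.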
To achieve the other object above, a search engine according to the present invention includes a search component, and the search component includes:
a query analysis module, configured to receive an original query submitted by a user, analyze the original query to obtain an original word present in the original query and a synonym of that original word, and replace the original word in the original query with the synonym to obtain a synonymous query;
a search module, configured to search according to the original query and the synonymous query and obtain an original-query result page set and a synonymous-query result page set;
a coincidence calculation and result merging module, configured to calculate the degree of coincidence between the web pages in the original query results and those in the synonymous query results, to merge the result page sets of the original query and the synonymous query according to a predetermined merging strategy corresponding to the degree of coincidence, and to generate a search result list.
As a further improvement of the present invention, calculating the degree of coincidence includes calculating the number of web pages common to the original query results and the synonymous query results, |U1∩U2|.
As a further improvement of the present invention, calculating the degree of coincidence further includes determining the smaller of the number of web pages in the original query results and the number in the synonymous query results, Min(|U1|, |U2|); the degree of coincidence is I(U1, U2) = |U1∩U2| / Min(|U1|, |U2|).
As a further improvement of the present invention, calculating the degree of coincidence further includes calculating the total number of web pages in the original query results and the synonymous query results taken together, |U1∪U2| (the size of their union); the degree of coincidence is I(U1, U2) = |U1∩U2| / |U1∪U2|.
As a further improvement of the present invention, the merging strategy includes: when the value of the degree of coincidence lies below a predetermined threshold within a predetermined coincidence interval, the predetermined merging strategy is to suppress the synonymous query results when merging them, where the suppression includes:
lowering the relevance weights of the web pages in the synonymous query results; or
inserting the synonymous query results after a specific page of the search result list; or
moving the synonymous query results behind the original query results.
As a further improvement of the present invention, the merging strategy includes: when the value of the degree of coincidence lies above a predetermined threshold within a predetermined coincidence interval, merging the results of the original and synonymous queries according to the relevance weights of the web pages in the original query results and the synonymous query results.
Compared with the prior art, the beneficial effect of the present invention is as follows: by evaluating the degree of coincidence between the original query results and the synonymous query results, the search engine estimates the probability that the synonymous query results have drifted in meaning, and when that probability is high it suppresses the synonymous query results so that results that do not match the user's search intent do not appear near the top of the search result list, thereby ensuring a good user experience.
Brief Description of the Drawings
Fig. 1 is a block diagram of the working principle of a first embodiment of the search engine of the present invention;
Fig. 2 is a flowchart of how the search engine shown in Fig. 1 mines synonymous contexts;
Fig. 3 is a flowchart of how the search engine shown in Fig. 1 performs a synonym-expanded query;
Fig. 4 is a block diagram of the working principle of a second embodiment of the search engine of the present invention;
Fig. 5 is a flowchart of how the search engine shown in Fig. 4 performs a synonym-expanded query;
Fig. 6 is a block diagram of the working principle of a third embodiment of the search engine of the present invention;
Fig. 7 is a flowchart of how the search engine shown in Fig. 6 performs a synonym-expanded query;
Fig. 8 is a block diagram of the working principle of a fourth embodiment of the search engine of the present invention;
Fig. 9 is a flowchart of how the search engine shown in Fig. 8 performs a synonym-expanded query;
Fig. 10 is a flowchart, for one specific embodiment, of how the search engine shown in Fig. 8 determines the similarity level of synonyms and labels the synonyms accordingly.
Detailed Description
The present invention is described in detail below with reference to the embodiments shown in the drawings. These embodiments do not limit the present invention; structural, methodological, or functional changes made by those of ordinary skill in the art on the basis of these embodiments all fall within the scope of protection of the present invention.
Fig. 1 is a block diagram of the working principle of the first embodiment of the search engine 100 of the present invention. In this embodiment, the search engine 100 collects web pages from the Internet according to certain strategies and, after organizing and processing them, provides a search service in response to requests from the browser 21 of the client 20. The search engine 100 may include one or more network server entities that store and manage data and respond to search requests. The client 20 may include one or more user terminal devices, such as personal computers, notebook computers, wireless telephones, personal digital assistants (PDAs), or other computing and communication devices.
Architecturally, these servers and terminal devices all contain some basic components, such as a bus, a processing device, a storage device, one or more input/output devices, and a communication interface. The bus may include one or more conductors for communication among the components of the server or terminal device. The processing device includes processors or microprocessors of various types for executing instructions and handling processes or threads. The storage device may include dynamic memory such as random access memory (RAM) for storing dynamic information, static memory such as read-only memory (ROM) for storing static information, and mass storage comprising magnetic or optical recording media and their drives. Input devices, such as a keyboard, mouse, stylus, voice recognition device, or biometric device, let the user enter information into the server or terminal device. Output devices include displays, printers, speakers, and the like for outputting information. The communication interface enables the server or terminal device to communicate with other systems or devices. Communication interfaces can be connected to a network by wired, wireless, or optical connections, so that the search engine 100 and the client 20 can communicate with each other over the network. The network may include a local area network (LAN), a wide area network (WAN), a telephone network such as the public switched telephone network (PSTN), an intranet, the Internet, or a combination of these. Both the servers and the terminal devices run operating system software that manages system resources and controls the execution of other programs, as well as application software or program instructions that implement the functions of specific modules.
As shown in Fig. 1, the search engine 100 can perform synonym-expanded queries and is divided overall into an offline part and an online part. In the offline part, the search engine 100 includes a data repository 12 that stores web page data and synonym pair information, an indexer 13, a web crawler 14, a user query log database 16 that records user query information, and a log analyzer 17 that analyzes the user query logs.
The web crawler 14 is a program that fetches web pages one by one by following the hyperlinks between them according to a certain strategy. In a specific embodiment, the web crawler 14 selects the URL (Uniform Resource Locator) to fetch from an initial URL library according to a scheduling policy, resolves the web server address indicated in the URL, then establishes a connection, sends a request, and receives data. It stores the retrieved page data in the web page library 122 of the data repository 12 to build a local document collection, and then extracts links from those pages for the next round of fetching, repeating the cycle until all URLs have been fetched. The scheduling policy used to select URLs may be breadth-first, depth-first, backlink-count based, and so on; crawling may be cumulative or incremental. The indexer 13 analyzes the local document collection and builds an index. For example, it extracts terms from the full text of each document by word segmentation, filters out high-frequency or low-frequency words to obtain a set of index terms, and finally converts the mapping from web pages to index terms into a mapping from index terms to web pages, forming an inverted file that contains an index term list and posting lists and is stored in the index library 121 of the data repository 12. Methods for segmenting web documents include dictionary-based, understanding-based, and statistics-based word segmentation. The more common dictionary-based methods include forward maximum matching, reverse maximum matching, and minimum segmentation.
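The indexing step described above (converting a page-to-terms mapping into a terms-to-pages mapping) can be illustrated with a minimal Python sketch. It assumes the documents have already been segmented into terms; the function name, the integer document ids, and the stop-word set used to stand in for high-frequency word filtering are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs, stop_words=frozenset()):
    """Turn {doc_id: [terms]} into {term: sorted list of doc_ids}.

    stop_words stands in for the filtering of overly frequent words;
    how the patent's indexer selects them is not specified.
    """
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            if term not in stop_words:
                postings[term].add(doc_id)
    # Each posting list is sorted so that lists can be compared or merged easily.
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: ["fish-flavored pork", "recipe", "the"],
    2: ["bedside table", "recipe"],
}
index = build_inverted_index(docs, stop_words={"the"})
print(index["recipe"])  # posting list for the term "recipe"
```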
In the present invention, synonyms are terms that differ in form but express the same or similar meaning; that is, when several terms express the same or similar meaning, they are synonyms of one another. In this embodiment, the thesaurus 123 includes a synonym correspondence table 1231 and a synonymous context library 1232. The synonym correspondence table 1231 specifies in advance the correspondence between words and their synonyms, for example a table of original words and their synonyms obtained from prior statistics. The table can also be updated continuously by analyzing users' historical query and click data. For example, when the titles of clicked result pages contain a synonym of some original word but not the original word itself, and this happens frequently, the original word and the synonym are identified as a synonym pair and added to the synonym correspondence table 1231.
Fig. 2 shows the workflow of one specific embodiment in which the search engine 100 mines the synonymous contexts of synonym pairs. In the present invention, a synonymous context is the semantic environment in which the original word of a synonym pair appears. It indicates the semantic environment to which the synonym pair applies, that is, the environment in which the synonym is suitable for replacing the original word in a synonym-expanded query. In this embodiment, synonymous contexts are obtained by analyzing user query logs. After each search, the user query log database 16 records the user's query and click data, such as the query expression, the search time, the returned result list, and the result pages that were clicked. Referring to Fig. 2 together with Fig. 1, the log analyzer 17 analyzes the historical user queries and click data in the user query log database 16 (step 411), including the historical query expressions and the result pages that were returned for a given query expression and clicked. Next, the log analyzer 17 determines whether these data contain a synonymous context for some synonym pair and, if so, records it in the synonymous context library 1232.
Specifically, the log analyzer 17 first determines, based on the synonym correspondence table 1231, whether a historical query expression contains an original word and, if so, obtains the synonym pair consisting of that original word and its synonym. For example, for the historical query "鱼香肉丝怎么做" ("how to make fish-flavored shredded pork"), the log analyzer 17 segments the query into the two terms "鱼香肉丝" ("fish-flavored shredded pork") and "怎么做" ("how to make"), matches these terms against the original words in the synonym correspondence table, finds the original word "how to make", and obtains the corresponding synonym pair {"怎么做" ("how to make"), "菜谱" ("recipe")}. The log analyzer 17 then checks whether, for that query, the title of a page the user clicked contains the synonym but not the original word; if so, it records a synonymous context for the synonym pair. For example, if for the query "how to make fish-flavored shredded pork" the user clicked a page titled "鱼香肉丝菜谱" ("fish-flavored shredded pork recipe"), the log analyzer 17 records the synonymous context. The synonymous context contains at least the historical query expression, such as "how to make fish-flavored shredded pork"; it may instead contain the word immediately adjacent to the original word in that query, such as "fish-flavored shredded pork"; or both may be recorded as synonymous contexts of the synonym pair {"how to make", "recipe"}. The adjacent word may precede or follow the original word; it may also be empty, which means the original query contains only the original word and has no adjacent word.
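The log-mining rule just described (record a context when the clicked title contains the synonym but not the original word) can be sketched as follows. This is a simplified illustration using English glosses in place of the Chinese terms; the function name, the tuple layout, and the rule of preferring the preceding adjacent term over the following one are assumptions.

```python
def mine_contexts(query_terms, clicked_title, synonym_table):
    """Return ((original, synonym), full_query, adjacent_term) records.

    query_terms: the segmented historical query.
    synonym_table: hypothetical {original_word: [synonyms]} mapping.
    """
    records = []
    full_query = " ".join(query_terms)
    for i, term in enumerate(query_terms):
        for synonym in synonym_table.get(term, []):
            # Record only when the clicked title uses the synonym
            # and does not contain the original word.
            if synonym in clicked_title and term not in clicked_title:
                if i > 0:
                    adjacent = query_terms[i - 1]
                elif i + 1 < len(query_terms):
                    adjacent = query_terms[i + 1]
                else:
                    adjacent = ""  # the query consists of the original word only
                records.append(((term, synonym), full_query, adjacent))
    return records


table = {"how to make": ["recipe"]}
print(mine_contexts(
    ["fish-flavored shredded pork", "how to make"],
    "fish-flavored shredded pork recipe",
    table,
))
```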
In the above embodiment, synonymous contexts are obtained from historical user behavior, but in other embodiments they can also be determined from the anchor text of web pages. Anchor text is the text contained in a hyperlink to a page. For example, if the hyperlinks that point to the page www.sina.com.cn carry the texts "新浪网首页" ("Sina.com homepage"), "新浪首页" ("Sina homepage"), and "sina首页" ("sina homepage"), these text fragments are recorded as synonymous contexts of the synonym pair {"新浪网", "新浪"}. Synonymous contexts can also be determined from parallel segments in a page title. For example, the page at price.mycar168.com/search.asp?factoryid=135 has the title "华晨宝马报价、华晨宝马价格·深圳汽车大世界网" ("BMW Brilliance quotation, BMW Brilliance price · Shenzhen Auto World"). Using the separators, this title can be split into the parallel segments "BMW Brilliance quotation", "BMW Brilliance price", and "Shenzhen Auto World". The first two segments contain the two members of the synonym pair {"价格" ("price"), "报价" ("quotation")}, so these two segments can also serve as synonymous contexts of that pair.
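A minimal sketch of the title-based variant follows. It splits a title on separator characters and keeps the segments that contain a member of the synonym pair, provided both members occur somewhere in the title. The separator set, the function name, and the requirement that both members be present are illustrative assumptions; the example uses English glosses for readability.

```python
import re


def parallel_contexts(title, pair, separators="、·|,"):
    """Return the title segments that mention either member of a synonym pair.

    separators is a hypothetical set of delimiter characters.
    """
    pattern = "[" + re.escape(separators) + "]"
    segments = [s.strip() for s in re.split(pattern, title) if s.strip()]
    first, second = pair
    with_first = [s for s in segments if first in s]
    with_second = [s for s in segments if second in s]
    if with_first and with_second:
        return with_first + [s for s in with_second if s not in with_first]
    return []  # only one member appears, so the title offers no parallel evidence


title = "BMW Brilliance quotation、BMW Brilliance price·Shenzhen Auto World"
print(parallel_contexts(title, ("price", "quotation")))
```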
Referring to Fig. 2, users' clicks during synonymous context mining are not always meaningful: while browsing search results, a user may inadvertently click an irrelevant result, and a synonymous context recorded from such a click is inaccurate. To eliminate this negative effect, the log analyzer 17 also counts how often each synonymous context is recorded, and keeps a synonymous context for the corresponding synonym pair only when its frequency is greater than or equal to a predetermined frequency threshold. In other words, low-frequency synonymous contexts are filtered out (step 413).
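The frequency filter of step 413 amounts to counting recorded contexts and discarding the rare ones. The sketch below uses `collections.Counter`; the function name and the threshold value of 2 are assumptions for illustration, since the patent only says the threshold is predetermined.

```python
from collections import Counter


def filter_contexts(recorded, min_count=2):
    """Keep only contexts recorded at least min_count times.

    recorded: a list with one (hashable) entry per recorded context.
    """
    counts = Counter(recorded)
    return {context for context, n in counts.items() if n >= min_count}


recorded = [
    ("how to make", "recipe", "fish-flavored shredded pork"),
    ("how to make", "recipe", "fish-flavored shredded pork"),
    ("how to make", "recipe", "bedside table"),  # a single stray click
]
print(filter_contexts(recorded))
```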
As shown in Fig. 1, the online part of the search engine 100 mainly includes a search component 11 and a user interface 15. The user interface 15 is presented through the browser software 21 of the client 20. It lets the user enter a query expression and displays the search result list in a predetermined layout; after the search, it also records the user's query information and stores it in the user query log database 16. The search component 11 responds to search requests from the client 20 and returns search results to the client 20. In this embodiment, the search component 11 includes a search module 111, a query analysis module 112, and a result synthesis module 113. For an ordinary original query (without query expansion), the query analysis module 112 segments the received original query into words, obtains the set of query terms, and generates a query term list. After receiving the query term list, the search module 111 matches it against the index term list in the index library 121, finds the matching index terms and the posting list of each, and thereby obtains the set of web documents relevant to the query terms. The result synthesis module 113 orders the retrieved web documents according to the predetermined relevance weight between each document and the query terms, and then returns the result list to the client through the user interface 15.
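For an ordinary (unexpanded) query, the lookup-and-rank flow described above might look like the following sketch. The data layout and the scoring rule (summing per-term relevance weights for each document) are assumptions made for illustration; the patent does not specify how relevance weights are combined.

```python
def search(query_terms, inverted_index, weights):
    """Look up each query term and rank the matching documents.

    inverted_index: {term: [doc_ids]}
    weights: hypothetical {(term, doc_id): relevance_weight} table
    """
    scores = {}
    for term in query_terms:
        for doc_id in inverted_index.get(term, []):
            # Assumption: a document's score is the sum of its per-term weights.
            scores[doc_id] = scores.get(doc_id, 0.0) + weights.get((term, doc_id), 0.0)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


index = {"nokia": [1, 2], "price": [2]}
weights = {("nokia", 1): 0.5, ("nokia", 2): 0.3, ("price", 2): 0.4}
print(search(["nokia", "price"], index, weights))  # document 2 matches both terms
```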
The detailed steps by which the search engine 100 performs synonym-expansion queries online according to the synonymous context are described below with reference to the workflow of FIG. 3. The query analysis module 112 receives the current user's original query through the user interface 15 (step 421) and then analyzes it (step 422), which includes segmenting the original query into words. The segmentation method in this embodiment is dictionary-based forward maximum matching, and the dictionary is built from the term fragments contained in the synonymous contexts. As noted earlier, a historical query is itself recorded as a synonymous context, and such a fragment is longer than any of the individual terms into which that query would be split. Forward maximum matching therefore guarantees that whenever the current original query contains a historical-query fragment, that fragment is segmented out first, which improves the accuracy of the subsequent calculations.
For example, suppose the historical query in the mining stage was "今天诺基亚多少钱" ("how much is Nokia today"). When the synonymous context of the synonym pair {"多少钱" ("how much"), "价格" ("price")} is recorded, both the full historical query "今天诺基亚多少钱" and the adjacent word "诺基亚" ("Nokia") are stored as synonymous contexts. Now let the current original query be "谁知道今天诺基亚多少钱" ("who knows how much Nokia is today"). The longest fragment in the synonymous-context dictionary, "今天诺基亚多少钱", is 8 characters long, so the query analysis module 112 scans the original query from left to right, testing whether an 8-character span appears in the dictionary. When "今天诺基亚多少钱" matches, it is segmented out as a unit, so "诺基亚" is not split off as a separate keyword.
Also in step 422, the query analysis module 112 matches the set of terms obtained by segmentation against the thesaurus 123 to obtain a potential synonym pair and the synonymous context of that pair. The potential synonym pair consists of the original word that occurs in the original query and the synonym corresponding to it.
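Forward maximum matching as used in the example can be sketched in a few lines. This is an illustrative implementation rather than the patent's own code: the function name, the single-character fallback for spans not in the dictionary, and the small dictionary below are assumptions.

```python
def forward_max_match(text, dictionary):
    """Segment text left to right, always taking the longest dictionary entry
    that starts at the current position; fall back to a single character."""
    max_len = max(map(len, dictionary), default=1)
    tokens, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                pos += size
                break
    return tokens

# Hypothetical dictionary built from synonymous-context fragments
context_dictionary = {"今天诺基亚多少钱", "诺基亚", "多少钱", "知道"}
tokens = forward_max_match("谁知道今天诺基亚多少钱", context_dictionary)
# The 8-character historical query is kept whole, so "诺基亚" is never emitted alone.
```

Because the longest candidate is tried first at every position, a recorded historical query always wins over the shorter terms it contains, which is the property the paragraph relies on.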
Next, the query analysis module 112 determines whether the synonymous context matches the original query (step 423). In this embodiment, the query analysis module 112 computes a matching degree between the synonymous context and the original query. When the matching degree falls within a predetermined interval, the synonymous context is deemed to match the original query, meaning that the semantic environment of the current query is suitable for replacing the original word with its synonym to perform an expanded query. The matching degree can be determined from the length of the original query after the original word is removed and from the length of the synonymous context. In this embodiment, when the original query is longer than the original word (that is, q≠orig), the matching degree M is calculated by the following formula:
where TermCount(q) is the length of the original query, TermCount(orig) is the length of the original word in the original query, and TermCount(pi) is the length of the i-th synonymous context. Because in this case the original query contains words that do not belong to any synonymous context, M takes a value in [0, 1]. A matching-degree threshold θ is set in advance. When M lies in [θ, 1], the synonymous context matches the original query: the original word in the original query is replaced with the synonym to obtain the synonymous query, the search module 111 then searches with both the original query and the synonymous query to obtain the web page set of the original query results and the web page set of the synonymous query results (step 424), and the result synthesis module 113 merges the results of the two queries according to a predetermined merging strategy (step 425). The merging strategy is described in detail later. When M lies in [0, θ), the synonymous context does not match the original query; that is, in this semantic environment the synonym is not a suitable replacement for the original word. The search module 111 then searches with the original query only and obtains the web page set of the original query results (step 426), after which the result synthesis module 113 produces the search result list according to the predetermined relevance weight between each web page and the original query (step 425). When the original query consists of the original word alone (that is, q=orig), the matching degree is M=1; the original word is replaced directly with the synonym, and steps 424 and 425 are then executed.
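The threshold decision of step 423 can be sketched as below. The exact expression for M is not assumed here to be the patent's: purely for illustration, M is taken to be the fraction of query terms other than the original word that also occur in the synonymous context, which stays in [0, 1] and equals 1 when q=orig. The function names and sample terms are hypothetical.

```python
def matching_degree(query_terms, original_word, context_terms):
    """Assumed form of M: share of the query terms other than the original
    word that also occur in the synonymous context."""
    rest = [t for t in query_terms if t != original_word]
    if not rest:  # q == orig, for which the text defines M = 1
        return 1.0
    covered = sum(1 for t in rest if t in context_terms)
    return covered / len(rest)

def should_expand(m, theta):
    """Perform synonym replacement only when M lies in [theta, 1]."""
    return m >= theta

# Hypothetical segmented query and recorded context
m = matching_degree(["谁", "知道", "诺基亚", "多少钱"], "多少钱", {"诺基亚"})
# One of the three remaining terms is covered, so m is 1/3.
```

With θ = 0.3 this query would be expanded, and with θ = 0.5 only the original query would be run (step 426).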
By analyzing the semantic environment of the current user's query, the search engine determines whether synonym substitution is appropriate before performing a synonym-expansion query. This keeps the expanded query accurate and as close as possible to what the user is looking for, which in turn gives the user a good experience.
FIGS. 4 and 5 illustrate a second embodiment of the search engine of the present invention. Compared with the first embodiment, the search engine 200 of this embodiment mainly estimates the probability that the synonymous query results have drifted in meaning (the "escape" probability) and uses it to adjust the position of the synonymous query results in the search result list finally presented to the user. As shown in FIG. 4, the search engine 200 includes a search component 11, a data repository 12, an indexer 13, a crawler 14, and a user interface 15. The data repository 12, indexer 13, crawler 14, and user interface 15 are substantially the same as in the embodiment described above and are not described again here. In this embodiment, the search component 11 includes a search module 111, a query analysis module 112, and a coincidence degree calculation and result merging module 114.
The way the search engine of this embodiment performs a synonym-expansion query is described in detail below with reference to FIG. 5. First, the query analysis module 112 receives the user's original query (step 431). It then analyzes the query (step 432): it segments the original query to obtain the set of query terms, identifies the original word in the original query using the thesaurus 123 to obtain a synonym pair consisting of the original word and its synonym, and directly replaces the original word with the synonym to obtain the synonymous query. The search module 111 searches with the original query and the synonymous query to obtain the web page set of the original query results and the web page set of the synonymous query results (step 433). Next, the coincidence degree calculation and result merging module 114 calculates the coincidence degree (overlap) between the web pages in the original query results and those in the synonymous query results (step 434). The coincidence degree reflects how many web pages the two result sets have in common. If they share enough pages, the synonymous query results are close to the original query results and are unlikely to have drifted in meaning. Otherwise, the synonymous query results are more likely to have drifted, and they need to be demoted (suppressed) so that results that do not meet the user's search need do not appear near the top of the result list.
The coincidence degree can be calculated in several ways. One is simply to count the web pages common to the original query results and the synonymous query results, |U1∩U2|, that is, the number of identical URLs; another is to count the common pages among the top 100 results of each result set. Either count is then compared with a predetermined threshold. Preferably, the calculation also determines the smaller of the number of pages in the original query results and the number of pages in the synonymous query results, Min(|U1|, |U2|), and the coincidence degree is I(U1, U2)=|U1∩U2|/Min(|U1|, |U2|). Alternatively, in other embodiments, the calculation determines the total number of distinct pages across both result sets (the size of their union), |U1∪U2|, and the coincidence degree is I(U1, U2)=|U1∩U2|/|U1∪U2|.
Once the coincidence degree has been calculated, it is checked against a predetermined interval to decide whether the synonymous query results need to be demoted; the position of the synonymous query results in the search result list is then determined and the merged results are output (step 435). Taking I(U1, U2)=|U1∩U2|/Min(|U1|, |U2|) as an example, I is a floating-point number in [0, 1]. A coincidence-degree threshold σ is set in advance. When I lies in [σ, 1], the original and synonymous query results overlap strongly; the synonymous query results need not be demoted, and the two result sets are simply merged according to the predetermined relevance weight of each web page. When I lies in [0, σ), the overlap is low and the synonymous query results are more likely to have drifted in meaning, so they need to be demoted. Demotion can take several forms: lowering the relevance weights of the pages in the synonymous query results so that they fall lower in the merged list; inserting the synonymous query results only after a particular page of the search result list, for example moving them to the second page; or placing the synonymous query results after all of the original query results, so that they appear at the very end of the list.
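Both ratio forms of the coincidence degree, and the weight-lowering variant of demotion (suppression), can be sketched as follows. The two ratios follow the formulas in the text; everything else is an assumption made for illustration: the function names, the `penalty` factor of 0.5, returning 0 for empty result sets, and keeping the original query's weight for pages found by both queries.

```python
def overlap_min(u1, u2):
    """I(U1, U2) = |U1 ∩ U2| / min(|U1|, |U2|)."""
    u1, u2 = set(u1), set(u2)
    smaller = min(len(u1), len(u2))
    return len(u1 & u2) / smaller if smaller else 0.0

def overlap_union(u1, u2):
    """I(U1, U2) = |U1 ∩ U2| / |U1 ∪ U2|."""
    u1, u2 = set(u1), set(u2)
    union = u1 | u2
    return len(u1 & u2) / len(union) if union else 0.0

def merge_results(original, synonym, sigma, penalty=0.5):
    """original / synonym map URL -> relevance weight. When the coincidence
    degree is below sigma, pages found only by the synonymous query have
    their weight multiplied by penalty before the merged list is sorted."""
    factor = 1.0 if overlap_min(original, synonym) >= sigma else penalty
    merged = dict(original)
    for url, weight in synonym.items():
        if url not in merged:
            merged[url] = weight * factor
    return sorted(merged, key=lambda u: (-merged[u], u))

# Hypothetical result sets sharing one page out of three
original = {"a": 0.9, "b": 0.7, "c": 0.5}
synonym = {"a": 0.9, "x": 0.8, "y": 0.6}
demoted = merge_results(original, synonym, sigma=0.5)
plain = merge_results(original, synonym, sigma=0.2)
```

Here the coincidence degree is 1/3, so with σ = 0.5 the synonym-only pages sink below the original results, while with σ = 0.2 they are interleaved by their unmodified weights.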
By measuring the coincidence degree between the original and synonymous query results, the search engine estimates the probability that the synonymous query results have drifted in meaning and, when that probability is high, demotes them so that results that do not meet the user's search need do not appear near the top of the search result list. This gives the user a good experience. In this embodiment, the synonymous-context check of the first embodiment need not be applied before the original word is replaced with the synonym. A person of ordinary skill in the art will readily appreciate, however, that this embodiment can be combined with the first: the synonymous-context check is performed before synonym replacement, and after the synonymous query results are returned the search results are merged according to the coincidence degree between the original and synonymous query results. The combination yields more accurate search results and further improves the user experience.
FIGS. 6 and 7 illustrate a third embodiment of the search engine of the present invention. Starting from the synonymous query results, this embodiment analyzes the semantic topic distribution of the result web pages to estimate the probability that the synonymous query results have drifted in meaning (the "escape" probability), and adjusts their position in the search result list finally presented to the user accordingly. As shown in FIG. 6, and similarly to the first embodiment, the search engine 300 includes a search component 11, a data repository 12, an indexer 13, a crawler 14, a user interface 15, a user query log database 16, and a log analyzer 17. The indexer 13, crawler 14, user interface 15, user query log database 16, and log analyzer 17 are the same as in the first embodiment and are not described again here. In this embodiment, the search component 11 includes a search module 111, a query analysis module 112, a result synthesis module 113, and an escape judgment module 115. The data repository 12 contains an index library 121, a web page library 122, a thesaurus 123, and a web page semantic topic library 124. The index library 121, web page library 122, and thesaurus 123 are the same as in the first embodiment and are not described again here. The search engine 300 further includes a topic analysis module 18, which in this embodiment contains a Probabilistic Latent Semantic Analysis (PLSA) model.
The PLSA model is a natural language processing tool used mainly to analyze the latent semantics of documents. A document can be represented as a collection of words, but because synonyms exist, words are not the most basic constituents of a document; there can be considered to be a latent semantic layer, the topic, between words and documents. For example, suppose the user enters the query "瑞士军刀绿颜色" ("Swiss Army knife green color"). Because {"绿颜色" ("green color"), "绿色" ("green")} is a synonym pair, "绿颜色" can be replaced with "绿色" to perform a synonym-expansion query, but the recalled results may then include a web page titled "系统瑞士军刀-完美卸载V2007绿色版" ("System Swiss Army Knife - Perfect Uninstall V2007 Green Edition"). The topic of "瑞士军刀绿颜色" is "item", whereas the topic of that page is "software"; a search engine cannot by itself recognize these implicit topics. The PLSA model is a topic model that infers latent semantic topics from the distribution of co-occurring words in documents. It introduces between documents and words a latent semantic layer made up of n latent semantic topics. Documents and words are assumed to be independent of each other given the topic, so the probability that a document and a word occur together is determined by their probabilistic relationships with the topics. The PLSA model can therefore be used to compute the relationship between a document or word and the latent semantic topics. On this basis, this embodiment uses the PLSA model to obtain the semantic topic distributions of the synonymous context and of the web pages in the synonymous query results, and computes how well the two match in order to determine the probability that the synonymous query results have drifted in meaning. The details follow.
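The independence assumption can be written out explicitly. In the standard PLSA formulation, with z_1, ..., z_n denoting the n latent topics (a notation introduced here for illustration, not used in the text), the joint probability of a document d and a word w is

$$P(d, w) = \sum_{k=1}^{n} P(z_k)\, P(d \mid z_k)\, P(w \mid z_k)$$

Under this reading, the semantic topic distribution of a document is the vector $(P(z_1 \mid d), \ldots, P(z_n \mid d))$, which is presumably what a document-to-topic vector in this embodiment holds.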
As shown in FIG. 6, the topic analysis module 18 fetches a web page from the web page library 122, removes noise such as border advertisements, and extracts a set of keywords that represents the page. Using the PLSA model, the topic analysis module 18 then computes the page's semantic topic distribution as a page-to-latent-topic vector S2={s21, s22, ..., s2n}, where s2n is the page's probability score on the n-th semantic topic. In this embodiment the page topic distributions are obtained offline: the topic analysis module 18 analyzes every crawled web page, obtains its semantic topic distribution, and stores it in the web page semantic topic library 124. The same computation could instead be performed online at search time: once the synonymous query results are available, the topic analysis module 18 analyzes only the pages in those results and passes their semantic topic distributions to the escape judgment module 115. The semantic topic distribution of the synonymous context, by contrast, is obtained online in this embodiment. After the query analysis module 112 has segmented the original query into a keyword set, the topic analysis module 18 takes that keyword set and retrieves from the synonymous context database 1232 the set of terms contained in the corresponding synonymous context. The keyword set and the context term set are combined and passed to the PLSA model, which computes the context's semantic topic distribution as a context-to-latent-topic vector S1={s11, s12, ..., s1n}, where s1n is the probability value of the synonymous context on the n-th semantic topic. The topic analysis module 18 then passes S1 to the escape judgment module 115, which assesses the similarity between S1 and S2. The assessment steps are described in detail below.
The detailed steps by which the search engine 300 of this embodiment performs a synonym-expansion query are described next with reference to FIG. 7. First, the query analysis module 112 receives the user's original query (step 441) and analyzes it (step 442). The query analysis module 112 segments the original query; as in the first embodiment, segmentation is forward maximum matching against the dictionary built from the synonymous contexts. Segmentation yields the original keyword set. On one path, the query analysis module 112 passes the original keyword set to the search module 111, which executes the original query (step 449) and obtains the original query results (step 450). On another path, the query analysis module 112 identifies the original word contained in the original query using the thesaurus 123 and obtains the corresponding potential synonym pair together with its synonymous context. With this data, the query analysis module 112 can directly replace the original word with the synonym to obtain the synonymous query and pass it to the search module 111 to perform the synonym-expansion query (step 443). Preferably, before the replacement is made, the query is first checked against the synonymous context of the original word and the replacement is made only if it matches, which further improves the accuracy of the synonymous query results. Synonym replacement conditioned on the matching degree of the synonymous context was described in detail in the first embodiment and is not repeated here. In addition, the query analysis module 112 passes the original keyword set to the topic analysis module 18, which uses the PLSA model to compute the semantic topic distribution of the synonymous context (step 447) and passes the result to the escape judgment module 115.
After the search module 111 has executed the synonymous query and obtained the synonymous query results (step 444), the escape judgment module 115 retrieves from the web page semantic topic library 124 the semantic topic distribution of each result page, that is, the page-to-latent-topic vector S2={s21, s22, ..., s2n} (step 445). The escape judgment module 115 has also received from the topic analysis module 18 the semantic topic distribution of the synonymous context, that is, the context-to-latent-topic vector S1={s11, s12, ..., s1n}. It then determines how well the two distributions match by computing the similarity of the vectors S1 and S2 (step 446), filters the synonymous query results according to that match (step 448), that is, decides how they are to be demoted, and merges the results of the original and synonymous queries accordingly to generate the search result list (step 451). The similarity of two vectors can be computed in several ways, such as inner-product similarity or cosine similarity. The following is an example of computing the similarity between S1 and S2 using cosine similarity:
$$\mathrm{sim}(S_1, S_2) = \frac{\sum_{i=1}^{n} s_{1i}\, s_{2i}}{\sqrt{\sum_{i=1}^{n} s_{1i}^{2}}\;\sqrt{\sum_{i=1}^{n} s_{2i}^{2}}}$$
A high similarity value means that the web page and the synonymous context both have high probability on the same semantic topics; the two semantic topic distributions then match well, and the page is unlikely to have drifted in meaning. Conversely, a low similarity value means that the page is more likely to have drifted, and the result needs to be demoted (suppressed). Specifically, sim(S1, S2) is a floating-point number in [0, 1]. A threshold α can be set in advance. When sim(S1, S2) lies in [α, 1], the two semantic topic distributions match well; the synonymous query results need not be demoted, and the original and synonymous query results are simply merged according to the predetermined relevance weights of the web pages. When sim(S1, S2) lies in [0, α), the two distributions match poorly and the synonymous query results are more likely to have drifted in meaning, so they need to be demoted. Demotion can take several forms: lowering the relevance weights of the pages in the synonymous query results so that they fall lower in the merged list; inserting the synonymous query results only after a particular page of the search result list, for example moving them to the second page; or placing the synonymous query results after all of the original query results, so that they appear at the very end of the list.
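The similarity test against the threshold α can be sketched as follows. Cosine similarity is the measure named in the text; the topic vectors, the value α = 0.5, the zero-vector fallback, and the function names are illustrative assumptions.

```python
import math

def cosine_similarity(s1, s2):
    """Cosine of the angle between two topic vectors of equal length."""
    dot = sum(a * b for a, b in zip(s1, s2))
    norm = math.sqrt(sum(a * a for a in s1)) * math.sqrt(sum(b * b for b in s2))
    return dot / norm if norm else 0.0

def needs_suppression(context_topics, page_topics, alpha):
    """A result page is demoted when its similarity to the context is below alpha."""
    return cosine_similarity(context_topics, page_topics) < alpha

# Hypothetical 3-topic distributions; topic 0 stands for "item", topic 1 for "software"
context = [0.8, 0.1, 0.1]          # S1 for the synonymous context
knife_page = [0.7, 0.2, 0.1]       # S2 for a page about the physical knife
software_page = [0.05, 0.9, 0.05]  # S2 for a page about an uninstaller tool

keep = not needs_suppression(context, knife_page, alpha=0.5)
demote = needs_suppression(context, software_page, alpha=0.5)
```

Because topic probabilities are non-negative, the cosine value stays within [0, 1], which matches the range stated for sim(S1, S2).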
By comparing how well the semantic topic distribution of the synonymous context matches that of the web pages in the synonymous query results, the search engine can judge whether the synonymous query results meet the user's underlying need and order them within the overall search result list accordingly, so that results that have drifted in meaning do not appear near the top of the list. This gives the user a good experience. Besides the PLSA model described above, other topic models can be used to analyze the latent semantic topics of the synonymous context and of the result pages, such as the Latent Semantic Analysis (LSA) model or the Latent Dirichlet Allocation (LDA) model.
FIGS. 8 to 10 illustrate a fourth embodiment of the search engine of the present invention. This embodiment mainly concerns how synonyms are presented in the search results. FIG. 8 is a block diagram of the search engine 400, which includes a search component 11, a data repository 12, an indexer 13, a crawler 14, and a user interface 15. The data repository 12, indexer 13, crawler 14, and user interface 15 are substantially the same as in the embodiments described above and are not described again here. In this embodiment, the search component 11 includes a search module 111, a query analysis module 112, a result synthesis module 113, a similarity level analysis module 116 that determines the similarity level between a synonym and the original word, and a labeling module 117 that determines how the synonym is displayed.
The synonym-expanded query performed by the search engine of this embodiment is described in detail below with reference to FIG. 9. First, the query analysis module 112 receives the original query entered by the user (step 461) and then analyzes it (step 462). The query analysis module 112 segments the original query into words to obtain an original keyword set. Based on the thesaurus 123, it identifies the original word contained in the original query and obtains a synonym pair consisting of that original word and its synonym. On the one hand, the query analysis module 112 replaces the original word with the synonym to obtain a synonymous query, and the search module 111 then executes the original query and the synonym-expanded query (step 463). After obtaining the original query results and the synonymous query results, the search module 111 passes them to the result synthesis module 113, which merges them and generates a search result list (step 464). The method of merging the original and synonymous query results has been described in detail in the embodiments above and is not repeated here. On the other hand, the query analysis module 112 passes the synonym pair to the similarity level analysis module 116, which determines the similarity level between the synonym and the original word (step 465) and passes the result to the labeling module 117. Next, the labeling module 117 determines how the synonym is displayed according to the determined similarity level, and the labeled search result list is finally presented to the user through the user interface 15 (step 466).
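The flow of steps 461 to 466 can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `THESAURUS`, `SynonymPair`, `analyze_query`, and `run_search` are hypothetical, a substring match stands in for real word segmentation, and the search, merge, level, and annotation functions are passed in as parameters because the source does not specify their interfaces.

```python
from dataclasses import dataclass

# Hypothetical stand-in for thesaurus 123: original word -> list of synonyms.
THESAURUS = {"北大": ["北京大学"]}


@dataclass
class SynonymPair:
    orig: str
    syn: str


def analyze_query(query, thesaurus):
    """Steps 461-462: find original words and build synonymous queries."""
    pairs, syn_queries = [], []
    for orig, synonyms in thesaurus.items():
        # Assumption: a substring test replaces real word segmentation.
        if orig in query:
            for syn in synonyms:
                pairs.append(SynonymPair(orig, syn))
                syn_queries.append(query.replace(orig, syn))
    return pairs, syn_queries


def run_search(query, search_fn, merge_fn, level_fn, annotate_fn,
               thesaurus=THESAURUS):
    """Steps 463-466, with every module supplied by the caller."""
    pairs, syn_queries = analyze_query(query, thesaurus)
    original_results = search_fn(query)                        # step 463
    synonym_results = [r for q in syn_queries for r in search_fn(q)]
    merged = merge_fn(original_results, synonym_results)       # step 464
    levels = {p.syn: level_fn(p) for p in pairs}               # step 465
    return [annotate_fn(title, levels) for title in merged]    # step 466
```

Passing the modules in as functions keeps the sketch independent of any particular merging strategy or labeling rule, both of which the source describes separately.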
The determination of the similarity level between a synonym and its original word, and the corresponding display manner, are further illustrated below with reference to FIG. 10. The similarity level analysis module 116 obtains the synonym pair from the query analysis module 112 (step 471) and first determines whether the synonym and the original word in the pair belong to the high similarity level (that is, the first level, with the highest similarity) (step 472). In this embodiment, a synonym and an original word belong to the high similarity level in cases such as proper noun abbreviation (for example "北京大学" (Peking University) and its abbreviation "北大", or "新浪网" (Sina) and "sina"), numeral conversion (for example "第五集" and "第5集", both meaning "Episode 5"), or place-name variation (for example "北京市" (Beijing Municipality) and "北京" (Beijing)). If the pair belongs to the high similarity level, the synonym is marked in a specific color (step 473), usually a conspicuous color such as the red used in this embodiment. If not, the module next determines whether the synonym and the original word belong to the medium similarity level (that is, the second level, with lower similarity) (step 474). In this embodiment, the determination of the medium similarity level includes a determination of semantic similarity or of word form similarity.
The following is a specific example of the semantic similarity calculation formula:

$$\text{semantic similarity}(orig, syn) = \frac{\text{ClickQueryCount}(orig, syn)}{\text{QueryCount}(orig)}$$
Here ClickQueryCount(orig, syn) denotes the number of historical queries whose query string contains the original word orig and for which the title of the clicked web page does not contain orig but does contain the synonym syn; QueryCount(orig) denotes the number of historical queries whose query string contains orig. For example, if a user entered the historical query "北大在哪里" ("Where is 北大") and then clicked a search result titled "北京大学在哪里" ("Where is Peking University"), that query is counted in both ClickQueryCount(orig, syn) and QueryCount(orig). If the user instead clicked only a result titled "北大在哪里", the query is counted only in QueryCount(orig). The semantic similarity is therefore a floating-point number in [0, 1]. A threshold β can be set in advance: when the semantic similarity lies in [β, 1], the original word and the synonym belong to the medium similarity level; when it lies in [0, β), the word form similarity is determined next. Once the synonym pair has been determined to belong to the medium similarity level, the synonym is marked in a specific font style (step 475), such as bold or italic; bold is used in this embodiment.
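The two counts defined above can be accumulated from a click log roughly as follows. This is a minimal sketch, assuming the semantic similarity is the ratio ClickQueryCount(orig, syn) / QueryCount(orig) and that the log holds one `(query, clicked_title)` record per historical query; the function name `semantic_similarity` and the log layout are hypothetical.

```python
def semantic_similarity(log, orig, syn):
    """Estimate semantic similarity from (query, clicked_title) records.

    Assumption: each record is one historical query with the title of the
    page the user clicked, or None if nothing was clicked.
    """
    query_count = 0        # QueryCount(orig)
    click_query_count = 0  # ClickQueryCount(orig, syn)
    for query, clicked_title in log:
        if orig not in query:
            continue
        query_count += 1
        if (clicked_title is not None
                and orig not in clicked_title
                and syn in clicked_title):
            click_query_count += 1
    # With no matching queries there is no evidence, so return 0.0.
    return click_query_count / query_count if query_count else 0.0


log = [
    ("北大在哪里", "北京大学在哪里"),  # counted in both totals
    ("北大在哪里", "北大在哪里"),      # counted in QueryCount only
    ("清华在哪里", "清华大学在哪里"),  # does not contain orig, ignored
]
print(semantic_similarity(log, "北大", "北京大学"))  # 0.5
```

Because every record counted in ClickQueryCount is also counted in QueryCount, the ratio cannot leave [0, 1], which matches the range stated in the text.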
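The following is a specific example of the word form similarity calculation formula:

$$\text{word form similarity}(orig, syn) = \frac{\text{CoocAlphaCount}(orig, syn)}{\text{AllAlphaCount}(orig, syn)}$$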
Here CoocAlphaCount(orig, syn) denotes the number of characters that the original word orig and the synonym syn have in common, and AllAlphaCount(orig, syn) denotes the total number of distinct characters contained in orig and syn together. For example, for the synonym pair {"怎么", "怎么样"}, CoocAlphaCount("怎么", "怎么样") = 2, because the characters "怎" and "么" appear in both the original word and the synonym, and AllAlphaCount("怎么", "怎么样") = 3, because the pair contains three distinct characters in total: "怎", "么", and "样". For English, letters are counted instead: for the synonym pair {"man", "men"}, CoocAlphaCount("man", "men") = 2 and AllAlphaCount("man", "men") = 4. The word form similarity is therefore also a floating-point number in [0, 1]. A threshold γ can be set in advance: when the word form similarity lies in [γ, 1], the original word and the synonym belong to the medium similarity level, and the labeling module 117 marks the synonym in bold; when it lies in [0, γ), the synonym and the original word belong to the low similarity level (that is, the third level, lower than the second), and the synonym is not marked at all (step 476). Compared with marking in a specific color, a specific font style is less conspicuous but can still attract the user's attention, so it suits synonyms of the medium similarity level, whose meaning or word form has changed but remains fairly close to the original word. A synonym of the low similarity level differs considerably from the original word in meaning or word form, and marking it would seem abrupt to the user, so the preferred approach is to leave it unmarked.
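The word form similarity and the three-level decision of FIG. 10 can be sketched as follows. The sketch assumes the word form similarity is the ratio CoocAlphaCount / AllAlphaCount computed over sets of characters, that the high-level check (abbreviation, numeral conversion, place-name variation) is decided elsewhere and passed in as a boolean, and that the marking is emitted as HTML. The function names, the HTML tags, and the threshold values in the usage lines are all hypothetical.

```python
def word_form_similarity(orig, syn):
    """Shared distinct characters divided by all distinct characters."""
    shared = set(orig) & set(syn)  # CoocAlphaCount(orig, syn)
    total = set(orig) | set(syn)   # AllAlphaCount(orig, syn)
    return len(shared) / len(total) if total else 0.0


def similarity_level(is_high, semantic_sim, word_form_sim, beta, gamma):
    """Steps 472-476: map the checks to 'high', 'medium', or 'low'."""
    if is_high:
        return "high"
    if semantic_sim >= beta or word_form_sim >= gamma:
        return "medium"
    return "low"


def annotate(syn, level):
    """Assumed HTML rendering: red for high, bold for medium, none for low."""
    if level == "high":
        return f'<span style="color:red">{syn}</span>'
    if level == "medium":
        return f"<b>{syn}</b>"
    return syn


sim = word_form_similarity("man", "men")  # 2 shared of 4 distinct letters
level = similarity_level(False, 0.1, sim, beta=0.3, gamma=0.5)
print(level, annotate("men", level))      # medium <b>men</b>
```

Checking the semantic threshold before the word form threshold follows the order in the text: the word form similarity only matters once the semantic similarity has fallen below β.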
By distinguishing the similarity level between a synonym and its original word, the search engine marks synonyms in the search results in a correspondingly suitable way, which lets users quickly locate the information they need without a sense of abruptness and thereby improves the user experience.
Those skilled in the art will readily appreciate that the way of determining the similarity level of synonyms, the way of displaying synonyms, and the correspondence between similarity levels and display manners are not limited to those described in the above embodiment. For example, the similarity level can also be determined by edit distance, or synonyms can be marked by highlighting. In addition, more similarity levels can be defined, for example by splitting semantic similarity and word form similarity into two different levels. The number of levels can also be reduced, so that every synonym is classified only as high or low similarity. For example, a pair may be regarded as high similarity when the synonym and the original word are related by proper noun abbreviation, numeral conversion, or place-name variation, or when the semantic similarity, word form similarity, or edit distance between them is greater than or equal to a specified threshold; all other pairs are then low similarity.
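The edit distance variant mentioned above could look like the following sketch. The source does not say which edit distance is meant or how it is compared with a threshold, so this assumes the Levenshtein distance and, because a distance measures dissimilarity, normalizes it into a similarity in [0, 1] before the two-level comparison. The names `edit_distance`, `edit_similarity`, and `two_level` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def two_level(is_rule_match, similarity, threshold):
    """Reduced scheme: every pair is either 'high' or 'low'."""
    return "high" if is_rule_match or similarity >= threshold else "low"


print(edit_distance("man", "men"))  # 1
print(two_level(False, edit_similarity("man", "men"), 0.5))  # high
```

Dividing by the longer length keeps the score comparable across words of different lengths; other normalizations would work equally well for this purpose.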
It should be understood that although this specification is organized by embodiment, an embodiment does not necessarily contain only one independent technical solution. The specification is written this way only for clarity; those skilled in the art should treat it as a whole, and the technical solutions of the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.
The detailed descriptions above are only specific descriptions of feasible embodiments of the present invention and are not intended to limit its scope of protection. Any equivalent embodiment or modification that does not depart from the technical spirit of the present invention shall fall within the scope of protection of the present invention.
Claims (12)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201110079699.1A CN102722499B (en) | 2011-03-31 | 2011-03-31 | Search engine and implementation method thereof |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN102722499A (en) | 2012-10-10 |
| CN102722499B (en) | 2015-07-01 |
Family
ID=46948266
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201110079699.1A (Active), CN102722499B (en) | 2011-03-31 | 2011-03-31 | Search engine and implementation method thereof |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN102722499B (en) |
Legal Events
| Code | Title |
|---|---|
| C06 | Publication |
| PB01 | Publication |
| C10 | Entry into substantive examination |
| SE01 | Entry into force of request for substantive examination |
| C14 | Grant of patent or utility model |
| GR01 | Patent grant |