
WO2008129339A1 - Method for location identification in web pages and location-based ranking of internet search results - Google Patents

Method for location identification in web pages and location-based ranking of internet search results

Info

Publication number
WO2008129339A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
information
location information
geographic
subset
Prior art date
Application number
PCT/IB2007/001006
Other languages
English (en)
Inventor
Joachim Diederich
Hermann Havermann
Carsten Tautz
Original Assignee
Mitsco - Seekport Fz-Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsco - Seekport Fz-Llc
Priority to PCT/IB2007/001006
Publication of WO2008129339A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries

Definitions

  • the invention relates to a method for location identification in (multilingual) Web pages and the location-based ranking of Internet search results.
  • this invention relates to a method for determining the geographic location of a service or facility described in a web page by processing location relevant information of the Web page or by using a support vector machine and for the location-based ranking by use of heuristic functions and an expert system in the context of localized Internet search.
  • the World Wide Web (in the following: WWW) is a collection of documents in specific formats provided on data processing devices connected via the Internet. These documents are accessed by specifying a particular document in a program, i.e. a Web browser. Since its beginning, the WWW has amassed a huge number of documents and has been subject to various and frequent changes. Thus, it is not possible for a user to keep an overview of the available information. Conventional techniques like employing conventional mercantile directories, e.g. yellow pages or the like, have proven unable to cope with the amount and volatility of the available information. In this environment, search engines have been established as helpful utilities for finding the required information.
  • search engines established in the market therefore employ different algorithms and methods for ranking the display of search results in a way that results most relevant to a user are displayed on top and first.
  • methods like "relevance ranking" which aim at displaying results with the highest priority first.
  • the priority of a given Web page depends on the number of links from other Web pages referencing the page.
  • a further method, described in patent document DE 102 10 840 A1, groups search results by thematic domains and provides ranks for the respective domains.
  • the result of the location information determination can even be enhanced in accuracy and amount by applying the subject-matter of the dependent claims.
  • the invention further provides a computer-based method for providing Web search results with localized relevance, wherein the search results are ranked according to location relevance, e.g. minimal geographic distance to a user.
  • This method employs an expert system implementing a heuristic function. This method is advantageous, since it allows processing of geographic location information in a multitude of formats like geographic positions or areas, exact building addresses, fragmented addresses, suburbs, regions and the like. It is also advantageous, since it allows the combined usage of exact and vague location information, which can be adequately taken into consideration by the expert system. The ranking of the search results can even be improved by applying the subject-matter of the dependent claims.
  • the Figure shows a schematic diagram of components of an information analysis device as well as a flow chart of the methods according to the invention.
  • An information analysis device comprises one or more data processing devices which are connected to the Internet.
  • a program called Crawler C is installed. Via the Internet, the Crawler C has access to Web pages Pi constituting the WWW.
  • the Crawler C is designed to collect the information of all Web pages Pi to be made available to a user of the information analysis device.
  • the Crawler C thus collects a large set of Web pages Pi and preprocesses the Web pages in that structural page information like HTML tags are removed and the Web pages are grouped according to their content language.
  • the thus preprocessed contents of the Web pages Pi are then distributed to two components for identifying the location of the respective Web pages Pi.
  • the location of a Web page Pi does not relate to a place where a Web server hosting the Web page is located but relates to the location the content of the Web page refers to, i.e. to a location where a service or facility described or mentioned on the Web page is placed.
  • the first component A is a program, running on one or more data processing devices.
  • Component A performs the task of determining a location of the Web page by processing information directly pertaining to locations provided in the Web page.
  • Web page content is searched for location relevant information using Boolean expressions. For example, if a Web page relating to an Italian restaurant in Dubai contains address information like country, city, street and street number, i.e. it contains character strings matching distinct formats like postcodes, street names and the like, this information is determined by the Boolean expression, extracted and kept for further processing.
  • ontologies are used to expand the query in the case that the above step of using the Boolean expression provides insufficient results.
  • publicly available databases like Wordnet and OpenCyc are used.
  • the Web page Pi of the Italian restaurant only contains a reference to the city Dubai and the street and street number
  • the country United Arab Emirates can be determined by using ontologies or databases providing semantic information.
  • the Web page Pi only contains a reference to the "capital of the United Arab Emirates", the capital Abu Dhabi could be determined.
  • the extracted address information is complemented by external directories, such as Yellow Page services. For example, using names and/or identified phone numbers, address information can be retrieved and attached.
  • the location information gathered in the first three steps is transmitted to a geographic information system (in the following GIS) to retrieve a geographic position.
  • component B is used to gather location information of the Web page Pi in a case, in which the Web content does not have exact location information such as street addresses, postal codes or even phone numbers.
  • Component B utilizes machine learning techniques.
  • the Web page content of a Web page found by the Crawler C is converted to a so-called bag-of-words (in the following BOW) representation.
  • the thus derived BOW format of the Web page is applied to a support vector machine (in the following SVM) for determining location information of the Web page.
  • The use of SVMs for text classification is well known in the related art and is described, e.g., in Applied Soft Computing, 7 (2007), 923-928 or Heyer, C., Diederich, J., Tibianna: A Learning-Based Search Engine with Query Refinement, Thorn, J., Kay, J. (Eds.), Proceedings of the Seventh Australian Document Computing Symposium, Sydney, Australia (16 December 2002) 105-108. Sydney, Australia: The University of Sydney (2002), ISBN 1-86487-525-9, the content of which is hereby incorporated by reference. A detailed description is therefore omitted.
  • the Web page Pi is indexed and stored in the search index database together with the location information gathered by component B, in the same way as the Web pages treated by component A are stored.
  • the SVM for classifying the Web pages is trained, i.e. the hyperplanes separating input vectors in different classes are calculated by applying input vectors to the SVM.
  • the input vectors are formed by applying the above steps of conversion to a BOW representation, applying stemming, thresholding, forming of N-grams, and normalizing and transforming to Web pages with known location information. These Web pages are labeled with the known location information. Labeling with location information can be accomplished by using Web pages classified by component A, by the use of SVM transduction or even by hand-labeling Web pages with location information.
  • explicit location information is excluded from the Web page in BOW representation, i.e., exact location information used for labeling the Web page is excluded from the input for training the SVM.
  • machine learning methods will utilise contextual information for determining location information of the Web page instead of explicit location information, which is already taken into account by the processing of component A.
  • a search engine S consisting of a computer program running on one or more data processing devices is provided for performing a search in the index database ID and generating a result list of found Web pages.
  • the search engine S comprises a user interface in a Web page format allowing the user to input a search term and the user's geographic position.
  • the geographic position of the user could also be determined by detecting the user's IP address and determining the region in which said address is assigned.
  • the geographic position of the user can be determined by requesting position information, e.g. cell information, of the mobile terminal from the mobile network provider.
  • the user's geographic position could be determined by transmitting the positional data determined by the position detection device.
  • The search result list determined by applying the search term to the search engine is ranked by the search engine according to the positional relation of the location information of each Web page Pi of the search result to the geographic position of the user. To this end, an expert system implementing a heuristic function is used to determine the ranking of the search results in the search result list. Results are ranked according to the position of the location information and their distance from the user's geographical position. Those Web pages Pi with exact location information close to the user are ranked higher than those pages with imprecise location information or at a greater distance from the user.
  • the geographical context may be preset or restricted. For example, a city's Web portal may restrict Internet search results to restaurants or those located within the city boundaries.
  • A is the geographic position of the user. It is further assumed that B, C, D and E are geographic locations expressed in or inferred from Web pages' contents. Then, the following are example rules for the expert system that realizes the heuristic function:
  • If B is a house/building address, representing narrow/exact coordinates, and is geographically close to A, then the Web page with location B is ranked high.
  • If D is a street block and geographically close to A and there are street addresses, e.g. B and C, closer to A, then the Web page with the location D is ranked below B and C.
  • If E is a suburb and geographically close to A and there are no street addresses or blocks close to A, then the Web page with location E is ranked high.
  • If E is a suburb and geographically close to A and C is a house address geographically more distant from A than the entire suburb E, then the Web page of E is ranked below the Web page of C.
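
The pattern-matching step of component A described above (searching preprocessed page text for strings that look like street addresses, postcodes and city names, then combining the hits with a Boolean condition) might be sketched as follows. This is only an illustration under stated assumptions: the regular expressions, the `KNOWN_CITIES` tuple and the function names are hypothetical and are not taken from the patent, and real address formats vary widely by country and language.

```python
import re

# Hypothetical patterns; real address formats differ by country and language.
POSTCODE = re.compile(r"\b\d{5}\b")
STREET = re.compile(
    r"\b(\d{1,4})\s+([A-Z][\w.]*(?:\s+[A-Z][\w.]*)*\s+(?:Street|Road|Avenue))\b"
)
# Stand-in for a gazetteer of city names (assumption).
KNOWN_CITIES = ("Abu Dhabi", "Dubai", "Sydney")


def extract_location(text):
    """Return the explicit location fields found in plain page text."""
    found = {}
    street = STREET.search(text)
    if street:
        found["street_number"] = street.group(1)
        found["street"] = street.group(2)
    postcode = POSTCODE.search(text)
    if postcode:
        found["postcode"] = postcode.group(0)
    for city in KNOWN_CITIES:
        if re.search(r"\b" + re.escape(city) + r"\b", text):
            found["city"] = city
            break
    return found


def has_exact_address(found):
    """Boolean combination: a street AND (a city OR a postcode)."""
    return "street" in found and ("city" in found or "postcode" in found)
```

For a page text such as "Visit our Italian restaurant at 12 Sheikh Zayed Road, Dubai." this sketch would keep the street, street number and city for further processing, while a page that only mentions "the capital of the United Arab Emirates" yields no explicit fields and would be passed on to the later steps.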
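
The ontology step described above (deriving the country from a known city, or the city from a phrase like "capital of the United Arab Emirates") can be illustrated with a toy lookup. The two dictionaries below are a hypothetical stand-in for an ontology such as Wordnet or OpenCyc; no actual API of those resources is used or implied, and `complete_location` is an illustrative name.

```python
# Toy stand-in for semantic knowledge from an ontology (illustrative entries only).
CITY_TO_COUNTRY = {
    "Dubai": "United Arab Emirates",
    "Abu Dhabi": "United Arab Emirates",
}
COUNTRY_TO_CAPITAL = {"United Arab Emirates": "Abu Dhabi"}


def complete_location(found, text):
    """Fill in missing city/country fields of a partial location record."""
    completed = dict(found)
    # "capital of <country>" implies the capital city.
    if "city" not in completed:
        for country, capital in COUNTRY_TO_CAPITAL.items():
            if f"capital of the {country}" in text or f"capital of {country}" in text:
                completed["city"] = capital
                break
    # A known city implies its country.
    if "country" not in completed and completed.get("city") in CITY_TO_COUNTRY:
        completed["country"] = CITY_TO_COUNTRY[completed["city"]]
    return completed
```

The completed record is what would then be sent on to the external directory lookup and the GIS for resolution to a geographic position.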
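
The preprocessing for component B described above (BOW conversion, stemming, thresholding, forming N-grams, normalizing, and excluding explicit location terms) could look roughly like the sketch below. It is an assumption-laden illustration: `crude_stem` is a placeholder suffix stripper rather than a real stemmer, the parameter names are invented, exclusion works on single-word terms only, and the SVM training itself is not shown. The resulting sparse vectors would be the input to whatever SVM implementation is used.

```python
import math
import re
from collections import Counter


def crude_stem(token):
    """Placeholder suffix stripper; a real system would use a proper stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def bow_vector(text, explicit_location_terms=(), min_count=1, n=2):
    """Build an L2-normalized bag of stemmed unigrams and n-grams."""
    # Explicit location words are dropped so that only context remains.
    excluded = {term.lower() for term in explicit_location_terms}
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in excluded]
    stems = [crude_stem(t) for t in tokens]
    ngrams = [" ".join(stems[i:i + n]) for i in range(len(stems) - n + 1)]
    counts = Counter(stems + ngrams)
    # Thresholding: keep only terms that occur at least min_count times.
    kept = {term: c for term, c in counts.items() if c >= min_count}
    norm = math.sqrt(sum(c * c for c in kept.values()))
    return {term: c / norm for term, c in kept.items()} if norm else {}
```

Dropping the explicit terms mirrors the stated design choice: the classifier is meant to learn from contextual words (for example "desert tours") rather than from the address fields already handled by component A.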
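
The example rules above can be approximated by a simple sort key: results close to the user come first, more precise location types (house address, then street block, then suburb) outrank vaguer ones, and distance breaks ties. The following is a minimal sketch, not the patent's expert system. The `Result` class, the `PRECISION` ordering, the `close_km` threshold and the use of the haversine great-circle formula are all assumptions made for illustration, and the last rule is read here as applying to addresses that still fall within the "close" radius.

```python
import math
from dataclasses import dataclass

# Assumed precision ordering of location types (higher = more exact).
PRECISION = {"address": 3, "block": 2, "suburb": 1}


@dataclass
class Result:
    url: str
    kind: str   # "address", "block" or "suburb"
    lat: float
    lon: float


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula."""
    radius = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def rank(results, user_lat, user_lon, close_km=10.0):
    """Order results: close before far, exact before vague, near before distant."""
    def key(res):
        d = distance_km(user_lat, user_lon, res.lat, res.lon)
        return (d > close_km, -PRECISION[res.kind], d)
    return sorted(results, key=key)
```

With a user position A and nearby results B and C (addresses), D (a street block) and E (a suburb), this key places B and C above D and D above E, and it pushes an exact address far outside the radius below all of them. A rule-based expert system could weigh these factors differently; the tuple ordering is just one way to make the priorities explicit.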

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method for determining the geographic location of a service or facility described in a web page by processing location-relevant information in the web page or by using a support vector machine, and for location-based ranking using heuristic functions and an expert system in a localized Internet search. Web pages are preprocessed in different ways so that they can be applied to a support vector machine. A trained support vector machine determines a location classification of the web page content. This location classification is considered by an expert system when ranking a search result.
PCT/IB2007/001006 2007-04-18 2007-04-18 Method for location identification in web pages and location-based ranking of internet search results WO2008129339A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2007/001006 WO2008129339A1 (fr) 2007-04-18 2007-04-18 Method for location identification in web pages and location-based ranking of internet search results

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2007/001006 WO2008129339A1 (fr) 2007-04-18 2007-04-18 Method for location identification in web pages and location-based ranking of internet search results

Publications (1)

Publication Number Publication Date
WO2008129339A1 true WO2008129339A1 (fr) 2008-10-30

Family

ID=38375735

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2007/001006 WO2008129339A1 (fr) 2007-04-18 2007-04-18 Method for location identification in web pages and location-based ranking of internet search results

Country Status (1)

Country Link
WO (1) WO2008129339A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
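CN103530321A (zh) * 2013-09-18 2014-01-22 上海交通大学 Ranking system based on machine learning
CN104462531A (zh) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and system for determining whether a query term invokes a map interface
CN105843934A (zh) * 2016-03-30 2016-08-10 知集市科技成都有限公司 Method and device for generating an expert map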

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000025508A1 (fr) * 1998-10-28 2000-05-04 Vicinity Corporation Method and apparatus for extending web search capabilities
WO2004084099A2 (fr) * 2003-03-18 2004-09-30 Metacarta, Inc. Corpus clustering, confidence refinement, and ranking for geographic text search and information retrieval
US20060149565A1 (en) * 2004-12-30 2006-07-06 Riley Michael D Local item extraction
WO2006073977A1 (fr) * 2004-12-30 2006-07-13 Google Inc. Classification of ambiguous geographic references
US20060200490A1 (en) * 2005-03-03 2006-09-07 Abbiss Roger O Geographical indexing system and method
WO2007002800A2 (fr) * 2005-06-28 2007-01-04 Metacarta, Inc. User interface for geographic search

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIAN-PEI ZHANG ET AL: "A Gradual Training Algorithm of Incremental Support Vector Machine Learning", ADVANCES IN NATURAL COMPUTATION LECTURE NOTES IN COMPUTER SCIENCE;;LNCS, SPRINGER-VERLAG, BE, vol. 3610, 2005, pages 1132 - 1139, XP019013952, ISBN: 3-540-28323-4 *
LISHUANG LI ET AL: "Extracting Location Names from Chinese Texts Based on SVM and KNN", NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING, 2005. IEEE NLP-KE '05. PROCEEDINGS OF 2005 IEEE INTERNATIONAL CONFERENCE ON WUHAN, CHINA 30-01 OCT. 2005, PISCATAWAY, NJ, USA,IEEE, 30 October 2005 (2005-10-30), pages 371 - 375, XP010896958, ISBN: 0-7803-9361-9 *
PARK S-B ET AL: "Co-trained support vector machines for large scale unstructured document classification using unlabeled data and syntactic information", INFORMATION PROCESSING & MANAGEMENT, ELSEVIER, BARKING, GB, vol. 40, no. 3, May 2004 (2004-05-01), pages 421 - 439, XP004502747, ISSN: 0306-4573 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
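CN103530321A (zh) * 2013-09-18 2014-01-22 上海交通大学 Ranking system based on machine learning
CN103530321B (zh) * 2013-09-18 2016-09-07 上海交通大学 Ranking system based on machine learning
CN104462531A (zh) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and system for determining whether a query term invokes a map interface
CN105843934A (zh) * 2016-03-30 2016-08-10 知集市科技成都有限公司 Method and device for generating an expert map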

Similar Documents

Publication Publication Date Title
US8463774B1 (en) Universal scores for location search queries
US7792883B2 (en) Viewport-relative scoring for location search queries
US8732155B2 (en) Categorization in a system and method for conducting a search
US6847959B1 (en) Universal interface for retrieval of information in a computer system
CA2640365C (fr) Geographic coding for location search queries
US10387435B2 (en) Computer application query suggestions
US20050060290A1 (en) Automatic query routing and rank configuration for search queries in an information retrieval system
US20070198495A1 (en) Geographic coding for location search queries
US20050065959A1 (en) Systems and methods for clustering search results
JP5084858B2 (ja) Summary creation device, summary creation method, and program
CN103339623A (zh) Methods and apparatus relating to Internet searching
US20090132646A1 (en) User interface and method in a local search system with static location markers
US20100114854A1 (en) Map-based websites searching method and apparatus therefor
CN101727447A (zh) Method and device for generating URL-based regular expressions
US20020199018A1 (en) Mapping physical locations to web sites
WO2006093394A1 (fr) Server, method and system for an information search service using a web page segmented into multiple information blocks
EP2306333A1 (fr) Offline software library
US20090132514A1 (en) method and system for building text descriptions in a search database
CN111831867B (zh) Address query method and apparatus, electronic device, and computer-readable storage medium
US10339148B2 (en) Cross-platform computer application query categories
WO2009064313A1 (fr) Data correlation in a system and method for conducting a search
WO2009064318A1 (fr) Search system and method for conducting a local search
WO2008129339A1 (fr) Method for location identification in web pages and location-based ranking of internet search results
JP2004280569A (ja) Information monitoring device
CN113515687B (zh) Method and device for acquiring logistics information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07734325

Country of ref document: EP

Kind code of ref document: A1

DPE2 Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07734325

Country of ref document: EP

Kind code of ref document: A1
