
WO2013010557A1 - Method and system for mining data from a document - Google Patents

Method and system for mining data from a document

Info

Publication number
WO2013010557A1
WO2013010557A1 PCT/EP2011/003590 EP2011003590W WO2013010557A1 WO 2013010557 A1 WO2013010557 A1 WO 2013010557A1 EP 2011003590 W EP2011003590 W EP 2011003590W WO 2013010557 A1 WO2013010557 A1 WO 2013010557A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
contextual
module
hyperlink
data
Prior art date
Application number
PCT/EP2011/003590
Other languages
English (en)
Inventor
Miguel De Vega Rodrigo
Adolfo Sanchez-Barbudo Herrera
Original Assignee
Miguel De Vega Rodrigo
Adolfo Sanchez-Barbudo Herrera
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Miguel De Vega Rodrigo, Adolfo Sanchez-Barbudo Herrera
Priority to PCT/EP2011/003590
Publication of WO2013010557A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/137 Hierarchical processing, e.g. outlines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/134 Hyperlinking

Definitions

  • The invention belongs to the field of data mining.
  • Data mining (sometimes called data or knowledge discovery) is the automatic analysis and extraction of useful or relevant information from raw data.
  • Data mining solutions can help humans extract relevant information from electronic documents faster and with less cognitive effort. They can automatically process document contents and extract data that may contain information relevant to the user according to a context. Data mining typically involves at least two basic tasks: clustering and classification.
  • Clustering is the task of discovering parts, groups or structures in the data that are in some way or another "similar", based on a certain division criterion.
  • Classification is the task of assigning a degree of contextual relevance to the groups, structures or parts generated by the clustering task according to a context.
  • US 7451099 discloses various techniques for generating markup information to be displayed on a client computer system. Web page document contents are analysed for selected keywords from a keyword list. Matching words are converted into a link of any designation. This allows an information distributor (e.g., an advertiser) to add links to a web page that direct users to specific web pages and/or present relevant offers.
  • The purpose of that invention is not to help humans extract relevant information from electronic documents faster and with less cognitive effort, but it uses data mining techniques that could be adapted for that purpose. In that case, the solution presents several disadvantages.
  • The parts created by the clustering task in this data mining solution are simply the words in the document. This also means that there is no typology of parts that would allow their contents to be processed differently depending on their type.
  • The classification task uses rather advanced mechanisms, like negative keyword match and fuzzy search. But the context it uses consists of a list of keywords not entered by the reader of the web page, which necessarily means that it is not related to her/his context, but rather to that of the advertiser.
  • Some Internet filters prevent the access to web page documents depending on a user concern, such as security or the suitability for children.
  • Two filters related to security are Web Sense Filter (http://www.websense.com/content/WebFilter.aspx) and Microsoft's SmartScreen filter (http://www.microsoft.com/security/filters/smartscreen.aspx).
  • Two filters related to the suitability of web contents for children are parental controls, such as K9 Web Protection (http://www1.k9webprotection.com), and Net Nanny (http://www.netnanny.com).
  • These filters may be understood as a data mining solution that extracts information (i.e., the suitability of the document depending on a user concern) relevant to the user from the web page document.
  • This solution presents several disadvantages. On the one hand, there is only one part created by the clustering task in this data mining solution that comprises the whole web page document. This also means that there is not a typology of parts that allows processing their contents differently depending on their type. On the other hand, the classification task just contemplates two possible degrees of contextual relevance; either the document is suitable for the user or not.
  • The proposed method performs:
  • A dividing step that divides the content of the document to be mined into a plurality of parts to be individually analyzed, based on a division criterion for defining parts.
  • An analyzing step that analyzes the at least one part by evaluating, for the extracted data, at least one contextual condition associated with the user context.
  • An assigning step that assigns a degree of contextual relevance to the at least one analyzed part according to the fulfilment of the at least one evaluated contextual condition.
  • A modifying step that modifies the appearance on screen of the at least one analyzed part in the document to be mined according to the assigned degree of contextual relevance.
  • The system may comprise the following modules:
  • A dividing module that divides the content of the document to be mined into a plurality of parts to be individually analyzed, based on a division criterion for identifying parts.
  • An extracting module that extracts data from at least one part.
  • An analyzing module that analyzes the at least one part by evaluating, for the extracted data, at least one contextual condition associated with the user context.
  • An assigning module that assigns a degree of contextual relevance to the at least one analyzed part according to the fulfilment of at least one contextual condition associated with the user context.
  • A modifying module that modifies the appearance of the at least one analyzed part in the document to be mined according to the assigned degree of contextual relevance.
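The chain of modules listed above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the paragraph-based division criterion, the "fraction of fulfilled conditions" scoring rule, the 0.5 threshold and the ">> " marker are all hypothetical choices made only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List

# A contextual condition is modelled here as a predicate over extracted data.
Condition = Callable[[str], bool]


@dataclass
class Part:
    content: str
    relevance: float = 0.0
    rendered: str = ""


def divide(document: str) -> List[Part]:
    # Assumed division criterion: one part per blank-line-separated paragraph.
    return [Part(p.strip()) for p in document.split("\n\n") if p.strip()]


def extract(part: Part) -> str:
    # For plain text parts the extracted data is simply the text itself.
    return part.content


def analyze(data: str, context: List[Condition]) -> List[bool]:
    # Evaluate every contextual condition of the user context on the data.
    return [condition(data) for condition in context]


def assign(evaluated: List[bool]) -> float:
    # Assumed rule: the degree is the fraction of fulfilled conditions.
    return sum(evaluated) / len(evaluated) if evaluated else 0.0


def modify(part: Part) -> str:
    # Assumed rendering: relevant parts get a ">> " marker in front.
    return (">> " if part.relevance >= 0.5 else "   ") + part.content


def mine(document: str, context: List[Condition]) -> List[Part]:
    parts = divide(document)
    for part in parts:
        part.relevance = assign(analyze(extract(part), context))
        part.rendered = modify(part)
    return parts
```

Calling `mine(text, [lambda d: "park" in d.lower()])` would mark only the paragraphs that mention "park"; a real system would replace each function with the richer behaviour described in the embodiments.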
  • Yet another object of the invention is to provide a computer program product (in a computer-readable storage medium such as a DVD, CD or the like, or through a network connection) comprising instructions to cause a computer to carry out the steps of the proposed method.
  • FIG. 1 illustrates a flowchart (100) according to the invention. Mined information from a document (110) is presented differently depending on its degree of relevance with respect to a user context (170).
  • FIG. 2 illustrates a flowchart (200) of the steps for defining the contextual conditions (175) that comprise the user context (170).
  • FIG. 3 shows one embodiment in which further steps are taken by the analysing module (160) from FIG. 1 in order to adjust the degree of contextual relevance of a hyperlink part in a document.
  • FIG. 4 shows one embodiment in which a contextual condition (175) is enhanced by an enhancing module (410).
  • The present invention is directed to a method, a system, and a corresponding computer program product for mining user-relevant information from documents.
  • The embodiments of the present invention may include or be performed with a special-purpose or general-purpose computer, including handheld devices such as, but not limited to, electronic book readers, smartphones and tablets. They can be performed locally in the client computer operated by the user reading documents, in remote servers, or in combinations thereof (some modules operating in the client computer and others in remote servers).
  • Embodiments within the scope of the present invention include documents (110) in any computer-readable format.
  • Documents (110) can comprise PDF, DOC, ODT, DJVU, HTML, XML, XHTML, TXT, or any other format which can be used to represent textual information.
  • The user context (170) is relative to the user and to her/his present context. For this reason, the user context (170) may change with time for a given user. For instance, a user interested in extracting information from an electronic newspaper related to the domain of financial institutions may later be interested in sports.
  • The user context (170) is defined by a set containing at least one contextual condition (175).
  • The document (110) is processed by a dividing module (120) that uses a division criterion (125) in order to divide the document into several parts (130).
  • The division criterion usually depends on the format of the document.
  • The dividing module should generally create two types of parts (130): text parts and hyperlink parts.
  • Text parts are parts (130) that mainly or exclusively contain text, and hyperlink parts are parts (130) that mainly or exclusively contain hyperlinks. Text parts usually contain several words.
  • The division criterion (125) may configure the dividing module (120) so that it transforms every paragraph of pure text into a text part, every single hyperlink into a hyperlink part, every paragraph of text containing hyperlinks into a mixed part and every image into an image part.
  • The document (110) may be an HTML webpage, in which case text parts are built from certain HTML tags, such as <p>, <h1>, <title> and <li>, and hyperlink parts are built from <a> tags.
  • The division criterion (125) could also instruct the dividing module (120) to look for scripting code that dynamically adds text or hyperlinks to the webpage and to encapsulate this code in a new type of part that could be called a script part.
  • The scripting language could be, for instance, JavaScript, and the scripting code could reside in the HTML page, in a file referenced from the HTML page, or it could be dynamically fetched from a remote server.
  • A video part and an audio part are formed, respectively, from a video and an audio element embedded in the document.
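As an illustration of an HTML division criterion, the following sketch uses Python's standard `html.parser` module to build text parts from a few text-bearing tags and hyperlink parts from `a` tags. The tag set `TEXT_TAGS`, the class name `PartDivider` and the dictionary layout of a part are assumptions made for this example; mixed, image, script, video and audio parts are deliberately left out.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Assumed subset of text-bearing tags used as the division criterion.
TEXT_TAGS = {"p", "h1", "title"}


class PartDivider(HTMLParser):
    """Collects text parts and hyperlink parts while parsing HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[Dict[str, Optional[str]]] = []
        self._open: List[dict] = []  # stack of parts still being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open.append({"tag": tag, "type": "hyperlink",
                               "chunks": [], "href": dict(attrs).get("href")})
        elif tag in TEXT_TAGS:
            self._open.append({"tag": tag, "type": "text",
                               "chunks": [], "href": None})

    def handle_data(self, data):
        # Text belongs to the innermost part that is currently open.
        if self._open:
            self._open[-1]["chunks"].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1]["tag"] == tag:
            part = self._open.pop()
            text = " ".join("".join(part["chunks"]).split())
            if text or part["href"]:
                self.parts.append({"type": part["type"], "text": text,
                                   "href": part["href"]})


def divide_html(html: str) -> List[Dict[str, Optional[str]]]:
    parser = PartDivider()
    parser.feed(html)
    parser.close()
    return parser.parts
```

Each returned part carries its type, its text (the label, for a hyperlink part) and, for hyperlink parts, the target address, which is the data a later extracting step would work on.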
  • Modules 140, 160 and 190 in FIG. 1 receive and generate pieces of information 130, 150 and 180.
  • The acts carried out by these modules may be repeated for each part identified by the dividing module (120).
  • The user context (170), comprising one or several contextual conditions (175), is the same for each one of these parts.
  • The extracting module (140) in FIG. 1 extracts data (150) from each part (130).
  • The data extracted by this module depends on the type of contents that the part contains.
  • For text parts, the data extracted is simply the text that they contain, possibly with the information concerning the style properties applied to this text. If the document is an HTML web page, the extracting module (140) may also extract the HTML tags enclosing the text in the text part (130).
  • For hyperlink parts, the extracted data (150) may be the text of the label (if any) of the hyperlinks that are found in the hyperlink part.
  • The extracting module (140) may also extract as data (150) from a hyperlink part the addresses of the hyperlinked documents (320) pointed to by the hyperlinks that are found in the hyperlink part, and the analysing module (155) uses these addresses to access the content of the hyperlinked documents (320).
  • The data (150) extracted from a script part embedded in or referenced from an HTML document may comprise the text messages that the scripting code prints on the document (110), as well as the text comments or names of variables, procedures, methods and classes from the scripting code. If the script part dynamically adds hyperlinks to the document, the extracting module (140) could extract data (150) from their labels, from the contents of the documents pointed to by the hyperlinks, and combinations thereof.
  • The extracting module (140) could extract data (150) from this information.
  • The data extracted could be the fields and values from the message.
  • The data extracted may correspond to the metadata associated with these elements, such as their title and tags.
  • The analysing module (155) analyses the data (150) extracted from each part (130) by evaluating one or more contextual conditions (175) in the user context (170) in order to obtain a set of evaluated contextual conditions (160).
  • The assigning module (165) receives the set of evaluated contextual conditions (160) and assigns a degree of contextual relevance (180) to the part (130).
  • The rules or criteria used by the assigning module (165) to assign a degree of contextual relevance (180) to a part (130) depend on the type of contents that the part contains.
  • The degree of contextual relevance (180) assigned by the assigning module (165) is a number that grows with the number of contextual conditions (175) that are fulfilled in the part (130), the number of times that each contextual condition (175) is fulfilled within the part (130), the inverse of the amount of data extracted from the part, and combinations thereof. For instance, if very little data (150) has been extracted from a part, but this data (150) fulfils many contextual conditions (175) many times, then the assigning module (165) assigns a high degree of contextual relevance (180). The converse is also true.
  • The degree of contextual relevance (180) is bounded by a minimum and a maximum value. It may take an infinite number of values between these two bounds.
  • The assigning module (165) may take into account the degree of contextual relevance (180) assigned to neighbouring parts in the document (110). In this case, the assigning module (165) assigns a higher degree of contextual relevance (180) to a part (130) when it is surrounded by parts with a high degree of contextual relevance than when it is not.
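One way to picture such an assignment rule is the sketch below. The text only states which quantities the degree grows with and that it is bounded; the specific formula, the mapping `raw / (1 + raw)` onto the interval from 0 to 1, and the neighbour weight of 0.25 are assumptions chosen for illustration.

```python
from typing import List, Sequence


def degree_of_relevance(hits_per_condition: Sequence[int], data_length: int) -> float:
    """Bounded degree in [0, 1) that grows with fulfilled conditions and hits,
    and shrinks as the amount of extracted data grows (assumed formula)."""
    fulfilled = sum(1 for hits in hits_per_condition if hits > 0)
    total_hits = sum(hits_per_condition)
    raw = (fulfilled + total_hits) / max(1, data_length)
    return raw / (1.0 + raw)


def smooth_with_neighbours(degrees: List[float], weight: float = 0.25) -> List[float]:
    """Blend each part's degree with the mean degree of its adjacent parts."""
    smoothed = []
    for i, degree in enumerate(degrees):
        neighbours = degrees[max(0, i - 1):i] + degrees[i + 1:i + 2]
        if neighbours:
            mean = sum(neighbours) / len(neighbours)
            degree = (1.0 - weight) * degree + weight * mean
        smoothed.append(degree)
    return smoothed
```

Here `hits_per_condition` holds, for each contextual condition, how many times it is fulfilled in the part, and `data_length` could be the number of words extracted. With this choice a short part that matches often scores higher than a long part with the same matches, and a part surrounded by highly relevant parts is pulled upwards.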
  • The modifying module (190) takes the degree of contextual relevance (180) and changes the appearance on screen of the corresponding part (130) in the document (110). That is, it changes the way this part is rendered, producing a modified part (195) version of the original part (130).
  • The changes in the appearance of the corresponding part can comprise, but are not limited to, the following actions: highlighting, underlining, changing the text font, changing the colour, the visibility or the size, deleting, and combinations thereof.
  • For instance, a part containing text with a high degree of contextual relevance can be highlighted, the size of an image in a part with a low degree of contextual relevance can be reduced, the colour of the text in a part with a low degree of contextual relevance (180) can be set close to the background colour of the text, and the hyperlink in a part with a low degree of contextual relevance can be deleted or hidden from the document.
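For an HTML document, the appearance changes just listed could be expressed as inline CSS. The sketch below is only an assumed mapping: the thresholds 0.7 and 0.3, the colours and the function name `style_for` are hypothetical and not taken from the description.

```python
def style_for(degree: float, kind: str = "text",
              high: float = 0.7, low: float = 0.3) -> str:
    """Return an inline CSS declaration for a part of the given kind."""
    if degree >= high:
        return "background-color: yellow;"   # highlight relevant parts
    if degree <= low:
        if kind == "hyperlink":
            return "display: none;"          # hide irrelevant hyperlinks
        if kind == "image":
            return "width: 50%;"             # shrink irrelevant images
        return "color: #cccccc;"             # fade text towards a light background
    return ""                                # leave mid-relevance parts unchanged
```

The returned string could be written into the `style` attribute of the element that corresponds to the part, which leaves the original content untouched and makes the change easy to undo when the user context changes.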
  • The way the document parts are presented to the reader depends on what the reader is interested in (i.e., on the user context).
  • The same document (110) may look different to different readers, and even to the same reader if she/he reads the document twice looking for different things (i.e., with a different user context).
  • The degree of contextual relevance (180) for one or more parts (130) in a document (110) can be stored in a local or remote database or file, or it can be coded in the document (110) itself.
  • The degree of contextual relevance (180) for each part (130) is stored together with a value that identifies the part it corresponds to. Examples of such a value are a reference to the location of the part within the document and a hash value computed from the contents of the part.
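A small sketch of the hash-based variant follows. It assumes SHA-256 as the hash function and a JSON object as the storage format; neither is specified in the description, and the function names are hypothetical.

```python
import hashlib
import json
from typing import Iterable, Optional, Tuple


def part_key(content: str) -> str:
    # Identify a part by a hash of its contents (SHA-256 is an assumption).
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def serialize_degrees(parts: Iterable[Tuple[str, float]]) -> str:
    # Store each degree together with the value that identifies its part.
    return json.dumps({part_key(content): degree for content, degree in parts},
                      sort_keys=True)


def lookup_degree(stored: str, content: str) -> Optional[float]:
    # Returns None when the part is unknown or its contents have changed.
    return json.loads(stored).get(part_key(content))
```

Keying by content hash means a stored degree is found again even if the part moves within the document, while any edit to the part invalidates the entry; a location-based key has the opposite trade-off.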
  • FIG. 2 illustrates a flowchart (200) of the steps for defining the contextual conditions (175) that comprise the user context (170) by the defining module.
  • The user context (170) is defined by a set containing at least one contextual condition (175).
  • A contextual condition (175) is any condition that can be evaluated by a computer and that upon evaluation yields either true or false as a result.
  • Contextual conditions are evaluated by the analysing module (160) on the data (150) extracted by the extracting module (140).
  • Contextual conditions (175) may be defined by a defining module (210). This module gathers information from a variety of sources and uses this information in order to define the contextual conditions (175) that comprise the user context (170).
  • The defining module (210) can define a contextual condition (175) from the information entered by the user in a user interface (220), or from other sources of information comprising the user history data (260) and the document address (270).
  • The defining module (210) can define a contextual condition (175) from the information entered in an input field (230) by the user who is reading the document (110).
  • Such input field (230) could be, but is not limited to, an input field that appears in the user interface of the program used to view the document (110) for instance when a certain combination of keys is pressed (e.g., a search box), an input field (230) embedded in the document being viewed (e.g., an input form in an HTML document or the search box in an internet search engine), and an input field that belongs to an extension made to the program used to view the document (e.g., a browser extension, add-on or plugin such as a search engine toolbar).
  • A contextual condition (175) defined from the information entered in an input field (230) can consist of evaluating the presence or absence of a word in the data. It can also consist of evaluating whether a Boolean or regular expression is fulfilled on the data.
  • An example of a Boolean expression is "(park OR children) AND -winter", which could mean, in one possible interpretation of the syntax: return true if the data contains the word "park" or the word "children" and does not contain the word "winter".
  • An example of a regular expression is ".at", which means: return true if the data contains any three-character string ending with "at", including "hat", "cat" and "bat".
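The two example conditions can be sketched as composable predicates using Python's standard `re` module. The helper names (`contains_word`, `any_of`, `all_of`, `negate`, `matches`) are hypothetical, and the case-insensitive whole-word matching is an assumed interpretation of "contains the word".

```python
import re
from typing import Callable

Condition = Callable[[str], bool]


def contains_word(word: str) -> Condition:
    # Whole-word, case-insensitive presence test (assumed interpretation).
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return lambda data: pattern.search(data) is not None


def any_of(*conditions: Condition) -> Condition:
    return lambda data: any(c(data) for c in conditions)


def all_of(*conditions: Condition) -> Condition:
    return lambda data: all(c(data) for c in conditions)


def negate(condition: Condition) -> Condition:
    return lambda data: not condition(data)


def matches(regex: str) -> Condition:
    pattern = re.compile(regex)
    return lambda data: pattern.search(data) is not None


# "(park OR children) AND -winter"
park_or_children_not_winter = all_of(
    any_of(contains_word("park"), contains_word("children")),
    negate(contains_word("winter")),
)

# ".at": any character followed by "at"
dot_at = matches(r".at")
```

Because every condition is a function from data to a Boolean, conditions from different sources (input field, preferences, history) can be combined without the evaluating code knowing where they came from.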
  • The defining module (210) can define a contextual condition (175) from the information obtained from a set of user preferences (240). These user preferences can be stored in a local or remote database or file, or they can be coded in the document (110) itself. They can contain information entered by the user in an input field (230), or information extracted from a variety of information sources, such as the history data (260) or the document address (270).
  • An example of a contextual condition defined from the information in a set of user preferences (240) is to evaluate the absence of any word belonging to a list of offensive words or to a list of banned subjects. In another example, such a contextual condition (175) evaluates the absence of hyperlinks that link to an address that belongs to a list of forbidden addresses.
  • For instance, a contextual condition could evaluate the absence of hyperlinks pointing to webpages with forbidden URLs.
  • Another possible contextual condition (175) defined through a set of user preferences is to evaluate the matching of style properties on the data.
  • This can be combined with other contextual conditions.
  • For instance, a combined contextual condition could be to evaluate the presence of a certain word in text that has the bold or underline style property applied.
  • The fulfilment of this combined condition could allow the assigning module (165) in FIG. 1 to assign a higher degree of contextual relevance (180) to the part (130).
  • The defining module (210) can also use history data (260) collected by a data collecting module (250) from at least one previously consulted document in order to define one or more contextual conditions (175).
  • The history data (260) can be stored in a local or remote database or file, or it can be coded in the document itself. It comprises, but is not limited to, the list of documents recently opened by the user, as well as the list of main user actions performed on those documents, such as word searches and mouse clicks on hyperlinks. This information tells what the user has recently been doing and can therefore be used to define contextual conditions (175) within the user context (170). For instance, the user could have accessed the present document by clicking on a hyperlink located in a previously opened document.
  • The data collecting module (250) can collect this event, together with the words from the label corresponding to the clicked hyperlink, as part of the history data (260), and the defining module (210) can define a contextual condition (175) that evaluates the presence of these words.
  • The document address (270) can also be used by the defining module (210) in order to define a contextual condition (175).
  • In some cases, the document address refers to its URL. In other cases, it refers to the path of the file that contains the document in the corresponding local or remote file system, together with the name of this file.
  • The defining module (210) can extract words from the document address and define contextual conditions that evaluate the presence of these words in the document (110).
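Extracting candidate words from a document address could look like the sketch below, which relies on `urllib.parse` from the Python standard library. The list of ignored tokens, the minimum word length and the function name are assumptions made for the example.

```python
import re
from typing import List
from urllib.parse import unquote, urlparse

# Assumed list of address tokens that carry no topical meaning.
IGNORED = {"www", "com", "org", "net", "html", "htm", "php", "pdf"}


def words_from_address(address: str, min_length: int = 3) -> List[str]:
    """Return lower-cased words found in a URL or a POSIX-style file path."""
    parsed = urlparse(address)
    # For a URL, use host and path; for a plain path everything is in .path.
    raw = unquote(parsed.netloc + " " + parsed.path)
    words: List[str] = []
    for word in re.findall(r"[A-Za-z]+", raw):
        word = word.lower()
        if len(word) >= min_length and word not in IGNORED and word not in words:
            words.append(word)
    return words
```

Each returned word could then seed a contextual condition that tests for the presence of that word in the extracted data.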
  • The input field (230) and the user preferences (240) both receive information from the user. They are components of the user interface (220) through which users influence how the defining module (210) defines the contextual conditions (175) of the user context (170).
  • The user enters information in the input field (230) and then presses the ENTER key or presses a button, and a contextual condition (175) is defined.
  • The user configures her/his set of user preferences (240) by entering information in input fields (230) or by selecting options from a check box, radio button, date picker, toggle button, list box or menu bar.
  • The user interface (220) further comprises a reset button.
  • When the user pushes the reset button, all or part of the contextual conditions (175) are eliminated from the user context (170).
  • By means of the reset button, the user has control over the life cycle of the user context (170). More specifically, the user can decide when to change her/his user context (170).
  • For example, the user enters in the input field (230) the words "quantum physics" and the defining module (210) defines one or more corresponding contextual conditions (175).
  • When the user finds in the document (110) the information about quantum physics she/he is looking for, she/he decides that her/his interest has shifted from quantum physics to thermodynamics.
  • The user then enters in the input field (230) the word "thermodynamics" and the defining module (210) defines one or more corresponding contextual conditions (175), thus effectively changing the user context (170) at the user's command.
  • When the user selects a new document to be mined by clicking on a hyperlink located in a document that has been mined, the following actions take place.
  • The user context (170) is deleted, that is, all or most of its contextual conditions (175) are eliminated.
  • A new user context (170) is created by defining contextual conditions (175) from various sources of information, comprising the new document address (270) and history data (260), such as the words in the label of the clicked hyperlink. If the clicked hyperlink has a high degree of contextual relevance (180), then the user context (170) is not deleted. Additionally, new contextual conditions (175) may be added to the user context (170) by taking into account information from various sources, comprising the new document address (270) and history data (260), such as the words in the label of the clicked hyperlink.
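The life cycle of the user context on a hyperlink click can be sketched as follows. For simplicity the context is represented as a list of words, each standing for a "word is present" condition; this representation, the threshold of 0.7 and the function name are assumptions.

```python
from typing import Iterable, List


def next_user_context(current: List[str],
                      label_words: Iterable[str],
                      address_words: Iterable[str],
                      clicked_degree: float,
                      high: float = 0.7) -> List[str]:
    """Build the user context for the newly opened document."""
    # Keep the old context only if the clicked hyperlink was highly relevant.
    context = list(current) if clicked_degree >= high else []
    # Add conditions from the clicked label and the new document address.
    for word in list(label_words) + list(address_words):
        if word not in context:
            context.append(word)
    return context
```

A click on a highly relevant hyperlink is thus treated as a continuation of the same interest, while a click on a low-relevance hyperlink is treated as a change of topic.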
  • The flowchart 300 of FIG. 3 describes further steps (310, 320, 330, 335, 340, 350, 355, 360, 365, 370, 375, 170, 175, 380 and 390) which are taken by the analysing module (160) in FIG. 1 according to another embodiment.
  • The accessing module 310, dividing module 330, extracting module 350, analysing module 365, assigning module 375 and adjusting module 390 receive and/or generate data from the hyperlink part 150, one or more hyperlinked documents 320, an auxiliary division criterion 335, one or more hyperlinked document parts 340, auxiliary data 360, one or more contextual conditions 175, one or more auxiliary evaluated contextual conditions 370, an auxiliary degree of contextual relevance 380 for each hyperlinked document part, and a degree of contextual relevance for the hyperlink part 180.
  • The purpose of the acts carried out by these modules is to adjust the degree of contextual relevance (180) associated with a hyperlink part depending on the contents of the documents pointed to by the hyperlinks contained in the hyperlink part.
  • An accessing module (310) accesses the hyperlinked document (320) pointed to by each hyperlink contained in the data (150) extracted from the hyperlink part. At least one of these hyperlinked documents (320) is divided into hyperlinked document parts (340) by an auxiliary dividing module (330), which operates according to an auxiliary division criterion (335).
  • An auxiliary extracting module (350) further extracts auxiliary data (360) from at least one hyperlinked document part (340).
  • An auxiliary analysing module (365) analyses this auxiliary data (360) by means of evaluating one or more contextual conditions (175) in the user context (170) in order to obtain a set of auxiliary evaluated contextual conditions (370).
  • An auxiliary assigning module (375) receives the set of auxiliary evaluated contextual conditions (370) and assigns an auxiliary degree of contextual relevance (380) to the hyperlinked document part (340).
  • An adjusting module (390) evaluates the auxiliary degrees of contextual relevance (380) of each one of the analysed hyperlinked document parts (340) in the document and adjusts the value of the degree of contextual relevance (180) associated to the hyperlink part accordingly.
  • The accessing module (310) not only accesses the hyperlinked documents (320) directly pointed to by the hyperlinks contained in the data (150) extracted from the hyperlink part (the level-1 hyperlinked documents). It also recursively accesses a tree of n levels of documents directly or indirectly pointed to by the level-1 hyperlinked documents, where n is any finite integer greater than one.
  • The accessing module (310) accesses documents that are pointed to by hyperlinks located in the level-1 hyperlinked documents (i.e., level-2 hyperlinked documents), documents pointed to by hyperlinks located in level-2 hyperlinked documents (i.e., level-3 hyperlinked documents), and in general, level-j hyperlinked documents pointed to by hyperlinks located in level-(j-1) hyperlinked documents, for 2 ≤ j ≤ n.
  • The adjusting module (390) gives more importance to the degrees of contextual relevance (180) obtained from level-j hyperlinked documents when j is close to 2 than when j is close to n.
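The level-by-level adjustment can be illustrated with a breadth-first traversal in which deeper levels receive geometrically smaller weights. This is a sketch under assumptions: the callbacks `links_of` and `degree_of` stand in for the accessing and auxiliary assigning modules, and the weighted-average formula with `decay = 0.5` is one possible way of giving more importance to levels close to the hyperlink part.

```python
from typing import Callable, Iterable, List


def adjust_degree(own_degree: float,
                  start_addresses: Iterable[str],
                  links_of: Callable[[str], List[str]],
                  degree_of: Callable[[str], float],
                  n: int = 3,
                  decay: float = 0.5) -> float:
    """Weighted average of the hyperlink part's own degree and the mean
    degree of the hyperlinked documents found at levels 1..n."""
    frontier = list(dict.fromkeys(start_addresses))  # level-1 documents
    visited = set(frontier)
    weighted_sum, weight_total = own_degree, 1.0
    for level in range(1, n + 1):
        if not frontier:
            break
        weight = decay ** level                      # deeper levels count less
        level_mean = sum(degree_of(a) for a in frontier) / len(frontier)
        weighted_sum += weight * level_mean
        weight_total += weight
        next_frontier = []
        for address in frontier:
            for target in links_of(address):
                if target not in visited:            # avoid cycles
                    visited.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return weighted_sum / weight_total
```

In practice `links_of` would fetch and divide a document and `degree_of` would summarise the auxiliary degrees of its parts; both are kept abstract here so the traversal can be exercised on an in-memory link graph.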
  • FIG. 4 illustrates one embodiment 400 in which at least one of the contextual conditions (175) from the user context is processed by an enhancing module (410) resulting in an enhanced contextual condition (420).
  • The enhanced contextual condition (420) is a version of the original contextual condition (175) that has undergone some modifications in the enhancing module.
  • Such modifications can comprise: adding to the contextual condition at least one synonym of at least one word in the contextual condition; adding to the contextual condition at least one inflected or derived word from at least one word in the contextual condition (e.g., adding from the word "work" the inflected and derived forms "workaholic", "worked" and "working"); eliminating from the contextual condition a word when it is in a list of irrelevant words (e.g., eliminating from "the house" the article "the"); adding to the contextual condition at least one phonetically equivalent word of at least one word in the contextual condition (e.g., adding from the word "cool" the phonetically equivalent word "kool"); adding to the contextual condition a plural or singular form of at least one word in the contextual condition (e.g., adding from the word "tree" the word "trees"); adding to the contextual condition different capitalized versions of at least one word in the contextual condition (e.g. adding from the word "tree" the

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a method, a system and a computer program product for mining data from a document or documents selected by a user. The data mining is performed according to an established user context that defines the user's preferences. As a result of analysing the information contained in the different parts of a document for a user context, different degrees of relevance are assigned to parts, and the appearance of the parts on a screen can be modified accordingly for better identification of the relevant information in the document. Thus, the appearance of parts is modified on the basis of their contextual relevance to the user. The content of the document can be divided into parts such that each part is of the same type (for example, text or hyperlinks). In that case, different data extraction, analysis and assignment strategies can be used for each type of part. In addition, the relevant information can be stored and used.
PCT/EP2011/003590 2011-07-19 2011-07-19 Method and system for mining data from a document WO2013010557A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2011/003590 WO2013010557A1 (fr) 2011-07-19 2011-07-19 Method and system for mining data from a document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2011/003590 WO2013010557A1 (fr) 2011-07-19 2011-07-19 Method and system for mining data from a document

Publications (1)

Publication Number Publication Date
WO2013010557A1 (fr) 2013-01-24

Family

ID=44629044

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2011/003590 WO2013010557A1 (fr) 2011-07-19 2011-07-19 Method and system for mining data from a document

Country Status (1)

Country Link
WO (1) WO2013010557A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100213A (zh) * 2020-09-07 2020-12-18 中国人民解放军海军工程大学 Method for searching and ranking ship equipment technical data
US11416575B2 (en) 2020-07-06 2022-08-16 Grokit Data, Inc. Automation system and method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040080532A1 (en) * 2002-10-29 2004-04-29 International Business Machines Corporation Apparatus and method for automatically highlighting text in an electronic document
EP1679617A2 (fr) * 2005-01-07 2006-07-12 Palo Alto Research Center Incorporated Method for automatic conceptual highlighting in electronic text
US20080098300A1 (en) * 2006-10-24 2008-04-24 Brilliant Shopper, Inc. Method and system for extracting information from web pages
WO2008103623A1 (fr) * 2007-02-22 2008-08-28 Microsoft Corporation Synonym and similar word page search
US7451099B2 (en) 2000-08-30 2008-11-11 Kontera Technologies, Inc. Dynamic document context mark-up technique implemented over a computer network
US20110119262A1 (en) * 2009-11-13 2011-05-19 Dexter Jeffrey M Method and System for Grouping Chunks Extracted from A Document, Highlighting the Location of A Document Chunk Within A Document, and Ranking Hyperlinks Within A Document

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11416575B2 (en) 2020-07-06 2022-08-16 Grokit Data, Inc. Automation system and method
US11568019B2 (en) 2020-07-06 2023-01-31 Grokit Data, Inc. Automation system and method
US11580190B2 (en) 2020-07-06 2023-02-14 Grokit Data, Inc. Automation system and method
US11640440B2 (en) * 2020-07-06 2023-05-02 Grokit Data, Inc. Automation system and method
US11860967B2 (en) 2020-07-06 2024-01-02 The Iremedy Healthcare Companies, Inc. Automation system and method
US11983236B2 (en) 2020-07-06 2024-05-14 The Iremedy Healthcare Companies, Inc. Automation system and method
US12259940B2 (en) 2020-07-06 2025-03-25 Grokit Data, Inc. Automation system and method
CN112100213A (zh) * 2020-09-07 2020-12-18 中国人民解放军海军工程大学 Method for searching and ranking ship equipment technical data

Similar Documents

Publication Publication Date Title
US8856100B2 (en) Displaying browse sequence with search results
US7519621B2 (en) Extracting information from Web pages
Papadakis et al. Stavies: A system for information extraction from unknown web data sources through automatic web wrapper generation using clustering techniques
US20090070366A1 (en) Method and system for web document clustering
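CN101464905B (zh) System and method for web page information extraction
CN109033358B (zh) Method for news aggregation and intelligent entity association
Peters et al. Content extraction using diverse feature sets
CN106095979B (zh) URL merging processing method and device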
US11263062B2 (en) API mashup exploration and recommendation
WO2004083990A2 (fr) Method and system for web content adaptation
CN105205080A (zh) Redundant file cleaning method, device and system
CN108733813A (zh) Information extraction method, system and medium for BBS forum web page content
Mehta et al. DOM tree based approach for web content extraction
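KR20190058141A (ko) Method and apparatus for generating data extracted from a document
CN114443928B (zh) Web text data crawling method and system
CN111125485A (zh) Scrapy-based website URL crawling method
CN102257490A (zh) Document information selection method and computer program product
WO2013010557A1 (fr) Method and system for mining data from a document
CN112989163A (zh) Vertical search method and system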
US20120109965A1 (en) System for automatic semantic-based mining
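CN114003714B (zh) Document-context-aware intelligent knowledge push method
CN106991144B (zh) Method and system for customizing data crawling workflows
CN105787032B (zh) Method and device for generating web page snapshots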
Kaddu et al. To extract informative content from online web pages by using hybrid approach
KR101650316B1 (ko) Apparatus and method for collecting and analyzing HTML5 documents based on distributed parallel processing

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 11736000

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN EP: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02/06/2014)

122 EP: PCT application non-entry in European phase

Ref document number: 11736000

Country of ref document: EP

Kind code of ref document: A1
