US20090070327A1 - Method for automatically generating regular expressions for relaxed matching of text patterns
- Publication number
- US20090070327A1 (application US11/850,987)
- Authority
- US
- United States
- Prior art keywords
- regular expression
- automatically
- token list
- operator
- rule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
Definitions
- the present invention relates to a method and system for automatically generating regular expressions for relaxed matching of text patterns.
- One category of information extraction employs query expansion and other query processing techniques in search engines.
- Conventional query expansion techniques generate an expanded output query from an original query, where the expanded output query includes additional words obtained from a synonym dictionary.
- the results of the expanded output query are documents that contain either the keywords of the original query or the additional words from the synonym dictionary.
- Being based on a natural language dictionary (e.g., a standard English dictionary), the synonym dictionary is limited in its ability to match certain text pattern variations related to punctuation, spacing, new lines between words, arbitrary capitalization, colloquial abbreviations, etc.
- known query processing techniques that employ stemming and stop word removal decrease precision in information retrieval results.
- Another category of information extraction is rule-based and utilizes regular expressions.
- the present invention provides a computer-implemented method of automatically generating regular expressions for relaxed matching of text patterns, comprising:
- the present invention provides a technique for automatically generating regular expressions for a relaxed matching of text patterns. Further, the present invention provides a generic, extensible, and widely applicable rule-based framework in which the automatic generation of regular expressions is based on the creation and updating of rules without requiring the writing and maintenance of complex and customized software programs.
- FIG. 1A is a block diagram of a first system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
- FIG. 1B is a block diagram of a second system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
- FIG. 2 is a flow diagram of a regular expression generation process implemented by the system of FIG. 1A or FIG. 1B , in accordance with embodiments of the present invention.
- FIG. 4A depicts an algorithm to apply a REPLACE_WORD rule included in the rule set of FIG. 3 , in accordance with embodiments of the present invention.
- FIG. 4B depicts an exemplary tokenized phrase generated via the process of FIG. 2 , in accordance with embodiments of the present invention.
- FIG. 4C depicts an exemplary set of tokens resulting from executing the algorithm of FIG. 4A to apply the REPLACE_WORD rule included in the rule set of FIG. 3 to an escaped version of the phrase of FIG. 4B , in accordance with embodiments of the present invention.
- FIG. 4D depicts an exemplary regular expression generated by replacing tokens in the set of tokens of FIG. 4C via the process of FIG. 2 , in accordance with embodiments of the present invention.
- FIG. 5A depicts an algorithm to apply a SPLIT_AT_CHARACTER rule included in the rule set of FIG. 3 , in accordance with embodiments of the present invention.
- FIG. 5B depicts an exemplary set of tokens resulting from applying the rule of FIG. 5A via the process of FIG. 2 , in accordance with embodiments of the present invention.
- FIG. 5C depicts an exemplary set of tokens resulting from applying the rule of FIG. 4A to the set of tokens of FIG. 5B via the process of FIG. 2 , in accordance with embodiments of the present invention.
- FIG. 5D depicts an exemplary regular expression generated by replacing tokens in the set of tokens of FIG. 5C via the process of FIG. 2 , in accordance with embodiments of the present invention.
- FIG. 6 is a table of entities and relationships used in experiments for determining recall and precision of regular expressions generated by the process of FIG. 2 , in accordance with embodiments of the present invention.
- FIGS. 7A-7D are tables of results of four sets of experiments organized according to the table of FIG. 6 , where the experiments are for determining recall and precision of regular expressions generated by the process of FIG. 2 , in accordance with embodiments of the present invention.
- FIG. 8 is a block diagram of a computing unit that includes a relaxed regular expression generator of the system of FIG. 1A or FIG. 1B , in accordance with embodiments of the present invention.
- a text pattern of interest for this example is the phrase “can be reached at”.
- a rule-based IE system identifies occurrences of the form “<Person> can be reached at <Phone>” and generates the corresponding pairs of related Persons and Phones.
- the phrase “can be reached at” may occur with several variations: extra punctuation, multiple spaces or new lines between words, arbitrary capitalization, colloquial abbreviations for words (e.g., “reached” abbreviated as “rchd”).
- Such variation in text is particularly true for informal communication mediums such as email where the formatting and style of the text is not strictly controlled.
- a regular expression is used to account for the original input phrase “can be reached at” as well as the multiple variations.
- the present invention addresses this problem by providing a generic and extensible rule-based framework for automatically generating a regular expression from a given input phrase (i.e., a plain text pattern) provided by a user.
- the input phrase is provided in a natural, human language (e.g., a user's native English).
- the regular expression output by the present invention improves the recall (i.e., increases the set of occurrences of the input phrase and its variations that are identified in the text) with little or no decrease in precision (i.e., without increasing the identification of spurious instances in the text).
- relaxation is the method of the present invention that converts a plain text pattern to an output regular expression that matches the original plain text pattern and that matches other strings that are variations of the original plain text pattern.
- the overall algorithm whose execution provides relaxation is referred to herein as the relaxed regular expression generator.
- the relaxation disclosed herein includes syntactic relaxation and semantic relaxation. Syntactic relaxation includes matching to text patterns whose variation from the original plain text pattern is based on primarily syntactic aspects of the original plain text pattern such as punctuation and whitespace between words (i.e., matching to patterns that have different punctuation and/or whitespace while having the same words and the same meaning as the original plain text pattern). Semantic relaxation includes matching to text patterns whose variation from the original plain text pattern is based on a modification of the words of the original plain text pattern while retaining the meaning of the original plain text pattern.
- FIG. 1A is a block diagram of a first system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
- First system 100 includes a user input phrase 102 , a relaxed regular expression generator 104 , a relaxation rule file 106 and an output regular expression 108 .
- User input phrase 102 is input into relaxed regular expression generator 104 as a phrase expressed in a natural, human language (e.g., a native English phrase).
- Relaxed regular expression generator 104 obtains relaxation rules from relaxation rule file 106 and applies the obtained rules to user input phrase 102 to automatically generate regular expression 108 as output.
- the relaxation rules in file 106 are predefined manually by, for example, an administrator of system 100 .
- relaxed regular expression generator 104 is also referred to simply as regular expression generator 104 or generator 104 .
- Relaxation rules included in relaxation rule file 106 are also referred to herein simply as rules. The functionalities of the components of system 100 are described in more detail below relative to FIG. 2 .
- system 100 includes an information extraction system (not shown) that includes an annotator generator (not shown).
- the annotator generator is coupled to relaxed regular expression generator 104 .
- generator 104 receives as input an annotator rule expressed in a natural, human language and outputs an annotator rule as regular expression 108 .
- the output regular expression is a relaxed regular expression in that it matches the original input annotator rule as well as variations of the annotator rule.
- the annotator generator then uses output regular expression 108 to generate an annotator that facilitates information extraction.
- FIG. 1B is a block diagram of a second system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
- Second system 120 implements another embodiment of the present invention and includes user input phrase 102 , relaxed regular expression generator 104 , a software-based rule learning component 122 , one or more output relaxation rules 124 and an output regular expression 108 .
- user input phrase 102 is expressed in a natural, human language and is input into generator 104 .
- rule learning component 122 automatically learns one or more relaxation rules and outputs one or more rules 124 , which are then obtained by generator 104 and applied by generator 104 to user input phrase 102 to generate regular expression 108 .
- In step 204, regular expression generator 104 (see FIG. 1A and FIG. 1B) receives user input phrase 102 and determines whether user input phrase 102 is already a regular expression or is a plain text pattern.
- If step 204 determines that phrase 102 is a plain text pattern, processing continues with step 206.
- In step 206, generator 104 (see FIG. 1A and FIG. 1B) detects word boundaries in phrase 102 and tokenizes the plain text pattern that comprises phrase 102 to generate a set of input tokens.
- In step 208, generator 104 (see FIG. 1A and FIG. 1B) maps each of the aforementioned input tokens to a specific, internal representation for the system (e.g., system 100 of FIG. 1A) to produce a token list (i.e., a sequence of tokens).
- In step 210, generator 104 (see FIG. 1A and FIG. 1B) replaces regular expression special characters in each entry of the token list produced in step 208 with escaped characters to generate a transformed token list (i.e., a tokenized and escaped phrase).
- Java® regular expression characters in a token list produced in step 208 are replaced with escaped characters.
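The escaping of step 210 can be illustrated with a minimal sketch. It is written in Python rather than Java, `escape_tokens` is a hypothetical helper name, and Python's `re.escape` merely stands in for whatever escaping the generator actually performs.

```python
import re

def escape_tokens(tokens):
    """Escape regex metacharacters so that each token matches only itself."""
    return [re.escape(tok) for tok in tokens]

# Hypothetical token list produced by a whitespace tokenizer.
escaped = escape_tokens(["cost", "(USD)", "1.5"])
```

After escaping, the parentheses and the period are treated as literal characters, so the later rule applications can insert real regular expression syntax without it being confused with characters from the user's phrase.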
- In step 212, generator 104 (see FIG. 1A and FIG. 1B) applies one or more rules from the predefined rule set loaded in step 202 to the token list generated in step 210, in the order specified in relaxation rule file 106.
- The application of the one or more rules in step 212 generates a modified token list (a.k.a. a tokenized and modified phrase) that is a transformed version of input phrase 102.
- Step 212 includes applying the modification operator to the token list generated in step 210 or to an intermediate token list generated during the execution of step 210.
- In step 214, generator 104 converts the modified token list generated in step 212 into a string, which represents output regular expression 108 (see FIG. 1A and FIG. 1B).
- After step 214, the regular expression generation process ends at step 216.
- If generator 104 (see FIG. 1A and FIG. 1B) determines in step 204 that input phrase 102 is already a regular expression, then the above-described processing of steps 206, 208, 210, 212 and 214 is not performed, the input is passed to the output unchanged, and the regular expression generation process ends at step 216.
- input phrase 102 is:
- generator 104 recognizes that the input phrase is a regular expression and returns the input phrase unchanged as output 108 (see FIG. 1A and FIG. 1B ).
- input phrase 102 is the following phrase:
- generator 104 (see FIG. 1A and FIG. 1B ) outputs the following relaxed regular expression as the result of performing the transformations of steps 206 , 208 , 210 , 212 and 214 :
- Section 5 presented below describes experiments that demonstrate that utilizing the process of FIG. 2 to generate such relaxed regular expressions results in significantly higher recall and similar precision when compared to the input plain text pattern.
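The overall flow of steps 204 through 214 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the test for "already a regular expression" is a simple guess (the text does not say how step 204 decides), rules are modeled as plain functions over a list of word tokens, and whitespace between tokens is relaxed to `\W+` as described for the WHITESPACE rule.

```python
import re

def looks_like_regex(phrase):
    # Assumption: any of these metacharacters signals an existing regex.
    return bool(re.search(r"[\\\[\](){}|*+?^$]", phrase))

def generate_relaxed_regex(phrase, rules):
    """Sketch of steps 204-214; `rules` is an ordered list of functions
    that each take and return a list of tokens."""
    if looks_like_regex(phrase):             # step 204: pass regex through
        return phrase
    tokens = phrase.split()                  # steps 206/208: tokenize
    tokens = [re.escape(t) for t in tokens]  # step 210: escape
    for rule in rules:                       # step 212: apply rules in order
        tokens = rule(tokens)
    # Step 214: join tokens, relaxing delimiters and adding word boundaries.
    return r"\b" + r"\W+".join(tokens) + r"\b"

rx = generate_relaxed_regex("can be reached at", [])
```

With an empty rule list the result already tolerates extra spaces, new lines and punctuation between the words. Semantic relaxations such as abbreviations would be supplied as rule functions.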
- This section includes a sample rule set and algorithms for applying rules in the sample rule set.
- Relaxation rules are defined in a special file 106 (see FIG. 1A and FIG. 1B ), which is loaded when the regular expression generator 104 (see FIG. 1A and FIG. 1B ) is started.
- the rules are composed using a predefined set of modification operators. While the framework for relaxation disclosed herein is generic and can be customized by any number of modification operators, this section restricts its attention to three basic operators: WHITESPACE, REPLACE_WORD and SPLIT_AT_CHARACTER.
- FIG. 3 depicts an example of a rule set 300 that is included in relaxation rule file 106 (see FIG. 1A and FIG. 1B ).
- Rule set 300 includes four rules that are expressed in a simple Extensible Markup Language (XML) format and that include the aforementioned basic operators. Note that in rule set 300, each rule has an attribute <stackposition> that controls the order in which the rules must be applied.
- the operators included in the rules of rule set 300 are briefly described below:
- WHITESPACE: This operator replaces whitespace that has been identified as a token delimiter with the replacement regular expression defined in the attribute <replacement>.
- REPLACE_WORD: This operator replaces a sequence of one or more tokens with a replacement regular expression.
- the tokens “did not” are replaced by a regular expression that matches either the phrase did\s+not or the phrase didn't.
- a token consisting of a single colon character (i.e., “:”) is replaced with a regular expression that allows for arbitrary whitespace before and after the colon.
- SPLIT_AT_CHARACTER: This operator allows a particular token to be split into two tokens based on the presence of a particular character. In the example of FIG. 3, the SPLIT_AT_CHARACTER operator splits a token based on the presence of the colon character.
- a reference to a WHITESPACE rule, a REPLACE_WORD rule or a SPLIT_AT_CHARACTER rule indicates a rule from a rule set, where the rule includes the aforementioned WHITESPACE, REPLACE_WORD or SPLIT_AT_CHARACTER operator, respectively.
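Since FIG. 3 is not reproduced here, the following sketch shows what a rule file of this kind might look like and how it could be loaded in application order. The XML layout, the element names `rules`, `rule`, `search` and `character`, the stack position values, the ascending sort order and the two replacement expressions are all assumptions made for illustration. Only the operator names and the ideas of a stack position and a replacement come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file; not the actual content of FIG. 3.
RULES_XML = r"""
<rules>
  <rule operator="WHITESPACE" stackposition="4">
    <replacement>\W+</replacement>
  </rule>
  <rule operator="REPLACE_WORD" stackposition="2">
    <search>did not</search>
    <replacement>((did\s+not)|(didn't))</replacement>
  </rule>
  <rule operator="REPLACE_WORD" stackposition="3">
    <search>:</search>
    <replacement>\s*:\s*</replacement>
  </rule>
  <rule operator="SPLIT_AT_CHARACTER" stackposition="1">
    <character>:</character>
  </rule>
</rules>
"""

def load_rules(xml_text):
    """Parse the rule file and return the rules sorted by stack position."""
    root = ET.fromstring(xml_text.strip())
    rules = [
        {
            "operator": r.get("operator"),
            "stackposition": int(r.get("stackposition")),
            "params": {child.tag: (child.text or "") for child in r},
        }
        for r in root.findall("rule")
    ]
    # Assumption: a lower stack position means the rule is applied earlier.
    return sorted(rules, key=lambda r: r["stackposition"])

rules = load_rules(RULES_XML)
```

Keeping the order in the data rather than in code matches the stated goal of changing behavior by editing rules instead of software.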
- FIG. 4A depicts an algorithm 400 whose execution applies a REPLACE_WORD rule included in the rule set of FIG. 3 , in accordance with embodiments of the present invention.
- Algorithm 400 takes as input three parameters: (1) a search phrase, which is did not in this example; (2) the replacement regular expression which replaces the search phrase (e.g., ((did\s+not)
- the input to algorithm 400 is a set of tokens that has been tokenized by a whitespace tokenizer and in which regular expression special characters have been escaped already.
- Algorithm 400 produces an output list of tokens which includes the replacements made by using the aforementioned replacement regular expression to replace any occurrence of the search phrase.
- all offsets i.e., ordered from their left to right occurrences
- the search phrase matches the tokenized input (see line 1 of algorithm 400 ).
- an empty list of tokens is initialized (see line 2 of algorithm 400 ) to eventually hold the set of modified tokens.
- all tokens before the offset are copied to the output token set (see line 7 of algorithm 400 ).
- the token for the replacement regular expression is added (see line 8 of algorithm 400 ).
- the tokens from the last replacement tokens are added until the end of the input list is reached (see line 11 of algorithm 400 ).
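The replacement procedure described for algorithm 400 can be sketched as below. This is a reading of the prose, not a transcription of FIG. 4A: the `DELIM` and `BOUNDARY` sentinels, the function name and the handling of overlapping matches are assumptions, and the replacement expression is the form suggested by the REPLACE_WORD description.

```python
# Hypothetical sentinels for the internal token representation.
DELIM, BOUNDARY = object(), object()

def replace_word(tokens, search_phrase, replacement):
    """Replace every occurrence of search_phrase with one replacement token."""
    # Build the token sequence to look for, e.g. ["did", DELIM, "not"].
    search = []
    for word in search_phrase.split():
        if search:
            search.append(DELIM)
        search.append(word)
    n = len(search)
    if n == 0:
        return list(tokens)
    # All offsets, left to right, at which the search phrase matches.
    offsets = [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == search]
    output = []  # empty list that will hold the modified tokens
    pos = 0
    for off in offsets:
        if off < pos:                   # skip matches overlapping a prior one
            continue
        output.extend(tokens[pos:off])  # copy tokens before the offset
        output.append(replacement)      # add the replacement token
        pos = off + n
    output.extend(tokens[pos:])         # copy the remaining tokens
    return output

tokens = [BOUNDARY, "I", DELIM, "did", DELIM, "not", DELIM, "call", BOUNDARY]
result = replace_word(tokens, "did not", r"((did\s+not)|(didn't))")
```

The three matched tokens collapse into a single token holding the replacement expression, while the surrounding delimiter and boundary tokens are left for the later conversion step.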
- the input phrase I did not call is transformed initially into a tokenized representation that is illustrated in FIG. 4B as a tokenized phrase 420 .
- tokenized phrase 420 is escaped in step 210 of FIG. 2
- the resulting tokenized and escaped phrase is stored in tokenizedInput, the list of input tokens that is input into algorithm 400 (see FIG. 4A ).
- algorithm 400 applies the REPLACE_WORD rule of sample rule set 300 (see FIG. 3) to replace all occurrences of the search phrase did not in tokenizedInput by the replacement regular expression ((did\s+not)
- FIG. 4C depicts an exemplary set of tokens 440 that result from executing algorithm 400 (see FIG. 4A ) to apply the REPLACE_WORD rule of rule set 300 (see FIG. 3 ) to tokenized phrase 420 (see FIG. 4B ).
- the set of tokens 440 is generated by performing step 212 of FIG. 2 .
- step 214 (see FIG. 2) generates a conversion of the set of tokens 440 by replacing each DELIM token with the WHITESPACE token defined in rule set 300 (see FIG. 3) (i.e., \W+) and by replacing each BOUNDARY token with \b (i.e., the regular expression syntax for denoting word boundaries).
- the result of the aforementioned replacements in step 214 is an output regular expression 460 depicted in FIG. 4D .
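The conversion in step 214 amounts to a token-by-token substitution followed by concatenation. The sketch below assumes sentinel objects for DELIM and BOUNDARY and a hypothetical function name. The token list is an illustrative stand-in for the one in FIG. 4C, and the resulting string is not claimed to be identical to FIG. 4D.

```python
import re

# Hypothetical sentinels for the internal token representation.
DELIM, BOUNDARY = object(), object()

def tokens_to_regex(tokens, whitespace_replacement=r"\W+"):
    """Replace DELIM and BOUNDARY tokens and join everything into a string."""
    mapping = {DELIM: whitespace_replacement, BOUNDARY: r"\b"}
    return "".join(mapping.get(tok, tok) for tok in tokens)

tokens = [BOUNDARY, "I", DELIM, r"((did\s+not)|(didn't))", DELIM, "call", BOUNDARY]
regex = tokens_to_regex(tokens)
```

Under these assumptions the generated expression accepts the original phrase, the contracted form, and variants with extra whitespace or punctuation between the words.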
- FIG. 5A depicts an algorithm 500 whose execution applies a SPLIT_AT_CHARACTER rule included in the rule set of FIG. 3 , in accordance with embodiments of the present invention.
- Algorithm 500 takes as input a list of input tokens (i.e., tokenizedInput in algorithm 500), which is a tokenized and escaped set of tokens resulting from step 210 of FIG. 2.
- the input to algorithm 500 is a set of tokens that has been tokenized by a whitespace tokenizer and in which regular expression special characters have been escaped already.
- Algorithm 500 applies the SPLIT_AT_CHARACTER rule, which splits up a token based on the presence of a colon character. For example, consider the following input phrase to algorithm 500 :
- Executing algorithm 500 in step 212 applies the SPLIT_AT_CHARACTER rule of rule set 300 (see FIG. 3 ) to the token list shown above.
- the application of the SPLIT_AT_CHARACTER rule splits on the colon included in the token list shown above and generates a token list 520 shown in FIG. 5B.
- the second REPLACE_WORD rule of rule set 300 is applied to generate a token list 540 shown in FIG. 5C . That is, the REPLACE_WORD rule in FIG. 3 that includes the colon as the search phrase is applied to generate token list 540 .
- Token list 540 is the result of executing algorithm 400 of FIG. 4A in step 212 (see FIG. 2 ).
- step 214 converts token list 540 into a regular expression by replacing the BOUNDARY tokens with \b (i.e., the regular expression syntax for denoting word boundaries).
- the result of the conversion in step 214 is an output regular expression 560 depicted in FIG. 5D .
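A split of the kind attributed to algorithm 500 can be sketched as follows. The exact behavior of FIG. 5A is not reproduced in the text, so this sketch makes several assumptions: the split character is kept as its own token (so that a later REPLACE_WORD rule can find it), only the first occurrence in a token is split, and empty pieces are dropped. The function name and the sample tokens are hypothetical.

```python
def split_at_character(tokens, char=":"):
    """Split each string token at the first occurrence of `char`,
    keeping `char` as a separate token."""
    output = []
    for tok in tokens:
        if isinstance(tok, str) and char in tok and tok != char:
            head, _, tail = tok.partition(char)
            # Keep only non-empty pieces, in their original order.
            output.extend(piece for piece in (head, char, tail) if piece)
        else:
            output.append(tok)  # sentinels and other tokens pass through
    return output

split = split_at_character(["Phone:", "5551234"])
```

Once the colon stands alone as a token, a REPLACE_WORD rule whose search phrase is the colon can swap it for an expression that tolerates whitespace on either side.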
- FIG. 6 is a table 600 of entities and relationships selected for the experiments in this section.
- the entities and relationships of table 600 are selected from the Enron email dataset. A constant window of 30 characters was used for each selected relationship, as indicated by the values in the #chars column of table 600 .
- Precision determines the number of matched annotations against the number of correct annotations.
- Correct entity type The entities must match the correct type. For example, I can be reached at is not counted as a correct match if the requested entity is a Person and not the Author of the email. As another example, Paul can be reached at his fax number 5223 is not counted as a correct match since the requested entity is not a phone number.
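For readers who want to reproduce this kind of evaluation, the sketch below computes precision and recall using their standard information retrieval definitions. It is an assumption that the experiments were scored this way; the function name and the sample annotation identifiers are hypothetical and are not data from the experiments.

```python
def precision_recall(matched, correct):
    """Standard definitions: precision = correct matches / all matches,
    recall = correct matches / all correct annotations."""
    matched, correct = set(matched), set(correct)
    true_positives = len(matched & correct)
    precision = true_positives / len(matched) if matched else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall

# Hypothetical annotation identifiers for illustration only.
p, r = precision_recall(matched={1, 2, 3, 4}, correct={2, 3, 5})
```

A relaxed expression is expected to enlarge the matched set; the stated goal is that most of the added matches are also in the correct set, so recall rises while precision stays roughly level.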
- FIG. 7A is a table 700 of results of the investigation of the person . . . phone number relationship.
- an annotator for a person . . . phone number relationship relates the phone number and a verb.
- this person . . . phone number relationship is modeled using multiple different handcrafted expressions based on the following native English phrases: can be reached at, can be contacted at, a call at, #, number is, and at. All of the aforementioned native English phrases express the relationship give me the phone number of a person, and therefore are handled as a single semantic relationship.
- the high precision and recall of this set of experiments shown in table 700 is mainly due to the influence of the “strong” pattern of the phrase “at”.
- an entity recognizer is a known component that recognizes entities (e.g., persons, phone numbers, organizations, etc.) for an information extraction task.
- An entity recognizer may be a component (not shown) of a system that includes relaxed regular expression generator 104 (see FIG. 1A and FIG. 1B ).
- FIG. 7B is a table 720 of results of the investigation of the person . . . person relationship.
- the reason for the high precision of the handcrafted regular expression is the usage of the right regular expression line limiter $ and the definition of selected optional words before (e.g., research and executive) and after (e.g., to and is) the noun assistant.
- detecting semantically relevant words before and after the native English input is far beyond the scope of a pure syntactic regular expression generator.
- improving the performance of the entity recognizer will enhance the precision of the generated regular expressions significantly.
- FIG. 7C is a table 740 of results of the investigation of the person . . . organization relationship.
- the reason for the low recall of the handcrafted regular expression is the line boundary tokens ^ and $, in particular for the phrase working with.
- the regular expression generator is improved by including an option to switch this line boundary functionality off or on.
- the regular expression generator is improved by including an option that allows a user to define how many words are ignored before and after the native English input.
- FIG. 7D is a table 760 of results of the investigation of the organization . . . organization relationship.
- FIG. 8 is a block diagram of a computing unit 800 that includes a relaxed regular expression generator 104 of the system of FIG. 1A or FIG. 1B and that implements the process of FIG. 2 , in accordance with embodiments of the present invention.
- Computing unit 800 generally comprises a central processing unit (CPU) 802 , a memory 804 , an input/output (I/O) interface 806 and a bus 808 , and is coupled to I/O devices 810 and a storage unit 812 .
- CPU 802 performs computation and control functions of computing unit 800 .
- CPU 802 may comprise a single processing unit, or be distributed across one or more processing units in one or more locations (e.g., on a client and server).
- Memory 804 may comprise any known type of data storage and/or transmission media, including bulk storage, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc.
- Cache memory elements of memory 804 provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- Storage unit 812 is, for example, a magnetic disk drive or an optical disk drive that stores data including relaxation rule file 106 .
- memory 804 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms. Further, memory 804 can include data distributed across, for example, a LAN, WAN or storage area network (SAN) (not shown).
- I/O interface 806 comprises any system for exchanging information to or from an external source.
- I/O devices 810 comprise any known type of external device, including a display monitor, keyboard, mouse, printer, speakers, handheld device, facsimile, etc.
- Bus 808 provides a communication link between each of the components in computing unit 800 , and may comprise any type of transmission link, including electrical, optical, wireless, etc.
- I/O interface 806 also allows computing unit 800 to store and retrieve information (e.g., program instructions or data) from an auxiliary storage device (e.g., storage unit 812 ).
- the auxiliary storage device may be a non-volatile storage device (e.g., a CD-ROM drive which receives a CD-ROM disk).
- Computing unit 800 can store and retrieve information from other auxiliary storage devices (not shown), which can include a direct access storage device (DASD) (e.g., hard disk or floppy diskette), a magneto-optical disk drive, a tape drive, or a wireless communication device.
- Memory 804 includes program code for relaxed regular expression generator 104. Further, memory 804 may include other systems not shown in FIG. 8, such as an operating system (e.g., Linux) that runs on CPU 802 and provides control of various components within and/or connected to computing unit 800.
- the invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements.
- the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
- the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code 104 for use by or in connection with a computing system 800 or any instruction execution system to provide and facilitate the capabilities of the present invention.
- a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- the medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
- Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, RAM 804 , ROM, a rigid magnetic disk and an optical disk.
- Current examples of optical disks include compact disk-read-only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
- any of the components of the present invention can be deployed, managed, serviced, etc. by a service provider that offers to deploy or integrate computing infrastructure with respect to the method of automatically generating regular expressions for relaxed matching of text patterns.
- the present invention discloses a process for supporting computer infrastructure, comprising integrating, hosting, maintaining and deploying computer-readable code into a computing system (e.g., computing unit 800 ), wherein the code in combination with the computing unit is capable of performing a method of automatically generating regular expressions for relaxed matching of text patterns.
- the invention provides a business method that performs the process steps of the invention on a subscription, advertising and/or fee basis. That is, a service provider, such as a Solution Integrator, can offer to create, maintain, support, etc. a method of automatically generating regular expressions for relaxed matching of text patterns. In this case, the service provider can create, maintain, support, etc. a computer infrastructure that performs the process steps of the invention for one or more customers. In return, the service provider can receive payment from the customer(s) under a subscription and/or fee agreement, and/or the service provider can receive payment from the sale of advertising content to one or more third parties.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
- The present invention relates to a method and system for automatically generating regular expressions for relaxed matching of text patterns.
- One category of information extraction employs query expansion and other query processing techniques in search engines. Conventional query expansion techniques generate an expanded output query from an original query, where the expanded output query includes additional words obtained from a synonym dictionary. The results of the expanded output query are documents that contain either the keywords of the original query or the additional words from the synonym dictionary. Being based on a natural language dictionary (e.g., standard English dictionary), the synonym dictionary is limited in its ability to match certain text pattern variations related to punctuation, spacing, new lines between words, arbitrary capitalization, colloquial abbreviations, etc. Further, known query processing techniques that employ stemming and stop word removal decrease precision in information retrieval results. Another category of information extraction is rule-based and utilizes regular expressions. Conventional tools (e.g., Expresso offered by Ultrapico) in this second category allow a programmer to generate a regular expression using a graphical user interface and to check the syntax of a generated regular expression. These known regular expression generation tools are hampered by restricted usability because their users are required to have knowledge of the formulation and usage of syntactic constructs in regular expressions. Thus, there exists a need to overcome at least one of the preceding deficiencies and limitations of the related art.
- The present invention provides a computer-implemented method of automatically generating regular expressions for relaxed matching of text patterns, comprising:
- receiving, by a computing system, an input phrase expressed in a natural language;
- determining, by the computing system, that the input phrase is a plain text pattern;
- automatically tokenizing, by the computing system, the plain text pattern, wherein the automatically tokenizing includes automatically generating a first token list;
- automatically applying, by the computing system, one or more rules to the first token list, wherein the automatically applying includes automatically modifying the first token list and automatically generating a modified token list in response to the automatically modifying the first token list; and
- automatically converting, by the computing system, the modified token list into a regular expression, wherein the regular expression matches the plain text pattern and one or more variations of the plain text pattern.
- A system and computer program product corresponding to the above-summarized method are also described and claimed herein.
- Advantageously, the present invention provides a technique for automatically generating regular expressions for a relaxed matching of text patterns. Further, the present invention provides a generic, extensible, and widely applicable rule-based framework in which the automatic generation of regular expressions is based on the creation and updating of rules without requiring the writing and maintenance of complex and customized software programs.
- FIG. 1A is a block diagram of a first system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
- FIG. 1B is a block diagram of a second system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention.
- FIG. 2 is a flow diagram of a regular expression generation process implemented by the system of FIG. 1A or FIG. 1B, in accordance with embodiments of the present invention.
- FIG. 3 depicts an example of a rule set included in the relaxation rule file of the system of FIG. 1A or FIG. 1B, in accordance with embodiments of the present invention.
- FIG. 4A depicts an algorithm to apply a REPLACE_WORD rule included in the rule set of FIG. 3, in accordance with embodiments of the present invention.
- FIG. 4B depicts an exemplary tokenized phrase generated via the process of FIG. 2, in accordance with embodiments of the present invention.
- FIG. 4C depicts an exemplary set of tokens resulting from executing the algorithm of FIG. 4A to apply the REPLACE_WORD rule included in the rule set of FIG. 3 to an escaped version of the phrase of FIG. 4B, in accordance with embodiments of the present invention.
- FIG. 4D depicts an exemplary regular expression generated by replacing tokens in the set of tokens of FIG. 4C via the process of FIG. 2, in accordance with embodiments of the present invention.
- FIG. 5A depicts an algorithm to apply a SPLIT_AT_CHARACTER rule included in the rule set of FIG. 3, in accordance with embodiments of the present invention.
- FIG. 5B depicts an exemplary set of tokens resulting from applying the rule of FIG. 5A via the process of FIG. 2, in accordance with embodiments of the present invention.
- FIG. 5C depicts an exemplary set of tokens resulting from applying the rule of FIG. 4A to the set of tokens of FIG. 5B via the process of FIG. 2, in accordance with embodiments of the present invention.
- FIG. 5D depicts an exemplary regular expression generated by replacing tokens in the set of tokens of FIG. 5C via the process of FIG. 2, in accordance with embodiments of the present invention.
- FIG. 6 is a table of entities and relationships used in experiments for determining recall and precision of regular expressions generated by the process of FIG. 2, in accordance with embodiments of the present invention.
- FIGS. 7A-7D are tables of results of four sets of experiments organized according to the table of FIG. 6, where the experiments are for determining recall and precision of regular expressions generated by the process of FIG. 2, in accordance with embodiments of the present invention.
- FIG. 8 is a block diagram of a computing unit that includes a relaxed regular expression generator of the system of FIG. 1A or FIG. 1B, in accordance with embodiments of the present invention.
- The goal of information extraction (IE) is to extract structured information from unstructured text (a.k.a. plain text) (e.g., documents, files, emails, web pages, etc.). In rule-based IE, rules are written that describe textual patterns of interest, which are to be extracted from unstructured text. Regular expressions are used for expressing such textual patterns of interest. As used herein, a regular expression is defined as a compact representation that describes a set of strings without listing all the elements of the set. A regular expression matches each of the strings in the set.
- For example, consider the information extraction task of identifying text patterns that associate a person with his or her phone number. A text pattern of interest for this example is the phrase “can be reached at”. Using such a pattern, a rule-based IE system identifies occurrences of the form “<Person> can be reached at <Phone>” and generates the corresponding pairs of related Persons and Phones. In free-form text, however, the phrase “can be reached at” may occur with several variations: extra punctuation, multiple spaces or new lines between words, arbitrary capitalization, colloquial abbreviations for words (e.g., “reached” abbreviated as “rchd”). Such variation in text is particularly true for informal communication mediums such as email where the formatting and style of the text is not strictly controlled. A regular expression is used to account for the original input phrase “can be reached at” as well as the multiple variations.
- The task of creating a regular expression that matches not only an original input phrase like “can be reached at” in the example presented above, but also its variations, is beyond the knowledge of the average untrained user of an information extraction system. The present invention addresses this problem by providing a generic and extensible rule-based framework for automatically generating a regular expression from a given input phrase (i.e., a plain text pattern) provided by a user. The input phrase is provided in a natural, human language (e.g., a user's native English). The regular expression output by the present invention improves the recall (i.e., increases the set of occurrences of the input phrase and its variations that are identified in the text) with little or no decrease in precision (i.e., without increasing the identification of spurious instances in the text).
- As used herein, relaxation is the method of the present invention that converts a plain text pattern to an output regular expression that matches the original plain text pattern and that matches other strings that are variations of the original plain text pattern. The overall algorithm whose execution provides relaxation is referred to herein as the relaxed regular expression generator. The relaxation disclosed herein includes syntactic relaxation and semantic relaxation. Syntactic relaxation includes matching to text patterns whose variation from the original plain text pattern is based on primarily syntactic aspects of the original plain text pattern such as punctuation and whitespace between words (i.e., matching to patterns that have different punctuation and/or whitespace while having the same words and the same meaning as the original plain text pattern). Semantic relaxation includes matching to text patterns whose variation from the original plain text pattern is based on a modification of the words of the original plain text pattern while retaining the meaning of the original plain text pattern.
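- As an illustration only, syntactic relaxation of a phrase made of plain words can be sketched in Python as follows. The function name relax_syntactically is hypothetical, the use of \W+ between tokens and \b around tokens anticipates the conventions described later in this description, and the case-insensitive flag is an assumption made here to cover arbitrary capitalization; none of this is the disclosed implementation.

```python
import re


def relax_syntactically(phrase):
    """Build a relaxed regular expression from a plain text phrase (sketch)."""
    # Tokenize on whitespace and escape regex special characters in each token.
    tokens = [re.escape(tok) for tok in phrase.split()]
    # Wrap each token in word boundaries and join the tokens with \W+, so that
    # any run of non-word characters (spaces, new lines, punctuation) may
    # separate them. This assumes every token starts and ends with a word
    # character; tokens such as a lone ":" would need a separate rule.
    return r"\W+".join(rf"\b{tok}\b" for tok in tokens)


# Assumption: capitalization variants are handled with a case-insensitive flag.
pattern = re.compile(relax_syntactically("can be reached at"), re.IGNORECASE)

hit_plain = pattern.search("He can be reached at 555-0100")
hit_varied = pattern.search("she CAN  be\nreached, at 555-0100")
miss_abbrev = pattern.search("can be rchd at 555-0100")
```

- In this sketch the second string still matches despite the extra space, the new line, the comma and the capitalization, whereas the abbreviated form “rchd” does not, because abbreviations change the words themselves and therefore fall under semantic relaxation.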
- FIG. 1A is a block diagram of a first system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention. First system 100 includes a user input phrase 102, a relaxed regular expression generator 104, a relaxation rule file 106 and an output regular expression 108. User input phrase 102 is input into relaxed regular expression generator 104 as a phrase expressed in a natural, human language (e.g., a native English phrase). Relaxed regular expression generator 104 obtains relaxation rules from relaxation rule file 106 and applies the obtained rules to user input phrase 102 to automatically generate regular expression 108 as output. The relaxation rules in file 106 are predefined manually by, for example, an administrator of system 100. Hereinafter, relaxed regular expression generator 104 is also referred to simply as regular expression generator 104 or generator 104. Relaxation rules included in relaxation rule file 106 are also referred to herein simply as rules. The functionalities of the components of system 100 are described in more detail below relative to FIG. 2.
- In one embodiment, system 100 includes an information extraction system (not shown) that includes an annotator generator (not shown). The annotator generator is coupled to relaxed regular expression generator 104. In this embodiment, generator 104 receives as input an annotator rule expressed in a natural, human language and outputs an annotator rule as regular expression 108. The output regular expression is a relaxed regular expression in that it matches the original input annotator rule as well as variations of the annotator rule. The annotator generator then uses output regular expression 108 to generate an annotator that facilitates information extraction.
- In another embodiment, system 100 includes a search engine (not shown) that is coupled to relaxed regular expression generator 104. In this embodiment, generator 104 receives as input a search query expressed in a natural, human language and outputs a query as regular expression 108. The output regular expression 108 is a relaxed regular expression in that it matches the input search query as well as variations of the search query. The search engine then uses output regular expression 108 to generate results (e.g., documents) of a search that uses the input search query and its variations.
- FIG. 1B is a block diagram of a second system for automatically generating regular expressions for relaxed matching of text patterns, in accordance with embodiments of the present invention. Second system 120 implements another embodiment of the present invention and includes user input phrase 102, relaxed regular expression generator 104, a software-based rule learning component 122, one or more output relaxation rules 124 and an output regular expression 108. Again, user input phrase 102 is expressed in a natural, human language and is input into generator 104. In this embodiment, rule learning component 122 automatically learns one or more relaxation rules and outputs one or more rules 124, which are then obtained by generator 104 and applied by generator 104 to user input phrase 102 to generate regular expression 108.
- Similar to the embodiment described above relative to system 100 (see FIG. 1A), system 120 may include an information extraction system (not shown) that includes an annotator generator (not shown) coupled to relaxed regular expression generator 104. The functionality of the information extraction system, the annotator generator and generator 104 is the same as described above relative to system 100 (see FIG. 1A). In another embodiment similar to an embodiment described above relative to system 100 (see FIG. 1A), system 120 may include a search engine coupled to generator 104. The functionality of the search engine and generator 104 is the same as described above relative to system 100 (see FIG. 1A).
- FIG. 2 is a flow diagram of a regular expression generation process implemented by the system of FIG. 1A or FIG. 1B, in accordance with embodiments of the present invention. The regular expression generation process starts at step 200. In step 202, regular expression generator 104 (see FIG. 1A and FIG. 1B) loads and parses a predefined rule set from relaxation rule file 106. The predefined rule set loaded in step 202 includes rules that include predefined modification operators (e.g., replacement and splitting operators). Each predefined modification operator may be employed by one or more rules in the rule set. Each rule in the predefined rule set that employs a predefined modification operator specifies one or more attributes or one or more parameters used when applying the modification operator to a token list.
- In step 204, regular expression generator 104 (see FIG. 1A and FIG. 1B) receives user input phrase 102 and determines whether user input phrase 102 is already a regular expression or whether phrase 102 is a plain text pattern.
- If step 204 determines that phrase 102 is a plain text pattern, then in step 206 generator 104 (see FIG. 1A and FIG. 1B) detects word boundaries in phrase 102 and tokenizes the plain text pattern that comprises phrase 102 to generate a set of input tokens. In step 208, generator 104 (see FIG. 1A and FIG. 1B) maps each of the aforementioned input tokens to a specific, internal representation for the system (e.g., system 100 of FIG. 1A) to produce a token list (i.e., a sequence of tokens).
- In step 210, generator 104 (see FIG. 1A and FIG. 1B) replaces regular expression special characters in each entry of the token list produced in step 208 with escaped characters to generate a transformed token list (i.e., a tokenized and escaped phrase). For example, in step 210, Java® regular expression characters in a token list produced in step 208 are replaced with escaped characters.
- In step 212, generator 104 (see FIG. 1A and FIG. 1B) applies one or more rules from the predefined rule set loaded in step 202 to the token list generated in step 210 in an order specified in relaxation rule file 106. The application of the one or more rules in step 212 generates a modified token list (a.k.a. a tokenized and modified phrase) that is a transformed version of input phrase 102. For any applied rule that includes a modification operator, step 212 includes applying the modification operator to the token list generated in step 210 or to an intermediate token list generated during the execution of step 210.
- In step 214, generator 104 converts the modified token list generated in step 212 into a string, which represents output regular expression 108 (see FIG. 1A and FIG. 1B). Following step 214, the regular expression generation process ends at step 216.
- Returning to step 204, if generator 104 (see FIG. 1A and FIG. 1B) determines that input phrase 102 is already a regular expression, then the above-described processing steps are skipped and the process ends at step 216. For example, given that input phrase 102 is:
- meet\s+(\w+\s+){0,5}<RoomNumber>
- generator 104 (see FIG. 1A and FIG. 1B) recognizes that the input phrase is a regular expression and returns the input phrase unchanged as output 108 (see FIG. 1A and FIG. 1B).
- If, however, input phrase 102 is the following phrase:
- meet at <RoomNumber>
- then generator 104 (see FIG. 1A and FIG. 1B) outputs the following relaxed regular expression as the result of performing the transformations described above:
- \bmeet\b\W+\bat\b
- which matches any string in which meet and at are adjacent words separated by one or more non-word characters (e.g., arbitrary whitespace). Section 5 presented below describes experiments that demonstrate that utilizing the process of FIG. 2 to generate such relaxed regular expressions results in significantly higher recall and similar precision when compared to the input plain text pattern.
- This section includes a sample rule set and algorithms for applying rules in the sample rule set.
- Relaxation rules are defined in a special file 106 (see FIG. 1A and FIG. 1B), which is loaded when the regular expression generator 104 (see FIG. 1A and FIG. 1B) is started. The rules are composed using a predefined set of modification operators. While the framework for relaxation disclosed herein is generic and can be customized by any number of modification operators, this section restricts its attention to three basic operators: WHITESPACE, REPLACE_WORD and SPLIT_AT_CHARACTER. FIG. 3 depicts an example of a rule set 300 that is included in relaxation rule file 106 (see FIG. 1A and FIG. 1B). Rule set 300 includes four rules that are expressed in a simple Extensible Markup Language (XML) format and that include the aforementioned basic operators. Note that in rule set 300, each rule has an attribute <stackposition> that controls the order in which the rules must be applied. The operators included in the rules of rule set 300 are briefly described below:
- WHITESPACE: This operator replaces whitespace that has been identified as a token delimiter with the replacement regular expression defined in the attribute <replacement>.
- REPLACE_WORD: This operator replaces a sequence of one or more tokens with a replacement regular expression. In the example shown in FIG. 3, the tokens “did not” are replaced by a regular expression that matches either the phrase did\s+not or the phrase didn't. Similarly, a token consisting of a single colon character (i.e., “:”) is replaced with a regular expression that allows for arbitrary whitespace before and after the colon.
- SPLIT_AT_CHARACTER: This operator allows a particular token to be split into two tokens based on the presence of a particular character. In the example of FIG. 3, the SPLIT_AT_CHARACTER operator splits a token based on the presence of the colon character.
- Hereinafter, a reference to a WHITESPACE rule, a REPLACE_WORD rule or a SPLIT_AT_CHARACTER rule indicates a rule from a rule set, where the rule includes the aforementioned WHITESPACE, REPLACE_WORD or SPLIT_AT_CHARACTER operator, respectively.
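- For illustration, a rule set of this kind and the splitting operator can be sketched in Python as follows. The Rule class, the concrete stack positions, the convention that lower stack positions are applied first, the colon replacement \s*:\s* and the choice to keep the colon as its own token after splitting are all assumptions made for this sketch; the XML of FIG. 3 is not reproduced in this text, so the sketch should not be read as its contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    operator: str          # "WHITESPACE", "REPLACE_WORD" or "SPLIT_AT_CHARACTER"
    stack_position: int    # mirrors <stackposition>; lower is applied first (assumed)
    search: str = ""       # phrase or character the rule looks for
    replacement: str = ""  # mirrors <replacement>


# Hypothetical stand-in for a four-rule set; positions and the colon
# replacement are illustrative assumptions.
RULES = [
    Rule("WHITESPACE", 4, replacement=r"\W+"),
    Rule("REPLACE_WORD", 2, search="did not",
         replacement=r"((did\s+not)|(didn\'t))"),
    Rule("SPLIT_AT_CHARACTER", 1, search=":"),
    Rule("REPLACE_WORD", 3, search=":", replacement=r"\s*:\s*"),
]


def split_at_character(tokens, char):
    """Split each token at the first occurrence of `char` (sketch).

    The character is kept as a token of its own so that a later
    REPLACE_WORD rule searching for that character can match it.
    """
    out = []
    for tok in tokens:
        if char in tok and tok != char:
            head, _, tail = tok.partition(char)
            out.extend(part for part in (head, char, tail) if part)
        else:
            out.append(tok)
    return out


# Rules are applied in ascending stack position.
ordered = sorted(RULES, key=lambda rule: rule.stack_position)
split_tokens = split_at_character(["phonenumber:123-4567-890"], ":")
```

- Keeping the rules as data and sorting them by stack position reflects the idea that new relaxations are added by editing the rule file rather than by changing the generator itself.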
- FIG. 4A depicts an algorithm 400 whose execution applies a REPLACE_WORD rule included in the rule set of FIG. 3, in accordance with embodiments of the present invention. Algorithm 400 takes as input three parameters: (1) a search phrase, which is did not in this example; (2) the replacement regular expression which replaces the search phrase (e.g., ((did\s+not)|(didn\'t)) is the replacement regular expression that replaces did not); and (3) a list of input tokens (i.e., tokenizedInput in algorithm 400), which is a tokenized and escaped set of tokens resulting from step 210 of FIG. 2. For example, the input to algorithm 400 is a set of tokens that has been tokenized by a whitespace tokenizer and in which regular expression special characters have been escaped already.
- Algorithm 400 produces an output list of tokens which includes the replacements made by using the aforementioned replacement regular expression to replace any occurrence of the search phrase.
- During an initialization phase, all offsets at which the search phrase matches the tokenized input are determined, ordered from left to right (see line 1 of algorithm 400). Furthermore, an empty list of tokens is initialized (see line 2 of algorithm 400) to eventually hold the set of modified tokens. After the initialization, for each offset, all tokens before the offset are copied to the output token set (see line 7 of algorithm 400). Next, the token for the replacement regular expression is added (see line 8 of algorithm 400). Finally, after considering all offsets, the tokens following the last replacement are added until the end of the input list is reached (see line 11 of algorithm 400).
- In the example of Section 4, the input phrase I did not call is transformed initially into a tokenized representation that is illustrated in FIG. 4B as a tokenized phrase 420. After tokenized phrase 420 is escaped in step 210 of FIG. 2, the resulting tokenized and escaped phrase is stored in tokenizedInput, the list of input tokens that is input into algorithm 400 (see FIG. 4A). Then algorithm 400 applies the REPLACE_WORD rule of sample rule set 300 (see FIG. 3) to replace all occurrences of the search phrase did not in tokenizedInput by the replacement regular expression ((did\s+not)|(didn\'t)).
- FIG. 4C depicts an exemplary set of tokens 440 that result from executing algorithm 400 (see FIG. 4A) to apply the REPLACE_WORD rule of rule set 300 (see FIG. 3) to tokenized phrase 420 (see FIG. 4B). The set of tokens 440 is generated by performing step 212 of FIG. 2. Following the generation of the set of tokens 440, step 214 (see FIG. 2) generates a conversion of the set of tokens 440 by replacing each DELIM token with the WHITESPACE token defined in rule set 300 (see FIG. 3) (i.e., \W+) and by replacing each BOUNDARY token with \b (i.e., the regular expression syntax for denoting word boundaries). The result of the aforementioned replacements in step 214 (see FIG. 2) is an output regular expression 460 depicted in FIG. 4D.
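- The offset-based replacement described above can be sketched in Python as follows. This is a simplified reading of the prose, not a transcription of FIG. 4A: tokens are modeled as plain strings, the DELIM and BOUNDARY tokens are omitted, the function name replace_word is hypothetical, and the final join with \W+ only approximates the conversion of step 214, so the resulting string is not claimed to equal the expression of FIG. 4D.

```python
import re


def replace_word(search_phrase, replacement, tokenized_input):
    """Replace each occurrence of the search phrase in a token list (sketch)."""
    search = search_phrase.split()
    n = len(search)
    if n == 0:
        return list(tokenized_input)
    # Initialization: collect the left-to-right, non-overlapping offsets at
    # which the search phrase matches the tokenized input.
    offsets, i = [], 0
    while i <= len(tokenized_input) - n:
        if tokenized_input[i:i + n] == search:
            offsets.append(i)
            i += n
        else:
            i += 1
    output, cursor = [], 0
    for offset in offsets:
        output.extend(tokenized_input[cursor:offset])  # copy tokens before the match
        output.append(replacement)                     # add the replacement token
        cursor = offset + n
    output.extend(tokenized_input[cursor:])            # copy the remaining tokens
    return output


tokens = "I did not call".split()
relaxed = replace_word("did not", r"((did\s+not)|(didn\'t))", tokens)
# Simplified conversion: join the tokens with the WHITESPACE replacement.
regex = re.compile(r"\W+".join(relaxed))
```

- Under these assumptions the compiled expression accepts both I did not call and I didn't call, which is the semantic relaxation that the REPLACE_WORD rule is meant to provide.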
- FIG. 5A depicts an algorithm 500 whose execution applies a SPLIT_AT_CHARACTER rule included in the rule set of FIG. 3, in accordance with embodiments of the present invention. Algorithm 500 takes as input a list of input tokens (i.e., tokenizedInput in algorithm 500), which is a tokenized and escaped set of tokens resulting from step 210 of FIG. 2. For example, the input to algorithm 500 is a set of tokens that has been tokenized by a whitespace tokenizer and in which regular expression special characters have been escaped already. Algorithm 500 applies the SPLIT_AT_CHARACTER rule, which splits up a token based on the presence of a colon character. For example, consider the following input phrase to algorithm 500:
- phonenumber: 123-4567-890
- which is represented as the following token list following step 210 of FIG. 2:
- <BOUNDARY> <TXT>phonenumber:123-4567-890<TXT> <BOUNDARY>
- Executing algorithm 500 in step 212 (see FIG. 2) applies the SPLIT_AT_CHARACTER rule of rule set 300 (see FIG. 3) to the token list shown above. The application of the SPLIT_AT_CHARACTER rule splits on the colon included in the token list shown above and generates a token list 520 shown in FIG. 5B.
- Following the application of the SPLIT_AT_CHARACTER rule, the second REPLACE_WORD rule of rule set 300 (see FIG. 3) is applied to generate a token list 540 shown in FIG. 5C. That is, the REPLACE_WORD rule in FIG. 3 that includes the colon as the search phrase is applied to generate token list 540. Token list 540 is the result of executing algorithm 400 of FIG. 4A in step 212 (see FIG. 2).
- Following the generation of token list 540, step 214 (see FIG. 2) converts token list 540 into a regular expression by replacing the BOUNDARY tokens with \b (i.e., the regular expression syntax for denoting word boundaries). The result of the conversion in step 214 (see FIG. 2) is an output regular expression 560 depicted in FIG. 5D.
- This section describes experiments for determining recall and precision of regular expressions generated by the process of FIG. 2. Experiments in this section are based on the Enron email dataset, which was collected and prepared by the CALO Project led by SRI International of Menlo Park, Calif.
- FIG. 6 is a table 600 of entities and relationships selected for the experiments in this section. The entities and relationships of table 600 are selected from the Enron email dataset. A constant window of 30 characters was used for each selected relationship, as indicated by the values in the #chars column of table 600.
- The following metrics are used in this section to measure the efficiency and effectiveness of the selected relationships in table 600:
- Precision: the number of correctly matched annotations relative to the number of all matched annotations.
- Recall: the number of correctly matched annotations relative to the number of all possible relevant annotations.
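- As a small worked illustration, the two metrics can be computed as follows under the standard information retrieval reading, in which both ratios share the number of correct matches as numerator. That reading, the function name precision_recall and the counts are assumptions for this sketch; the counts are not taken from the experiments.

```python
def precision_recall(matched, correct, relevant):
    """Return (precision, recall) from annotation counts (sketch).

    matched:  number of annotations produced by the regular expression
    correct:  number of produced annotations judged correct
    relevant: number of annotations that should have been found
    """
    precision = correct / matched if matched else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts for illustration only.
p, r = precision_recall(matched=50, correct=40, relevant=80)
```

- With these hypothetical counts, 40 of the 50 produced annotations are correct and 40 of the 80 relevant annotations are found.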
- Each generated annotation is manually evaluated using the following constraints:
- Sentence boundaries: Both entities and the relationship must be within the same sentence. Thus, examples like the following are not counted:
- . . . Peter Meyer. He can be reached at 56666.
- Correct entity type: The entities must match the correct type. For example, I can be reached at is not counted as a correct match if the requested entity is a Person and not the Author of the email. As another example, Paul can be reached at his fax number 5223 is not counted as a correct match since the requested entity is not a phone number.
- Four sets of experiments were conducted regarding the recall and precision of the generated regular expressions in contrast to handcrafted regular expressions.
- In the first set of experiments, the relationship between a person and phone number is investigated and is hereinafter referred to as the person . . . phone number relationship. FIG. 7A is a table 700 of results of the investigation of the person . . . phone number relationship. Typically, an annotator for a person . . . phone number relationship relates the phone number and a verb. Currently, this person . . . phone number relationship is modeled using multiple different handcrafted expressions based on the following native English phrases: can be reached at, can be contacted at, a call at, #, number is, and at. All of the aforementioned native English phrases express the relationship give me the phone number of a person, and therefore are handled as a single semantic relationship. The high precision and recall of this set of experiments shown in table 700 is mainly due to the influence of the “strong” pattern of the phrase “at”.
- Improvement potential for the regular expression generator: In the experiment regarding the person . . . phone number relationship, the main reason for false positives is sentence boundaries. A careful sentence boundary detection combined with a co-reference resolution could help to improve the precision. All handcrafted regular expressions use the line limiters ^ and $. These operators lower the recall significantly, while increasing the precision only slightly. In one embodiment, the regular expression generator interface is improved by allowing the user to turn off or turn on this sentence boundary detection feature. Another reason for the loss in precision is the poor performance of an entity recognizer, which influences the precision of the generated regular expressions indirectly. As used herein, an entity recognizer is a known component that recognizes entities (e.g., persons, phone numbers, organizations, etc.) for an information extraction task. An entity recognizer may be a component (not shown) of a system that includes relaxed regular expression generator 104 (see FIG. 1A and FIG. 1B).
- In the second set of experiments, the relationship expressing that one person works for another person is investigated and is hereinafter referred to as the person . . . person relationship. To express the person . . . person relationship, versions of the phrase works for and the noun assistant were used in the second set of experiments. FIG. 7B is a table 720 of results of the investigation of the person . . . person relationship.
- Improvement potential for the regular expression generator: The reason for the high precision of the handcrafted regular expression is the usage of the right regular expression line limiter $ and the definition of selected optional words before (e.g., research and executive) and after (e.g., to and is) the noun assistant. However, detecting semantically relevant words before and after the native English input is far beyond the scope of a pure syntactic regular expression generator. Again, improving the performance of the entity recognizer will enhance the precision of the generated regular expressions significantly.
- In the third set of experiments, the relationship expressing the semantics that a person works for a particular organization is investigated and is hereinafter referred to as the person . . . organization relationship. To express the person . . . organization relationship, the following variants of the verb work and the prepositions with and for were used: works for, working for, work with, and working with. FIG. 7C is a table 740 of results of the investigation of the person . . . organization relationship.
- Improvement potential for the regular expression generator: The reason for the low recall of the handcrafted regular expression is the line boundary tokens ^ and $, in particular for the phrase working with. In one embodiment, the regular expression generator is improved by including an option to switch this line boundary functionality off or on. In another embodiment, the regular expression generator is improved by including an option that allows a user to define how many words are ignored before and after the native English input.
- In the fourth set of experiments, the relationship expressing the semantics that an organization has been merged with or has been acquired by another organization is investigated and is hereinafter referred to as the organization . . . organization relationship. To express the organization . . . organization relationship, the following variants were used: agreed to buy, merged with, acquisition of, acquired, and acquires. FIG. 7D is a table 760 of results of the investigation of the organization . . . organization relationship.
- Improvement potential for the regular expression generator: Again, this experiment shows that the main value of a handcrafted regular expression is the careful disjunctive combination of relevant verbs for a particular relationship (e.g., the combination of the verbs merge and acquire). An ideal generated regular expression is a disjunctive expression consisting of relevant variants for merge and acquire (e.g., merge OR merged OR acquire OR acquired).
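- One way such a disjunctive expression could be assembled from a list of variants is sketched below in Python. The function name disjunction, the non-capturing group and the surrounding word boundaries are illustrative choices made here, not part of the disclosed generator.

```python
import re


def disjunction(variants):
    """Combine phrase variants into one word-bounded alternation (sketch)."""
    # Longest first, so longer variants are tried before their prefixes.
    alternatives = sorted((re.escape(v) for v in variants), key=len, reverse=True)
    return r"\b(?:" + "|".join(alternatives) + r")\b"


expression = disjunction(["merge", "merged", "acquire", "acquired"])
pattern = re.compile(expression)
```

- The resulting pattern matches any one of the listed variants as a whole word, so a sentence containing acquired is matched while a different word such as merger is not.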
- The experiments described above in Section 5 show that generated regular expressions based on native English user input can replace handcrafted regular expressions for derived annotators in Avatar. Generated regular expressions are a powerful concept and, in terms of recall and precision, perform similarly to handcrafted regular expressions. However, for some of the experiments described above, false positives were observed which lower precision and recall. To overcome these shortcomings, the following conclusions for the Avatar implementation are derived:
- 1. The usage of line boundaries, such as ^ and $, enhances the precision slightly, but lowers the recall drastically. Therefore, the regular expression generator does not consider line boundaries.
- 2. Regular expressions matching entities across sentences are a minor source for false positives in one of the experiments. To overcome this problem, only text matches within the boundaries of one sentence are considered. However, a few matches may be missed using this approach. To overcome this problem, further investigations are needed to allow the capture of matching entities across sentences.
- 3. Another major source for false positives is incorrectly identified entities, as recognized from the entity recognizer, which is not part of the regular expression generator. The base annotator for entity recognition has been improved so these false positives will no longer appear.
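The second conclusion above (accepting only matches that lie within one sentence) can be sketched in Python as follows. This is a rough illustration, not the Avatar implementation: the sentence splitter is a deliberately naive assumption (it splits after ., ! or ? followed by whitespace and would mishandle abbreviations), and the names `split_sentences`, `matches_within_sentences`, and the sample texts are hypothetical.

```python
import re

def split_sentences(text):
    # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def matches_within_sentences(pattern, text):
    # Apply the pattern to each sentence separately, so that no match
    # can span a sentence boundary.
    return [m.group(0) for s in split_sentences(text) for m in pattern.finditer(s)]

# A relationship pattern that tolerates arbitrary text between its parts.
cross = re.compile(r"\bAcme\b.*?\bworking with\b.*?\bGlobex\b", re.DOTALL)

doc = "Acme reported earnings. Jane is working with Globex."
same = "Acme is working with Globex. Jane left."

# Applied to the whole text, the pattern produces a match across two sentences.
unbounded = [m.group(0) for m in cross.finditer(doc)]

# Restricted to single sentences, that cross-sentence match disappears.
bounded = matches_within_sentences(cross, doc)
```

As the conclusion notes, this restriction trades a few missed matches for fewer false positives: a relationship that is genuinely expressed across two sentences is no longer found.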
- FIG. 8 is a block diagram of a computing unit 800 that includes a relaxed regular expression generator 104 of the system of FIG. 1A or FIG. 1B and that implements the process of FIG. 2, in accordance with embodiments of the present invention. Computing unit 800 generally comprises a central processing unit (CPU) 802, a memory 804, an input/output (I/O) interface 806 and a bus 808, and is coupled to I/O devices 810 and a storage unit 812. CPU 802 performs computation and control functions of computing unit 800. CPU 802 may comprise a single processing unit, or be distributed across one or more processing units in one or more locations (e.g., on a client and server).
- Memory 804 may comprise any known type of data storage and/or transmission media, including bulk storage, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc. Cache memory elements of memory 804 provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. Storage unit 812 is, for example, a magnetic disk drive or an optical disk drive that stores data including relaxation rule file 106. Moreover, similar to CPU 802, memory 804 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms. Further, memory 804 can include data distributed across, for example, a LAN, WAN or storage area network (SAN) (not shown).
- I/O interface 806 comprises any system for exchanging information to or from an external source. I/O devices 810 comprise any known type of external device, including a display monitor, keyboard, mouse, printer, speakers, handheld device, facsimile, etc. Bus 808 provides a communication link between each of the components in computing unit 800, and may comprise any type of transmission link, including electrical, optical, wireless, etc.
- I/O interface 806 also allows computing unit 800 to store and retrieve information (e.g., program instructions or data) from an auxiliary storage device (e.g., storage unit 812). The auxiliary storage device may be a non-volatile storage device (e.g., a CD-ROM drive which receives a CD-ROM disk). Computing unit 800 can store and retrieve information from other auxiliary storage devices (not shown), which can include a direct access storage device (DASD) (e.g., hard disk or floppy diskette), a magneto-optical disk drive, a tape drive, or a wireless communication device.
- Memory 804 includes program code for relaxed regular expression generator 104. Further, memory 804 may include other systems not shown in FIG. 8, such as an operating system (e.g., Linux) that runs on CPU 802 and provides control of various components within and/or connected to computing unit 102.
- The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
- Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code 104 for use by or in connection with a computing system 800 or any instruction execution system to provide and facilitate the capabilities of the present invention. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, RAM 804, ROM, a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read-only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
- Any of the components of the present invention can be deployed, managed, serviced, etc. by a service provider that offers to deploy or integrate computing infrastructure with respect to the method of automatically generating regular expressions for relaxed matching of text patterns. Thus, the present invention discloses a process for supporting computer infrastructure, comprising integrating, hosting, maintaining and deploying computer-readable code into a computing system (e.g., computing unit 800), wherein the code in combination with the computing unit is capable of performing a method of automatically generating regular expressions for relaxed matching of text patterns.
- In another embodiment, the invention provides a business method that performs the process steps of the invention on a subscription, advertising and/or fee basis. That is, a service provider, such as a Solution Integrator, can offer to create, maintain, support, etc. a method of automatically generating regular expressions for relaxed matching of text patterns. In this case, the service provider can create, maintain, support, etc. a computer infrastructure that performs the process steps of the invention for one or more customers. In return, the service provider can receive payment from the customer(s) under a subscription and/or fee agreement, and/or the service provider can receive payment from the sale of advertising content to one or more third parties.
- The flow diagrams depicted herein are provided by way of example. There may be variations to these diagrams or the steps (or operations) described herein without departing from the spirit of the invention. For instance, in certain cases, the steps may be performed in differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the present invention as recited in the appended claims.
- While embodiments of the present invention have been described herein for purposes of illustration, many modifications and changes will become apparent to those skilled in the art. Accordingly, the appended claims are intended to encompass all such modifications and changes as fall within the true spirit and scope of this invention.
Claims (2)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/850,987 US20090070327A1 (en) | 2007-09-06 | 2007-09-06 | Method for automatically generating regular expressions for relaxed matching of text patterns |
US12/125,290 US8484238B2 (en) | 2007-09-06 | 2008-05-22 | Automatically generating regular expressions for relaxed matching of text patterns |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/850,987 US20090070327A1 (en) | 2007-09-06 | 2007-09-06 | Method for automatically generating regular expressions for relaxed matching of text patterns |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/125,290 Continuation US8484238B2 (en) | 2007-09-06 | 2008-05-22 | Automatically generating regular expressions for relaxed matching of text patterns |
Publications (1)
Publication Number | Publication Date |
---|---|
US20090070327A1 true US20090070327A1 (en) | 2009-03-12 |
Family
ID=40432984
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/850,987 Abandoned US20090070327A1 (en) | 2007-09-06 | 2007-09-06 | Method for automatically generating regular expressions for relaxed matching of text patterns |
US12/125,290 Expired - Fee Related US8484238B2 (en) | 2007-09-06 | 2008-05-22 | Automatically generating regular expressions for relaxed matching of text patterns |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/125,290 Expired - Fee Related US8484238B2 (en) | 2007-09-06 | 2008-05-22 | Automatically generating regular expressions for relaxed matching of text patterns |
Country Status (1)
Country | Link |
---|---|
US (2) | US20090070327A1 (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7730011B1 (en) | 2005-10-19 | 2010-06-01 | Mcafee, Inc. | Attributes of captured objects in a capture system |
US8812459B2 (en) * | 2009-04-01 | 2014-08-19 | Touchstone Systems, Inc. | Method and system for text interpretation and normalization |
US11423029B1 (en) | 2010-11-09 | 2022-08-23 | Google Llc | Index-side stem-based variant generation |
US9317499B2 (en) | 2013-04-11 | 2016-04-19 | International Business Machines Corporation | Optimizing generation of a regular expression |
US9898467B1 (en) * | 2013-09-24 | 2018-02-20 | Amazon Technologies, Inc. | System for data normalization |
US9471875B2 (en) * | 2013-12-31 | 2016-10-18 | International Business Machines Corporation | Using ontologies to comprehend regular expressions |
CN105868166B (en) | 2015-01-22 | 2020-01-17 | 阿里巴巴集团控股有限公司 | Regular expression generation method and system |
US9916296B2 (en) | 2015-09-24 | 2018-03-13 | International Business Machines Corporation | Expanding entity and relationship patterns to a collection of document annotators using run traces |
US10268750B2 (en) * | 2016-01-29 | 2019-04-23 | Cisco Technology, Inc. | Log event summarization for distributed server system |
US9767094B1 (en) | 2016-07-07 | 2017-09-19 | International Business Machines Corporation | User interface for supplementing an answer key of a question answering system using semantically equivalent variants of natural language expressions |
US9910848B2 (en) | 2016-07-07 | 2018-03-06 | International Business Machines Corporation | Generating semantic variants of natural language expressions using type-specific templates |
US9928235B2 (en) | 2016-07-07 | 2018-03-27 | International Business Machines Corporation | Type-specific rule-based generation of semantic variants of natural language expression |
US10474750B1 (en) * | 2017-03-08 | 2019-11-12 | Amazon Technologies, Inc. | Multiple information classes parsing and execution |
CN110928793B (en) * | 2019-11-28 | 2023-07-28 | Oppo广东移动通信有限公司 | A regular expression detection method, device and computer-readable storage medium |
WO2021207936A1 (en) * | 2020-04-14 | 2021-10-21 | 深圳市欢太科技有限公司 | Text matching method and apparatus, electronic device, and storage medium |
US11520831B2 (en) * | 2020-06-09 | 2022-12-06 | Servicenow, Inc. | Accuracy metric for regular expression |
US20220335075A1 (en) * | 2021-04-14 | 2022-10-20 | International Business Machines Corporation | Finding expressions in texts |
CN113792261B (en) * | 2021-09-26 | 2024-12-10 | 东南大学 | A method for constructing an information matrix of the state of electromechanical systems of highway bridges |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040225999A1 (en) * | 2003-05-06 | 2004-11-11 | Andrew Nuss | Grammer for regular expressions |
US20060020937A1 (en) * | 2004-07-21 | 2006-01-26 | Softricity, Inc. | System and method for extraction and creation of application meta-information within a software application repository |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6754650B2 (en) * | 2001-05-08 | 2004-06-22 | International Business Machines Corporation | System and method for regular expression matching using index |
US6842796B2 (en) * | 2001-07-03 | 2005-01-11 | International Business Machines Corporation | Information extraction from documents with regular expression matching |
US7502788B2 (en) * | 2005-11-08 | 2009-03-10 | International Business Machines Corporation | Method for retrieving constant values using regular expressions |
Cited By (73)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050132198A1 (en) * | 2003-12-10 | 2005-06-16 | Ahuja Ratinder P.S. | Document de-registration |
US20050132034A1 (en) * | 2003-12-10 | 2005-06-16 | Iglesia Erik D.L. | Rule parser |
US8762386B2 (en) | 2003-12-10 | 2014-06-24 | Mcafee, Inc. | Method and apparatus for data capture and analysis system |
US9092471B2 (en) | 2003-12-10 | 2015-07-28 | Mcafee, Inc. | Rule parser |
US8656039B2 (en) | 2003-12-10 | 2014-02-18 | Mcafee, Inc. | Rule parser |
US8548170B2 (en) | 2003-12-10 | 2013-10-01 | Mcafee, Inc. | Document de-registration |
US9374225B2 (en) | 2003-12-10 | 2016-06-21 | Mcafee, Inc. | Document de-registration |
US20110208861A1 (en) * | 2004-06-23 | 2011-08-25 | Mcafee, Inc. | Object classification in a capture system |
US20100191732A1 (en) * | 2004-08-23 | 2010-07-29 | Rick Lowe | Database for a capture system |
US8560534B2 (en) | 2004-08-23 | 2013-10-15 | Mcafee, Inc. | Database for a capture system |
US20110167212A1 (en) * | 2004-08-24 | 2011-07-07 | Mcafee, Inc., A Delaware Corporation | File system for a capture system |
US8707008B2 (en) | 2004-08-24 | 2014-04-22 | Mcafee, Inc. | File system for a capture system |
US20110149959A1 (en) * | 2005-08-12 | 2011-06-23 | Mcafee, Inc., A Delaware Corporation | High speed packet capture |
US8730955B2 (en) | 2005-08-12 | 2014-05-20 | Mcafee, Inc. | High speed packet capture |
US20110004599A1 (en) * | 2005-08-31 | 2011-01-06 | Mcafee, Inc. | A system and method for word indexing in a capture system and querying thereof |
US8554774B2 (en) | 2005-08-31 | 2013-10-08 | Mcafee, Inc. | System and method for word indexing in a capture system and querying thereof |
US9552420B2 (en) * | 2005-10-04 | 2017-01-24 | Thomson Reuters Global Resources | Feature engineering and user behavior analysis |
US20100312764A1 (en) * | 2005-10-04 | 2010-12-09 | West Services Inc. | Feature engineering and user behavior analysis |
US10387462B2 (en) | 2005-10-04 | 2019-08-20 | Thomson Reuters Global Resources Unlimited Company | Feature engineering and user behavior analysis |
US8504537B2 (en) | 2006-03-24 | 2013-08-06 | Mcafee, Inc. | Signature distribution in a document registration system |
US9094338B2 (en) | 2006-05-22 | 2015-07-28 | Mcafee, Inc. | Attributes of captured objects in a capture system |
US8683035B2 (en) | 2006-05-22 | 2014-03-25 | Mcafee, Inc. | Attributes of captured objects in a capture system |
US20110197284A1 (en) * | 2006-05-22 | 2011-08-11 | Mcafee, Inc., A Delaware Corporation | Attributes of captured objects in a capture system |
US8601537B2 (en) | 2008-07-10 | 2013-12-03 | Mcafee, Inc. | System and method for data mining and security policy management |
US8635706B2 (en) | 2008-07-10 | 2014-01-21 | Mcafee, Inc. | System and method for data mining and security policy management |
US10367786B2 (en) | 2008-08-12 | 2019-07-30 | Mcafee, Llc | Configuration management for a capture/registration system |
US9253154B2 (en) | 2008-08-12 | 2016-02-02 | Mcafee, Inc. | Configuration management for a capture/registration system |
US8548979B2 (en) * | 2009-01-05 | 2013-10-01 | International Business Machines Corporation | Indexing for regular expressions in text-centric applications |
US8266135B2 (en) * | 2009-01-05 | 2012-09-11 | International Business Machines Corporation | Indexing for regular expressions in text-centric applications |
US20100174718A1 (en) * | 2009-01-05 | 2010-07-08 | International Business Machines Corporation | Indexing for Regular Expressions in Text-Centric Applications |
US8850591B2 (en) | 2009-01-13 | 2014-09-30 | Mcafee, Inc. | System and method for concept building |
US8706709B2 (en) | 2009-01-15 | 2014-04-22 | Mcafee, Inc. | System and method for intelligent term grouping |
US9602548B2 (en) | 2009-02-25 | 2017-03-21 | Mcafee, Inc. | System and method for intelligent state management |
US9195937B2 (en) | 2009-02-25 | 2015-11-24 | Mcafee, Inc. | System and method for intelligent state management |
US9313232B2 (en) | 2009-03-25 | 2016-04-12 | Mcafee, Inc. | System and method for data mining and security policy management |
US8918359B2 (en) | 2009-03-25 | 2014-12-23 | Mcafee, Inc. | System and method for data mining and security policy management |
US8667121B2 (en) | 2009-03-25 | 2014-03-04 | Mcafee, Inc. | System and method for managing data and policies |
US20110093258A1 (en) * | 2009-10-15 | 2011-04-21 | 2167959 Ontario Inc. | System and method for text cleaning |
US8868469B2 (en) | 2009-10-15 | 2014-10-21 | Rogers Communications Inc. | System and method for phrase identification |
US20110093414A1 (en) * | 2009-10-15 | 2011-04-21 | 2167959 Ontario Inc. | System and method for phrase identification |
US8380492B2 (en) | 2009-10-15 | 2013-02-19 | Rogers Communications Inc. | System and method for text cleaning by classifying sentences using numerically represented features |
US20120114119A1 (en) * | 2010-11-04 | 2012-05-10 | Ratinder Paul Singh Ahuja | System and method for protecting specified data combinations |
US10666646B2 (en) | 2010-11-04 | 2020-05-26 | Mcafee, Llc | System and method for protecting specified data combinations |
US20150067810A1 (en) * | 2010-11-04 | 2015-03-05 | Ratinder Paul Singh Ahuja | System and method for protecting specified data combinations |
US10313337B2 (en) | 2010-11-04 | 2019-06-04 | Mcafee, Llc | System and method for protecting specified data combinations |
US8806615B2 (en) * | 2010-11-04 | 2014-08-12 | Mcafee, Inc. | System and method for protecting specified data combinations |
US11316848B2 (en) * | 2010-11-04 | 2022-04-26 | Mcafee, Llc | System and method for protecting specified data combinations |
US9794254B2 (en) * | 2010-11-04 | 2017-10-17 | Mcafee, Inc. | System and method for protecting specified data combinations |
US9430564B2 (en) | 2011-12-27 | 2016-08-30 | Mcafee, Inc. | System and method for providing data protection workflows in a network environment |
US8700561B2 (en) | 2011-12-27 | 2014-04-15 | Mcafee, Inc. | System and method for providing data protection workflows in a network environment |
US9659082B2 (en) * | 2012-08-27 | 2017-05-23 | Microsoft Technology Licensing, Llc | Semantic query language |
US10579656B2 (en) * | 2012-08-27 | 2020-03-03 | Microsoft Technology Licensing, Llc | Semantic query language |
US20140059078A1 (en) * | 2012-08-27 | 2014-02-27 | Microsoft Corporation | Semantic query language |
US20170220673A1 (en) * | 2012-08-27 | 2017-08-03 | Microsoft Technology Licensing, Llc | Semantic query language |
US9235639B2 (en) | 2013-03-28 | 2016-01-12 | Hewlett Packard Enterprise Development Lp | Filter regular expression |
CN106407168A (en) * | 2016-09-06 | 2017-02-15 | 首都师范大学 | Automatic generation method for practical writing |
US11269934B2 (en) | 2018-06-13 | 2022-03-08 | Oracle International Corporation | Regular expression generation using combinatoric longest common subsequence algorithms |
US11347779B2 (en) | 2018-06-13 | 2022-05-31 | Oracle International Corporation | User interface for regular expression generation |
WO2019241416A1 (en) * | 2018-06-13 | 2019-12-19 | Oracle International Corporation | Regular expression generation using longest common subsequence algorithm on regular expression codes |
CN112236763A (en) * | 2018-06-13 | 2021-01-15 | 甲骨文国际公司 | Regular Expression Generation Using Longest Universal Subsequence Algorithm on Regular Expression Code |
US11263247B2 (en) * | 2018-06-13 | 2022-03-01 | Oracle International Corporation | Regular expression generation using longest common subsequence algorithm on spans |
WO2019241428A1 (en) * | 2018-06-13 | 2019-12-19 | Oracle International Corporation | User interface for regular expression generation |
WO2019241425A1 (en) * | 2018-06-13 | 2019-12-19 | Oracle International Corporation | Regular expression generation based on positive and negative pattern matching examples |
CN112236763B (en) * | 2018-06-13 | 2025-04-29 | 甲骨文国际公司 | Regular expression generation using the longest common subsequence algorithm on regular expression code |
US11321368B2 (en) | 2018-06-13 | 2022-05-03 | Oracle International Corporation | Regular expression generation using longest common subsequence algorithm on combinations of regular expression codes |
WO2019241422A1 (en) * | 2018-06-13 | 2019-12-19 | Oracle International Corporation | Regular expression generation using longest common subsequence algorithm on combinations of regular expression codes |
US11354305B2 (en) | 2018-06-13 | 2022-06-07 | Oracle International Corporation | User interface commands for regular expression generation |
US11941018B2 (en) | 2018-06-13 | 2024-03-26 | Oracle International Corporation | Regular expression generation for negative example using context |
US11580166B2 (en) | 2018-06-13 | 2023-02-14 | Oracle International Corporation | Regular expression generation using span highlighting alignment |
US11755630B2 (en) | 2018-06-13 | 2023-09-12 | Oracle International Corporation | Regular expression generation using longest common subsequence algorithm on combinations of regular expression codes |
US11797582B2 (en) | 2018-06-13 | 2023-10-24 | Oracle International Corporation | Regular expression generation based on positive and negative pattern matching examples |
US11321525B2 (en) | 2019-08-23 | 2022-05-03 | Micro Focus Llc | Generation of markup-language script representing identity management rule from natural language-based rule script defining identity management rule |
US11494558B2 (en) | 2020-01-06 | 2022-11-08 | Netiq Corporation | Conversion of script with rule elements to a natural language format |
Also Published As
Publication number | Publication date |
---|---|
US8484238B2 (en) | 2013-07-09 |
US20090070328A1 (en) | 2009-03-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8484238B2 (en) | Automatically generating regular expressions for relaxed matching of text patterns | |
US11113304B2 (en) | Techniques for creating computer generated notes | |
Padró et al. | Freeling 3.0: Towards wider multilinguality | |
CN103548023B (en) | Automatic self-standing user based on body is supported | |
Howard et al. | Automatically mining software-based, semantically-similar words from comment-code mappings | |
KR101139903B1 (en) | Semantic processor for recognition of Whole-Part relations in natural language documents | |
US9846692B2 (en) | Method and system for machine-based extraction and interpretation of textual information | |
Llopis et al. | How to make a natural language interface to query databases accessible to everyone: An example | |
US20160292153A1 (en) | Identification of examples in documents | |
WO2009007181A1 (en) | A method, system and computer program for intelligent text annotation | |
Hill | Integrating natural language and program structure information to improve software search and exploration | |
US10606903B2 (en) | Multi-dimensional query based extraction of polarity-aware content | |
JP3139658B2 (en) | Document display method | |
JP2997469B2 (en) | Natural language understanding method and information retrieval device | |
JP2003323425A (en) | Bilingual dictionary creation device, translation device, bilingual dictionary creation program, and translation program | |
Tiwari et al. | Mold-a framework for entity extraction and summarization | |
WO2020026229A2 (en) | Proposition identification in natural language and usage thereof | |
CN112836477B (en) | Method and device for generating code annotation document, electronic equipment and storage medium | |
Love | Benchmarking the performance of Two Automated Term-extraction systems: LOGOS and ATAO | |
Pasquale | Automatic generation of a navigation tree for conversational web browsing | |
WO2001024053A9 (en) | System and method for automatic context creation for electronic documents | |
JP2001034630A (en) | Document-based search system and method | |
Plant et al. | A natural language help system shell through functional programming | |
JP2007213157A (en) | Example sentence search device and example sentence search method | |
Balcha et al. | Design and Development of Sentence Parser for Afan Oromo Language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LOESER, ALEXANDER STEPHAN;RAGHAVEN, SRIRAM;VAITHYANATHAN, SHIVAKUMAR;REEL/FRAME:019791/0695;SIGNING DATES FROM 20070829 TO 20070831 |
 | AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE THE LAST NAME OF THE SECOND ASSIGNOR PREVIOUSLY RECORDED ON REEL 019791 FRAME 0695;ASSIGNORS:LOESER, ALEXANDER STEPHAN;RAGSHAVAN, SRIRAM;VAITHYANATHAN, SHIVAKUMAR;REEL/FRAME:019928/0045;SIGNING DATES FROM 20070829 TO 20070831 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |