US20160283607A1 - Search engine - Google Patents
- Publication number
- US20160283607A1 (application US 15/143,402)
- Authority
- US
- United States
- Prior art keywords
- search
- data
- result
- results
- search engine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
- G06F16/84: Mapping; Conversion (semi-structured data, e.g. markup language structured data such as SGML, XML or HTML)
- G06F17/30914
- G06F16/3332: Query translation (unstructured textual data; querying; query processing)
- G06F16/3338: Query expansion (unstructured textual data; querying; query processing; query translation)
- G06F16/338: Presentation of query results (unstructured textual data; querying)
- G06F16/83: Querying (semi-structured data)
- G06F16/835: Query processing (semi-structured data; querying)
- G06F16/951: Indexing; Web crawling techniques (retrieval from the web)
- G06F16/9538: Presentation of query results (retrieval from the web; querying, e.g. by the use of web search engines)
- G06F17/30672
- G06F17/30864
- G06F17/30923
Definitions
- the invention relates to an improved search engine. More particularly, the present invention relates to an improved search engine for creating a search query for retrieving stored information from a file index or remote data source and also relates to an improved de-duplication process for removing duplicate entries from received search results.
- a search engine is an information retrieval system that allows users of a computer system to specify criteria about an item of interest, “search terms”, and to have the search engine find the matching items.
- A search query is typically expressed as a set of words.
- a search engine will typically collect metadata about the group of items beforehand in a process referred to as indexing.
- An index typically requires a smaller amount of computer storage and provides a basis for the search engine to calculate the item relevance.
- Desktop search is the name given to search tools which search the content of a user's hard drive rather than the Internet. Such tools may find information including web browser histories, email archives, text documents, sound files etc. Such search tools can be extremely fast but may not search the entire hard drive. For example, only operating system specific applications may be searched (e.g. Microsoft documents, folders) and information contained in email or contact databases may not be included.
- Desktop search engines build and maintain an index database to optimise search performance. Indexing takes place when the computer is idle and the search engine generally collects information relating to file/directory names; metadata such as titles, authors etc; and, the content of the supported data items/documents.
- An example of a desktop search tool is “Windows Search”, an indexed desktop search platform released by Microsoft for the Windows operating platform.
- Web search engines provide an interface to search for information on the Internet.
- a web search engine works by storing information about a large number of web pages which are retrieved by a web crawler, an automated web browser that follows every link it sees. The contents of each page are then indexed and stored in an index database for use in later queries.
- the web search engine examines its index and provides a listing of best matching web pages according to its criteria.
- Most search engines support Boolean operators AND, OR and NOT to further specify the search and some engines provide a proximity search which allows users to specify the distance between keywords.
- Given the current size and speed of growth of the Internet, it is important that the initial search query is relevant in order that relevant search results are returned.
- The usefulness of a search engine also depends on the relevance of the result set it gives back, and one of the main problems with current search engines is the tendency of the results set to contain duplicate search results.
- De-duplication of search results is currently handled by means of hash algorithms in which each chunk of data is processed by a hash algorithm thereby generating a unique number which is stored in an index.
- When a piece of data receives a hash number, that number is compared with the index of other existing hash numbers. If the hash number is already in the index then the piece of data is considered a duplicate and is not stored. Otherwise the new hash number is added to the index and the new data is stored.
- the hash algorithm may produce the same hash number for two different chunks of data. When such a hash collision occurs, the system will not store the new data because it sees its hash number already exists in the data index. Such false positives can result in lost data.
- hash algorithms are complex.
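The hash-based scheme and its collision problem can be illustrated with a short Python sketch. The `weak_hash` function below is a deliberately tiny, hypothetical hash chosen only so that a collision is easy to reproduce; it is not an algorithm named in the text, and real systems use much stronger hashes with far rarer collisions.

```python
def weak_hash(chunk: bytes, buckets: int = 16) -> int:
    """Hypothetical, deliberately weak hash: sum of byte values modulo `buckets`."""
    return sum(chunk) % buckets


def dedupe_by_hash(chunks):
    """Store a chunk only if its hash number is not already in the index."""
    index = set()   # hash numbers seen so far
    stored = []     # chunks that were actually kept
    for chunk in chunks:
        h = weak_hash(chunk)
        if h in index:
            # Treated as a duplicate, even when the content differs (collision).
            continue
        index.add(h)
        stored.append(chunk)
    return stored


chunks = [b"ab", b"ba", b"ab", b"zz"]
kept = dedupe_by_hash(chunks)
```

Here `b"ab"` and `b"ba"` are different chunks that receive the same hash number, so `b"ba"` is discarded as a false positive, which is the lost-data scenario described above.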
- search engines index and search unstructured data resources. This therefore leaves large amounts of data that is tied into structured data stores, e.g. databases, that cannot be accessed by the traditional search engine. If the structured data is indexed separately then this index may be made available to the search engine but this creates a further data store for data which is already indexed within its own structure.
- the present invention provides a search engine for generating an improved search query, the engine comprising: input means for receiving a search request, the search request comprising N search terms; processing means arranged to formulate a search query from the received search request; output means arranged to output the search query wherein the processing means is arranged to formulate the search query by generating a plurality of search strings, each search string comprising a different combination of a subset of the N search terms.
- the first aspect of the present invention provides a search engine that is arranged to take a search request that has been entered by a user and to re-arrange and mix up the terms contained within the request in order to form a search query (a search query generation engine). It is noted that by splitting up the input search phrase in this way into smaller groupings the search that is performed by the search engine is more likely to return an accurate set of results.
- the search query may be output to a local file index, for example if the search engine is part of a desktop PC, or may be output to a server based file index or a remote database.
- each search string comprises a different combination of M search terms, where N>M.
- the number of search terms in each of the plurality of search strings may vary between 1 and M.
- the search query that is created by the search engine may contain any number of search terms from the input search request.
- the processing means may be arranged to formulate the search query by determining the combinations $^{N}C_{M}$ (N choose M) of search terms, each combination being a different search string. It is noted that this option would ensure that all possible combinations of input search terms are formulated into search queries.
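As a minimal sketch of the N-choose-M option, the Python standard library's `itertools.combinations` enumerates every M-term subset of the N input terms. The function names are illustrative assumptions, and the second helper covers the variant in which the number of terms per search string varies between 1 and M.

```python
from itertools import combinations


def search_strings(terms, m):
    """Return every combination of exactly m terms, preserving input order."""
    return [" ".join(combo) for combo in combinations(terms, m)]


def search_strings_up_to(terms, m):
    """Return combinations of every size from 1 to m."""
    out = []
    for size in range(1, m + 1):
        out.extend(search_strings(terms, size))
    return out


terms = ["European", "space", "agency"]
pairs = search_strings(terms, 2)
# pairs == ["European space", "European agency", "space agency"]
```

For a six-term request and M = 4 this yields 15 search strings, so exhaustive enumeration grows quickly as N increases.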
- the processing means may be arranged to formulate the search query based upon predetermined arrangements of search terms for a search request comprising N terms.
- the search engine may be pre-programmed with a generic format for the search queries for an input search request of a given length N.
- the search engine may receive a search request comprising 6 search terms. It may then look up a number of pre-programmed arrangements of those search terms that should be used whenever the input search request comprises 6 terms. So, one arrangement may comprise terms 1, 3, 5 and 6, another may contain 2, 4 and 6 and so on.
- the processing means may be arranged to assign a different identifier to each search term received in the search request and the predetermined arrangements may comprise different arrangements and selections of identifiers.
- the processing means may be arranged to group two or more search terms together and assign the grouped terms to a single identifier.
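The predetermined-arrangement approach can be sketched in Python as a lookup table keyed by request length. The `ARRANGEMENTS` table is hypothetical: its two six-term entries echo the example above (terms 1, 3, 5, 6 and terms 2, 4, 6), but the full stored set is not given in the text.

```python
# Hypothetical stored arrangements, keyed by the number of terms N in the request.
# Each arrangement is a tuple of 1-based term identifiers.
ARRANGEMENTS = {
    6: [(1, 3, 5, 6), (2, 4, 6)],
}


def number_terms(terms):
    """Assign each search term its own numerical identifier, starting at 1."""
    return {i: term for i, term in enumerate(terms, start=1)}


def apply_arrangements(terms):
    """Build one search string per stored arrangement for this request length."""
    numbered = number_terms(terms)
    patterns = ARRANGEMENTS.get(len(terms), [])
    return [" ".join(numbered[i] for i in pattern) for pattern in patterns]


request = ["BBC", "News", "Archive", "European", "space", "agency"]
strings = apply_arrangements(request)
# strings == ["BBC Archive space agency", "News European agency"]
```

Because the arrangements refer to identifiers rather than words, the same table serves any six-term request; a grouped phrase would simply occupy one identifier.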
- the processing means may be arranged to perform a synonym lookup on the received search terms and additionally may be arranged to substitute any search terms with their synonyms to form additional search strings for inclusion in the search query.
- the output means may be arranged to output the search query to a file index comprising an index database of file information and the processing means may be arranged to generate the search query in the form of an SQL query.
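One way several search strings might be folded into a single SQL query is sketched below in Python using the standard `sqlite3` module. The table name `file_index`, the column names and the use of `LIKE` clauses are assumptions for illustration only; the query format actually shown in the patent figures is not reproduced here.

```python
import sqlite3


def build_query(search_strings, table="file_index", column="content"):
    """Build one parameterised query: terms are ANDed within a search string,
    and the search strings are ORed together. Names are hypothetical."""
    if not search_strings:
        raise ValueError("at least one search string is required")
    clauses, params = [], []
    for s in search_strings:
        words = s.split()
        clauses.append("(" + " AND ".join(f"{column} LIKE ?" for _ in words) + ")")
        params.extend(f"%{w}%" for w in words)
    sql = f"SELECT name FROM {table} WHERE " + " OR ".join(clauses)
    return sql, params


# Small in-memory index used only to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_index (name TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO file_index VALUES (?, ?)",
    [
        ("a.txt", "European space agency press release"),
        ("b.txt", "BBC news archive index"),
        ("c.txt", "shopping list"),
    ],
)
sql, params = build_query(["European agency", "BBC archive"])
rows = sorted(row[0] for row in conn.execute(sql, params))
```

Search term values are passed as bound parameters rather than concatenated into the SQL text, which avoids injection through user-entered terms.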
- the output means is arranged to output the search query to a structured data source.
- the search engine further comprises a configuration file for the structured data source, the configuration file having connectivity information relating to the structured data source to allow the processing means to generate the search query.
- the search engine of the first aspect of the invention may also be conveniently combined with a de-duplication engine which is arranged to remove duplicate search results that are returned as a result of the generated search query.
- the de-duplication engine is arranged to output a results set to a user and comprises: search result input means arranged to receive a plurality of search results, each search result comprising one or more data items; de-duplication means arranged to remove duplicate search results from the plurality of search results received at the search result input means and to generate a results set; a stored data set comprising one or more data categories and one or more data items distributed throughout the one or more data categories; results set output means arranged to output the results set wherein the de-duplication means is arranged to compare the stored data set to the search results and to determine if a first search result is a duplicate result based on the following criteria: a) if the content of all of the one or more data categories in the stored data set are determined to match data items in the first search result then the first search result is determined to be
- the present invention provides a search engine for outputting a results set to a user, the search engine comprising: input means arranged to receive a plurality of search results, each search result comprising one or more data items; processing means arranged to remove duplicate search results from the plurality of search results received at the input means and to generate a results set; a stored data set comprising one or more data categories and one or more data items distributed throughout the one or more data categories; output means arranged to output the results set wherein the processing means is arranged to compare the stored data set to the search results and to determine if a first search result is a duplicate result based on the following criteria: a) if the content of all of the one or more data categories in the stored data set are determined to match data items in the first search result then the first search result is determined to be a duplicate and is discarded; or b) if the content of at least one of the data categories in the stored data set is determined not to match any data items in the first search result then the first search result is added to the results
- the second aspect of the present invention provides a mechanism for mechanically removing duplicate results and outputting a results set to a user (also referred to herein as a de-duplication engine).
- Received search results are compared against a stored data set that comprises one or more data categories. Where the content in the stored data set matches all content items in a given search result that result is deemed a duplicate and is discarded. Where at least some of the content in the stored data set does not match the search result then that search result is deemed to be a new search result and is added to the results set.
- An output means (results set output means) is then arranged to output the results set (e.g. to a user).
- the second aspect of the present invention therefore provides a way of improving the set of search results returned to a user of a search engine.
- the stored data set comprises an array of XML fields (“XML nodes”) and the received search results are either in an XML format or are transformed into an XML format.
- the content of each XML field in the stored array can then be compared to the received search result to determine (i) if the search result comprises that field and (ii) if it does comprise that field, whether the value of that field matches what is already in the stored array.
- the processing means may be arranged to add the one or more data items of the first search result to the stored data set. This ensures that as further search results are received they are compared against the data items of all the previous search results to have been received thereby improving the efficiency of the system.
- the processing means may be arranged to take each of the plurality of search results in turn and compare them to the stored set of data items in order to determine if each search result is a duplicate result based on the criteria (a) and (b) above.
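Criteria (a) and (b) can be sketched in Python as follows. This is one possible reading, under stated assumptions: each search result is a dictionary mapping a data category to a data item, the data categories are predefined, and the stored data set keeps one list of previously seen items per category. The function and variable names are hypothetical.

```python
def dedupe(results, categories):
    """Return the results set after removing duplicates.

    results    -- list of dicts mapping data category -> data item
    categories -- the data categories held in the stored data set
    """
    if not categories:
        raise ValueError("at least one data category is required")
    stored = {cat: [] for cat in categories}  # stored data set, one array per category
    results_set = []
    for result in results:
        # (a) duplicate only if every stored category matches an item in the result
        is_duplicate = all(
            cat in result and result[cat] in stored[cat] for cat in categories
        )
        if is_duplicate:
            continue
        # (b) at least one category did not match: keep it and remember its items
        results_set.append(result)
        for cat in categories:
            if cat in result:
                stored[cat].append(result[cat])
    return results_set


results = [
    {"Address1": "1 High St", "Postcode": "AB1 2CD", "ContactName": "A. Smith"},
    {"Address1": "1 High St", "Postcode": "AB1 2CD", "ContactName": "A. Smith"},
    {"Address1": "2 Low Rd", "Postcode": "AB1 2CD", "ContactName": "A. Smith"},
    {"Postcode": "AB1 2CD", "ContactName": "A. Smith"},
]
unique = dedupe(results, ["Address1", "Postcode", "ContactName"])
```

The second result is discarded as a duplicate; the third differs in one category and the fourth lacks a category altogether, so both are kept. Note that under this reading each category is matched independently, so a result that mixes items from different earlier results would also be treated as a duplicate.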
- the search results may be received in a format that the search engine can process.
- the processing means may be arranged to transform each search result received at the input means into a structured data format comprising one or more data categories, each of the one or more data categories containing a data item.
- the search results may be transformed into an XML format wherein each data category is an XML field (e.g. an address field) and each data item is the value of that field (e.g. an actual street name or postcode).
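A minimal Python sketch of this transformation, using the standard `xml.etree.ElementTree` module, is shown below. The root tag `result` and the sample values are assumptions, and category names are assumed to be valid XML element names (no spaces, not starting with a digit).

```python
import xml.etree.ElementTree as ET


def to_xml(result, root_tag="result"):
    """Turn a mapping of data category -> data item into an XML string,
    with one XML field per data category."""
    root = ET.Element(root_tag)
    for category, item in result.items():
        ET.SubElement(root, category).text = str(item)
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml({"Address1": "1 High Street", "Postcode": "AB1 2CD"})
# xml_text == "<result><Address1>1 High Street</Address1><Postcode>AB1 2CD</Postcode></result>"
```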
- the plurality of search results may be received at the input means (search result input means) in the form of a structured data format comprising one or more data categories, each of the one or more data categories comprising a data item.
- the stored data may be stored in the structured data set format.
- the stored data set may be an array of XML fields.
- the stored data set may be in the form of an array table comprising one or more arrays, each array being associated with a data category and the one or more data items being distributed throughout the one or more arrays.
- the array table may be comprised of a number of arrays (or XML fields).
- the processing means may be arranged to determine if the first search result comprises all of the data categories in the array table and further the processing means may be arranged to add the first result to the results set if any of the data categories in the array table are not present in the first search result.
- the processing means may be arranged to compare the content of each data category in the array table against the content of the corresponding data category in the first search result in order to determine if the first search result is a duplicate result.
- the first search result may be determined to be a duplicate result and may be discarded.
- the array table and search result may contain exactly the same data categories, e.g. the array table may comprise three categories, Address 1, Postcode and Contact Name, which are also present in the search result.
- the search result may be determined to be duplicate.
- the first search result may be determined to be a new result (i.e. not a duplicate result) and may be added to the results set.
- the processing means may be arranged to take each of the plurality of search results in turn and compare them to the array table in order to determine if each search result is a duplicate result.
- search results may be received or transformed into an XML format comprising one or more XML nodes, each of the one or more XML nodes corresponding to a data category.
- the stored data set may initially be empty and may be populated with data items common to the first two search results received at the input means. This conveniently allows the search engine to define appropriate data categories on a search-by-search basis.
- the stored data set may initially be empty but may comprise pre-defined data categories.
- the array table may effectively be predefined before the search is initiated.
- the search engine may further comprise a configuration file defining a plurality of data categories, the one or more data categories in the stored data set being chosen from the plurality of data categories in the configuration file.
- the present invention provides a search engine for generating a search query and for processing received search results comprising: a search engine for generating an improved search query according to the first aspect of the present invention; and a search engine for outputting a results set to a user according to the second aspect of the present invention.
- the third aspect of the present invention conveniently provides a search engine that comprises the features of both the first and second aspects of the invention. It is noted that the third aspect of the invention may comprise preferred/optional features of the first and/or second aspects of the invention.
- the present invention provides a method of mapping a data structure of a data source to a predetermined set of data labels, the method comprising: displaying a set of predetermined data labels; displaying the data structure of the data source, the data structure of the data source comprising a plurality of data field names; comparing the set of data labels to the data structure of the data source and identifying data field names that correspond to data labels in the set of predetermined data labels; storing the relationship identified in the comparing step.
- the search engine may construct a SQL query for a database.
- the fourth aspect of the present invention conveniently provides a method of mapping the data structure of a data source (e.g. a database) to a predetermined set of data labels (e.g. the data labels of the search engine).
- the relationships identified in the comparing step may be stored in an XML format.
- the data source may be a database and the relationship may be stored with login information for the database.
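The comparing and storing steps of the mapping method might look like the following Python sketch. Matching labels to field names by a normalised spelling is an assumption made for illustration (the text does not say how correspondence is identified), and the XML layout with `mapping` and `map` elements is likewise hypothetical. Login information is deliberately left out of the sketch.

```python
import xml.etree.ElementTree as ET


def normalise(name):
    """Lower-case a name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def map_labels(labels, field_names):
    """Pair each predetermined data label with the data field name it corresponds to."""
    by_key = {normalise(field): field for field in field_names}
    return {
        label: by_key[normalise(label)]
        for label in labels
        if normalise(label) in by_key
    }


def mapping_to_xml(mapping):
    """Store the identified relationships in a simple XML document."""
    root = ET.Element("mapping")
    for label, field in mapping.items():
        ET.SubElement(root, "map", {"label": label, "field": field})
    return ET.tostring(root, encoding="unicode")


labels = ["Author", "Title", "Customer Number"]
fields = ["AUTHOR", "customer_number", "created_at"]
mapping = map_labels(labels, fields)
stored_xml = mapping_to_xml(mapping)
```

Labels with no corresponding field (here `Title`) are simply left unmapped, so a user could still assign them by hand.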
- the invention also extends to a method of generating a search query and a method of outputting a results set to a user.
- the present invention also extends to a computer program when embodied on a record medium/read-only memory/electrical carrier signal or stored in a computer memory, the computer program comprising program instructions for causing a computer to perform the process of the method of the present invention.
- FIG. 1a is a schematic of a desktop version of a search engine in accordance with an embodiment of the present invention.
- FIG. 1b is a schematic of a server based version of a search engine in accordance with an embodiment of the present invention.
- FIG. 2 is an alternative schematic of a search engine in accordance with an embodiment of the present invention.
- FIG. 3 is a flow chart showing how a search query is created and search results are processed on a desktop computer in accordance with an embodiment of the present invention.
- FIG. 4 is a flow chart showing how a search query is created and search results are processed on a server computer in accordance with an embodiment of the present invention.
- FIG. 5a illustrates how combinations of search terms used to construct a search query are created in accordance with an embodiment of the present invention.
- FIG. 5b shows an alternative to FIG. 5a.
- FIG. 6 shows an example of an SQL search query constructed from the combinations of FIG. 5.
- FIG. 7 shows how search results may be transformed into XML format.
- FIGS. 8, 9a and 10 show how duplicate search results are removed from the results received by the search engine in accordance with an embodiment of the present invention.
- FIG. 9b is a flow chart that shows an alternative technique (to that of FIG. 9a) for processing the search results in accordance with an embodiment of the present invention.
- FIG. 11 shows an example of a configuration file that may be used in accordance with an embodiment of the present invention to construct a SQL query.
- FIG. 12 is a screenshot showing how features of a database may be mapped to features of a search engine in accordance with an embodiment of the present invention.
- the computer 1 comprises a data store 3 (e.g. a hard drive) upon which are stored items of information that a user may wish to search through and an index file database 5 which contains data such as file names, metadata and content information relating to the items of interest in the data store.
- A search engine 7 in accordance with an embodiment of the present invention is also provided in the computer 1.
- The search engine has an input means 9 through which a user may input a number of search terms (for example via entry of text on a keyboard), a processing means 10 for processing received search requests into search queries or de-duplicating received search results (received from, for example, the database 5) and an output means 11 through which a set of search results may be output for display, for example, on a display screen 13 of the computer 1.
- Upon receiving a search request from a user the search engine 7 communicates with the index database 5, as described in more detail with reference to FIG. 3 below.
- FIG. 1b shows a version of the present invention in which a search engine according to an embodiment of the present invention is located on a server computer 15. It is noted that like numerals are used to denote like features between FIGS. 1a and 1b.
- The server 15 is connected via a telecommunications network 17 (for example the Internet or a local area network) to two desktop computers 1, 1′ and two other external data sources (in this case databases 19, 21).
- Each desktop 1, 1′ comprises a data store 3, 3′ and a file index database 5, 5′.
- the server 15 comprises a search engine in accordance with an embodiment of the present invention, a local file index 23 and a local data store 25 .
- The server 15 may receive a search request from any connected computer resource 1, 1′ via the input/outputs 27.
- A search query may then be constructed, as described in more detail with respect to FIG. 4 below, and sent to the file indexes 5, 5′, 23 and databases 19, 21.
- Although the desktop version of the invention is shown connected to a file index database only, it is noted that the search engine could also access an external database in a manner similar to that shown in FIG. 1b.
- a desktop version searches a file index only and a server version of the search engine searches local and remote file indexes as well as external data sources.
- FIG. 2 shows an alternative schematic for a search engine in accordance with an embodiment of the present invention.
- the search engine 7 is in communication with documents, databases and other applications.
- the search engine comprises a number of components such as: application connectors 29 for facilitating the interface between the search engine and the applications; secure file indexes 31 ; and modules for controlling user interaction 33 and access management 35 .
- the search engine may be embodied on a desktop or may be accessed via a web browser.
- In FIGS. 1a, 1b and 2 the processing means 10 is shown as a single component, capable of both generating search queries and de-duplicating received search results. It is however to be appreciated that there may be more than one processing means, for example one to generate search queries and one to de-duplicate received results.
- references below to the search engine 7 generating search queries/search term combinations or de-duplicating results into a results set should be understood as meaning the processing means 10 (either single or multiple processing components) within search engine 7 as performing these functions.
- FIG. 3 is a flow chart showing an overview of the search process at a desktop computer 1 comprising a search engine 7 in accordance with an embodiment of the present invention such as that shown in FIG. 1a.
- the search engine receives via the input means 9 a search request from a user.
- a search request would normally comprise a plurality of text-based search terms (e.g. BBC News Archive European space agency).
- Upon receipt of the search terms the search engine 7 creates a number of combinations of a subset of the search terms. For example, in the search request example above there are six separate terms. In Step 32, therefore, the search engine may create a number of groups of up to, for example, four words. The creation of such combinations of search terms increases the chances of returning a true set of results. The creation of combinations of search terms is described in greater detail in relation to FIG. 5 below.
- In Step 34 the combinations of search terms are combined into a single search query (e.g. in basic SQL format) which can then be sent to the file index database 5.
- An example of a typical SQL query is shown in FIG. 6a.
- Steps 30 to 34 relate to the generation of an improved search query in accordance with the first aspect of the present invention.
- Steps 36, 38 and 40 (below dotted line 35 in FIG. 3) relate to the outputting of a results set to the user in accordance with the second aspect of the present invention.
- The search engine 7 receives a plurality of search results from the file index 5. These results may be received in XML format or may be transformed into XML format by the search engine as described in relation to FIG. 7 below.
- In Step 38 the various elements (or nodes) of the XML results are compared against stored nodes in order to remove duplicate search results. This process is described in detail with reference to FIGS. 8 to 10 below. Unique search results are incorporated into a results set during Step 38 and are then output (at Step 40) to the user.
- FIG. 4 is a flow chart showing a similar process to FIG. 3 except in this case the search engine 7 is located on a server 15 and may access data held in an external database 19, 21. Process steps that are the same as FIG. 3 are referred to using the same reference numerals. It is noted that the file index referred to in FIG. 4 may either be a local index 23 on a server or may be a file index 5 held on a remotely located user computer.
- Upon receipt of the user input search terms the search engine in FIG. 4 creates search term combinations (Step 32), creates a search query (Step 34), receives search results (Step 36), de-duplicates the results (Step 38) and outputs a result set (Step 40) as per FIG. 3.
- Step 30 additionally causes, in Step 42, the search engine 7 to decrypt a database configuration file which stores details of the external database 19, 21 that the search engine can access in order that the search engine can create a database query for the database (in addition to the SQL query for the file index in Step 34).
- A database reports SQL is created in Step 44 and sent to the database (an example SQL is shown in FIG. 6).
- the search engine then moves to Step 36 and the database results are combined with those from the file index and processed in the same way.
- FIG. 5a shows the process of creating various search term combinations in more detail compared to Step 32 shown in FIGS. 3 and 4.
- Each box of the process depicted in FIG. 5a is a worked example based on the previously mentioned search request of “BBC News Archive European space agency”.
- The user initially, at Step 30, enters his search terms into a display window of the display 13 of his computer.
- The search engine then, in Step 46, places the search terms into a numbered array such that each search term is associated with its own numerical identifier (in this case “1” through “6”).
- A stored arrangement of search phrases for an input string of six search terms is then retrieved by the search engine in Step 48 and the current search terms are arranged as per the stored arrangement in Step 50.
- The output of Step 50 is nine separate combinations of search terms (nine different search strings) that can be placed into a search query (Step 34 in FIGS. 3 and 4).
- search engine 7 may also perform a synonym lookup and create further search combinations based on the lookup results.
- search combinations 1 and 3 comprise the term “BBC” which can be expanded into “British Broadcasting Corporation” which therefore yields a further two combinations of search terms (two further search strings) which can then be placed into a search query (as per Step 34 in FIGS. 3 and 4 ).
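Steps 46 to 50 and the synonym lookup can be sketched as follows. This is a minimal illustration only: the arrangement in `STORED_ARRANGEMENTS` is hypothetical (the actual nine-string arrangement is the one depicted in FIG. 5a and is not reproduced here), and the `SYNONYMS` table and all function names are assumptions made for the example.

```python
# Hypothetical stored arrangement for a six-term request: each tuple lists
# the 1-based identifiers of the terms that form one search string.
# The arrangement actually used is the one depicted in FIG. 5a.
STORED_ARRANGEMENTS = {
    6: [(1, 2, 3), (4, 5, 6), (1, 4, 5, 6), (2, 3)],
}

# Hypothetical synonym table used for the lookup.
SYNONYMS = {"BBC": "British Broadcasting Corporation"}


def number_terms(request):
    """Step 46: associate each search term with a numerical identifier."""
    return {i: term for i, term in enumerate(request.split(), start=1)}


def arrange(numbered):
    """Steps 48-50: apply the stored arrangement for this number of terms."""
    arrangement = STORED_ARRANGEMENTS[len(numbered)]
    return [" ".join(numbered[i] for i in ids) for ids in arrangement]


def expand_synonyms(strings):
    """Append one further search string per string containing a known term."""
    extra = []
    for s in strings:
        words = s.split()
        for term, expansion in SYNONYMS.items():
            if term in words:
                extra.append(" ".join(expansion if w == term else w for w in words))
    return strings + extra


numbered = number_terms("BBC News Archive European Space agency")
combinations = expand_synonyms(arrange(numbered))
```

With this hypothetical arrangement, two of the four arranged strings contain “BBC”, so the synonym lookup contributes two further search strings, mirroring the behaviour described for combinations 1 and 3.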
- The system may also handle complex search query combinations, for example:
- The search engine would not arrange the words “BBC News Archive” (and would treat them as a single search term) but would arrange the words “European”, “space” and “agency”.
- This variation of the invention, along with a further variation in which alternate terms are selected, is shown in FIG. 5b.
- FIG. 6 shows an example of an SQL query that may be constructed from the eleven search combinations depicted in FIG. 5a.
- The search language used when entering search requests may be configurable based on the user's preferences.
- The search language may be:
- The search criteria (which are shown in italics) may be changed to suit a specific organisation's data structures, e.g. AUTHOR may be replaced with CUSTOMERNUMBER.
- The output from the SQL query that is sent to the file index/external database is an SQL return.
- FIG. 7 is an example of the format of such a return for two results.
- The return may be in an XML compatible format but, in the event that it is not in XML format, may be transformed into XML format in one of the following two ways:
- FIG. 8 shows three XML search results that have been received by a search engine in accordance with an embodiment of the present invention.
- Each XML result comprises a number of classes or nodes, each of which contains (or can contain) search data. It is noted that the XML results of FIG. 8 are an example of the general format shown in FIG. 7 .
- The nodes are “reference”, “Company_name”, “creditLimit”, “lastInvoice”, “outstanding”, “60 Days” and “90 Days”.
- Each of these nodes is associated with various search data, e.g. the “creditLimit” node contains the value “100,000” and the “outstanding” node contains the value “3000.00”.
- In Step 36 of FIG. 9a, XML search results are received. It is noted that the results may be transformed into XML format as described above in relation to FIG. 7.
- In Step 60 the first two XML results are selected and, at Step 62, a common class is identified.
- The nodes “reference” and “Company_name” are common to the two results and the search engine takes these two nodes and creates two arrays holding the values of these common nodes.
- The search engine 7 then chooses, in Step 64, up to three further node identifiers at random from those contained in Result 1 and Result 2 and creates further arrays for these nodes. It is also noted that the search engine may hold a configuration file (as described in relation to FIGS. 11 and 12 below) that details the various node identifiers that are present in the file index/external database and the further node identifiers may be selected from this configuration file. In any case it is noted that the nodes that are present in Results 1 and 2 will be present in any such configuration file.
- The outcome of Steps 60, 62 and 64 is the array table 100 shown in FIG. 10. It can be seen in this example that the search engine has chosen the node “Address2” from XML Result 2, node “postcode” from XML Result 2 and node “creditLimit” from XML Result 1. It is noted that the individual search data contained in these nodes in Results 1 and 2 has been added to the array table 100 (Step 66).
- The number of nodes compared in the de-duplication process may be altered depending on the set up of the search engine or depending on the context of the search. For example, if emails are being searched then the system may be set up such that only a few items of meta-data are searched (e.g. “From”, “Date”, “Subject”, “Id” fields), or possibly only a single meta-data item (e.g. “Id”). By contrast, if an accounting application is being searched then more fields may be appropriate (e.g. “customer”, “supplier”, “outstanding amounts”, “current address”, “last order” etc.).
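The node selection of Steps 60 to 64 can be sketched as follows, assuming each XML result has already been parsed into a dictionary of node names to values. The function name, the `extra` and `rng` parameters, and all sample values other than the “creditLimit” and “outstanding” figures quoted above are hypothetical.

```python
import random


def choose_nodes(result_1, result_2, extra=3, rng=random):
    """Steps 60-64: pick the nodes whose values the array table will hold."""
    # Step 62: nodes present in both of the first two results.
    common = [name for name in result_1 if name in result_2]
    # Step 64: up to `extra` further node identifiers chosen at random
    # from the remaining nodes of either result.
    remaining = [n for n in dict.fromkeys([*result_1, *result_2]) if n not in common]
    further = rng.sample(remaining, min(extra, len(remaining)))
    return common + further


# Hypothetical parsed results (address and postcode values are made up).
result_1 = {"reference": "101", "Company_name": "Example Ltd",
            "creditLimit": "100,000", "outstanding": "3000.00"}
result_2 = {"reference": "101", "Company_name": "Example Ltd",
            "Address2": "High Street", "postcode": "AB1 2CD"}

# A seeded generator makes the random choice repeatable for the example.
nodes = choose_nodes(result_1, result_2, rng=random.Random(0))
```

Passing a smaller `extra` (or a fixed list taken from a configuration file instead of a random sample) corresponds to altering the number of nodes compared, as described for the email and accounting examples.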
- The search engine then begins the process of checking each search result against the array table to determine if the result is a duplicate entry.
- In Step 68 the search engine takes XML result 2 and, in Step 70, checks its nodes against those contained in the array table 100. Where the search data held in a node matches the values in the array table, the search engine designates this as “true”. Where the search data held in a node does not match the values in the array table, the search engine designates this as “false”.
- The checking function of Step 70 is represented by the array table 102 in FIG. 10 in which it can be seen that the search data of XML result 2 has been matched to Arrays 1 and 2 but not Arrays 3 to 5.
- If the search engine determines that the XML result comprises matching search data for each node in the array table, the result is deemed to be a duplicate and, in Step 72, is discarded.
- If the search engine determines that the XML result does not comprise matching search data for each node in the array table, the result is added to a Results set, in Step 74.
- The search engine then moves to the next search result (in Step 76) and cycles back to Step 70 to check the search data held in the nodes of the next search result.
- The search data contained within Result 2 is added to the array table. This is shown in array table 104 in FIG. 10 in which the additional search data from Result 2 is shown in italic font and the search data from Step 62 is shown in bold font.
- Since Result 3 is not a duplicate result its search data is added to the array table as previously described above with reference to Result 2.
- The resultant array table 108 can be seen in FIG. 10 where, in addition to the search data from earlier array table 104, the table now comprises the values “123” in the reference node and “Total Information Access Ltd” in the Company_name node.
- FIG. 9 b is a flow chart that shows an alternative technique (to that of FIG. 9 a ) for processing the search results in accordance with an embodiment of the present invention.
- The nodes contained within the array table are predefined which means that Steps 60 to 66 shown in FIG. 9a are not required.
- The XML search results are received in Step 36.
- In Step 78 the values contained within XML result 1 are added to the array table (in so far as there are common nodes). As XML result 1 is the first result it cannot be a duplicate and so in Step 80 is sent to the user.
- In Step 82 the next result in the received search results is selected and in Step 84 is checked against the values held in the array table. It is noted that this checking step is the same as described above in relation to FIG. 10.
- If all the arrays within the array table return a “true” designation, the result in question is deemed a duplicate and is discarded in Step 86.
- The search engine 7 then cycles round to Step 82 and selects the next result.
- If the result is not a duplicate, in Step 88 the values of the current result are added to the array table and in Step 90 the result is forwarded to the user.
- The search engine then cycles round to Step 82 and the next result is selected.
- FIG. 9b therefore shows an embodiment where the array table has been predefined, e.g. by the user, and also shows an embodiment where results are returned to the user as they are processed.
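The loop of Steps 78 to 90 can be sketched as a generator, assuming each result has been parsed into a dictionary of node names to values and that the predefined nodes are supplied as a list. The function and variable names are hypothetical; this is an illustration of the described checking logic rather than the implementation shown in the figures.

```python
def deduplicate(results, nodes):
    """Yield non-duplicate results as they are processed.

    `results` is an iterable of dicts mapping node names to values;
    `nodes` is the predefined list of node names held in the array table.
    """
    array_table = {node: set() for node in nodes}
    for result in results:
        # Step 84: a node is "true" when the result's value is already stored.
        matches = [
            node in result and result[node] in array_table[node]
            for node in nodes
        ]
        if matches and all(matches):
            continue  # Step 86: every array returned "true", so discard.
        # Step 88: add the values of the current result to the array table.
        for node in nodes:
            if node in result:
                array_table[node].add(result[node])
        yield result  # Step 90: forward the result to the user.


sample = [
    {"reference": "123", "Company_name": "Total Information Access Ltd"},
    {"reference": "123", "Company_name": "Total Information Access Ltd"},
    {"reference": "124", "Company_name": "Total Information Access Ltd"},
]
unique = list(deduplicate(sample, ["reference", "Company_name"]))
```

Because the array table is empty when the first result arrives, that result is always forwarded, consistent with Steps 78 and 80. Note that each node is checked against its own array independently, so in this sketch a result whose values each match a different earlier result would also be treated as a duplicate.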
- FIGS. 11 and 12 show, respectively, an example configuration file that may be used by a search engine 7 in accordance with an embodiment of the present invention to construct the SQL query described above and a screenshot of a data mapping function that enables the configuration file to be created.
- The configuration file of FIG. 11 may be created by an administrator when the search engine is installed.
- The configuration file is a simple XML file which holds various pieces of information, namely:
- The mapping relationship is generally shown in region 124 of the configuration file. It is noted that the database fields 128 are shown on the right and the search engine/configuration fields 130 are shown on the left.
- FIG. 12 shows the data structure of the database being connected to the search engine, the search engine/configuration field names and the mapping relationship between the data structure and the field names.
- In order to set up the mapping within the configuration file, a user displays the data structure of the database in question in window area 140.
- Configuration field names are displayed in window area 142 and the user selects each field name in turn and then scans the data structure list to identify a corresponding field name. In this manner the relationship of the database to the search engine structure may be mapped and each association may be stored in window area 144 .
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A search engine for generating an improved search query, the engine comprising: input means for receiving a search request, the search request comprising N search terms; processing means arranged to formulate a search query from the received search request; output means arranged to output the search query wherein the processing means is arranged to formulate the search query by generating a plurality of search strings, each search string comprising a different combination of a subset of the N search terms.
Description
- The invention relates to an improved search engine. More particularly, the present invention relates to an improved search engine for creating a search query for retrieving stored information from a file index or remote data source and also relates to an improved de-duplication process for removing duplicate entries from received search results.
- A search engine is an information retrieval system that allows users of a computer system to specify criteria about an item of interest, “search terms”, and to have the search engine find the matching items. In the case of a text search engine, e.g. Google, the search query is typically expressed as a set of words.
- In order to speed up the search process, a search engine will typically collect metadata about the group of items beforehand in a process referred to as indexing. An index typically requires a smaller amount of computer storage and provides a basis for the search engine to calculate the item relevance.
- Desktop search is the name given to search tools which search the content of a user's hard drive rather than the Internet. Such tools may find information including web browser histories, email archives, text documents, sound files etc. Such search tools can be extremely fast but may not search the entire hard drive. For example, only operating system specific applications may be searched (e.g. Microsoft documents, folders) and information contained in email or contact databases may not be included.
- Since a significant proportion of company data can be stored within unstructured data (e.g. user created directory structures) it is important that a desktop search engine be able to search within all areas of the computer.
- Desktop search engines build and maintain an index database to optimise search performance. Indexing takes place when the computer is idle and the search engine generally collects information relating to file/directory names; metadata such as titles, authors etc; and, the content of the supported data items/documents. An example of a desktop search tool is “Windows Search”, an indexed desktop search platform released by Microsoft for the Windows operating platform.
- Web search engines provide an interface to search for information on the Internet. A web search engine works by storing information about a large number of web pages which are retrieved by a web crawler, an automated web browser that follows every link it sees. The contents of each page are then indexed and stored in an index database for use in later queries. When a user enters a query into a search engine, e.g. by the use of key words, the web search engine examines its index and provides a listing of best matching web pages according to its criteria. Most search engines support Boolean operators AND, OR and NOT to further specify the search and some engines provide a proximity search which allows users to specify the distance between keywords.
- Given the current size and speed of growth of the Internet it is important that the initial search query is relevant in order that relevant search results are returned. The usefulness of a search engine also depends on the relevance of the result set it gives back and one of the main problems with current search engines is the tendency of the results set to contain duplicate search results.
- De-duplication of search results is currently handled by means of hash algorithms in which each chunk of data is processed by a hash algorithm thereby generating a unique number which is stored in an index. When a piece of data receives a hash number, that number is compared with the index of other existing hash numbers. If the hash number is already in the index then the piece of data is considered a duplicate and is not stored. Otherwise the new hash number is added to the index and the new data is stored. In some cases, however, the hash algorithm may produce the same hash number for two different chunks of data. When such a hash collision occurs, the system will not store the new data because it sees its hash number already exists in the data index. Such false positives can result in lost data. It is also noted that hash algorithms are complex.
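The hash-based approach described above can be sketched as follows. SHA-256 from Python's standard `hashlib` module is used purely as an example hash algorithm, and the function name is hypothetical; the text does not specify which algorithm existing systems use.

```python
import hashlib


def store_unique(chunks):
    """Keep only chunks whose hash number is not already in the index."""
    index = set()   # hash numbers seen so far
    stored = []     # data that was actually stored
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in index:
            # The hash is already indexed, so the chunk is treated as a
            # duplicate. A colliding but different chunk would be lost here.
            continue
        index.add(digest)
        stored.append(chunk)
    return stored


kept = store_unique([b"alpha", b"beta", b"alpha"])
```

The comparison is made on the hash number alone, which is why a collision between two different chunks leads to the false positive described above.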
- A further drawback of known search engines is a limitation in the types of data source they can search. Traditionally, search engines index and search unstructured data resources. This therefore leaves large amounts of data that is tied into structured data stores, e.g. databases, that cannot be accessed by the traditional search engine. If the structured data is indexed separately then this index may be made available to the search engine but this creates a further data store for data which is already indexed within its own structure.
- It is therefore an object of the present invention to provide a search engine that overcomes or substantially mitigates the above problems with the prior art.
- According to a first aspect the present invention provides a search engine for generating an improved search query, the engine comprising: input means for receiving a search request, the search request comprising N search terms; processing means arranged to formulate a search query from the received search request; output means arranged to output the search query wherein the processing means is arranged to formulate the search query by generating a plurality of search strings, each search string comprising a different combination of a subset of the N search terms.
- The first aspect of the present invention provides a search engine that is arranged to take a search request that has been entered by a user and to re-arrange and mix up the terms contained within the request in order to form a search query (a search query generation engine). It is noted that by splitting up the input search phrase in this way into smaller groupings the search that is performed by the search engine is more likely to return an accurate set of results.
- The search query may be output to a local file index, for example if the search engine is part of a desktop PC, or may be output to a server based file index or a remote database.
- Preferably each search string comprises a different combination of M search terms, where N>M. Conveniently, the number of search terms in each of the plurality of search strings may vary between 1 and M. It is noted that the search query that is created by the search engine may contain any number of search terms from the input search request.
- Optionally, the processing means may be arranged to formulate the search query by determining the combinations of M search terms drawn from the N search terms (NCM), each combination being a different search string. It is noted that this option would ensure that all possible combinations of input search terms are formulated into search queries.
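Generating every combination of M terms from N input terms can be sketched with the standard `itertools.combinations` function. The function names are hypothetical, and joining terms with a space is an assumption about how a search string is formed.

```python
from itertools import combinations


def search_strings(terms, m):
    """Return every combination of m terms drawn from the N input terms."""
    return [" ".join(c) for c in combinations(terms, m)]


def search_strings_up_to(terms, m):
    """Return all combinations whose size varies between 1 and m."""
    result = []
    for size in range(1, m + 1):
        result.extend(search_strings(terms, size))
    return result


terms = ["European", "Space", "agency"]
pairs = search_strings(terms, 2)
# pairs == ["European Space", "European agency", "Space agency"]
```

For a request of six terms taken three at a time this yields 20 search strings, so the number of strings grows quickly with N, which may be one reason to prefer the predetermined arrangements described next.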
- As an alternative to the above formulation of search queries, the processing means may be arranged to formulate the search query based upon predetermined arrangements of search terms for a search request comprising N terms. For example, the search engine may be pre-programmed with a generic format for the search queries for an input search request of a given length N. To take one (non limiting) example by way of explanation, the search engine may receive a search request comprising 6 search terms. It may then look up a number of pre-programmed arrangements of those search terms that should be used whenever the input search request comprises 6 terms. So, one arrangement may comprise
terms - In cases where the search engine comprises predetermined arrangements of search terms, the processing means may be arranged to assign a different identifier to each search term received in the search request and the predetermined arrangements may comprise different arrangements and selections of identifiers.
- Optionally, the processing means may be arranged to group two or more search terms together and assign the grouped terms to a single identifier.
- Conveniently, where the search terms comprise text, the processing means may be arranged to perform a synonym lookup on the received search terms and additionally may be arranged to substitute any search terms with their synonyms to form additional search strings for inclusion in the search query.
- Conveniently, the output means may be arranged to output the search query to a file index comprising an index database of file information and the processing means may be arranged to generate the search query in the form of an SQL query.
- Conveniently, the output means is arranged to output the search query to a structured data source. Preferably, where the search query is output to a structured data source, the search engine further comprises a configuration file for the structured data source, the configuration file having connectivity information relating to the structured data source to allow the processing means to generate the search query.
- The search engine of the first aspect of the invention may also be conveniently combined with a de-duplication engine which is arranged to remove duplicate search results that are returned as a result of the generated search query. Conveniently, the de-duplication engine is arranged to output a results set to a user and comprises: search result input means arranged to receive a plurality of search results, each search result comprising one or more data items; de-duplication means arranged to remove duplicate search results from the plurality of search results received at the search result input means and to generate a results set; a stored data set comprising one or more data categories and one or more data items distributed throughout the one or more data categories; results set output means arranged to output the results set wherein the de-duplication means is arranged to compare the stored data set to the search results and to determine if a first search result is a duplicate result based on the following criteria: a) if the content of all of the one or more data categories in the stored data set are determined to match data items in the first search result then the first search result is determined to be a duplicate and is discarded; or b) if the content of at least one of the data categories in the stored data set is determined not to match any data items in the first search result then the first search result is added to the results set.
- According to a second aspect the present invention provides a search engine for outputting a results set to a user, the search engine comprising: input means arranged to receive a plurality of search results, each search result comprising one or more data items; processing means arranged to remove duplicate search results from the plurality of search results received at the input means and to generate a results set; a stored data set comprising one or more data categories and one or more data items distributed throughout the one or more data categories; output means arranged to output the results set wherein the processing means is arranged to compare the stored data set to the search results and to determine if a first search result is a duplicate result based on the following criteria: a) if the content of all of the one or more data categories in the stored data set are determined to match data items in the first search result then the first search result is determined to be a duplicate and is discarded; or b) if the content of at least one of the data categories in the stored data set is determined not to match any data items in the first search result then the first search result is added to the results set.
- As noted above one of the main problems with known search engines is the de-duplication of the results. The second aspect of the present invention provides a mechanism for mechanically removing duplicate results and outputting a results set to a user (also referred to herein as a de-duplication engine). Received search results are compared against a stored data set that comprises one or more data categories. Where the content in the stored data set matches all content items in a given search result that result is deemed a duplicate and is discarded. Where at least some of the content in the stored data set does not match the search result then that search result is deemed to be a new search result and is added to the results set. An output means (results set output means) is then arranged to output the results set (e.g. to a user).
- The second aspect of the present invention therefore provides a way of improving the set of search results returned to a user of a search engine.
- In a preferred version of this aspect of the invention, the stored data set comprises an array of XML fields (“XML nodes”) and the received search results are either in an XML format or are transformed into an XML format. The content of each XML field in the stored array can then be compared to the received search result to determine (i) if the search result comprises that field and (ii) if it does comprise that field, whether the value of that field matches what is already in the stored array.
- Preferably, if the first search result is determined to be a new result (i.e. it is not a duplicate) and is added to the results set, the processing means (de-duplication means) may be arranged to add the one or more data items of the first search result to the stored data set. This ensures that as further search results are received they are compared against the data items of all the previous search results to have been received thereby improving the efficiency of the system.
- Conveniently, the processing means may be arranged to take each of the plurality of search results in turn and compare them to the stored set of data items in order to determine if each search result is a duplicate result based on the criteria (a) and (b) above.
- As noted above, the search results may be received in a format that the search engine can process. In the event that the search results cannot immediately be used however the processing means may be arranged to transform each search result received at the input means into a structured data format comprising one or more data categories, each of the one or more data categories containing a data item. For example, the search results may be transformed into an XML format wherein each data category is an XML field (e.g. an address field) and each data item is the value of that field (e.g. an actual street name or postcode).
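A minimal sketch of such a transformation, using Python's standard `xml.etree.ElementTree` module, is given below. The `result` element name, the function name and the sample record are hypothetical, and the sketch assumes that every data category is already a valid XML element name.

```python
import xml.etree.ElementTree as ET


def to_xml(record):
    """Turn a flat record into XML: one field per data category."""
    result = ET.Element("result")
    for category, item in record.items():
        # Each data category becomes an XML field; the data item is its value.
        ET.SubElement(result, category).text = str(item)
    return ET.tostring(result, encoding="unicode")


xml_result = to_xml({"address": "1 High Street", "postcode": "AB1 2CD"})
# xml_result == "<result><address>1 High Street</address><postcode>AB1 2CD</postcode></result>"
```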
- Conveniently, the plurality of search results may be received at the input means (search result input means) in the form of a structured data format comprising one or more data categories, each of the one or more data categories comprising a data item.
- Preferably, the stored data may be stored in the structured data set format. For example, the stored data set may be an array of XML fields.
- Preferably, the stored data set may be in the form of an array table comprising one or more arrays, each array being associated with a data category and the one or more data items being distributed throughout the one or more arrays. For example, the array table may be comprised of a number of arrays (or XML fields).
- Preferably, where the stored data is in the form of an array table, the processing means may be arranged to determine if the first search result comprises all of the data categories in the array table and further the processing means may be arranged to add the first result to the results set if any of the data categories in the array table are not present in the first search result.
- For a given search result, all the data categories in the array table may also be present in the search result. In such instances the processing means may be arranged to compare the content of each data category in the array table against the content of the corresponding data category in the first search result in order to determine if the first data category is a duplicate result. Conveniently, if, for each and all corresponding data categories, the same data item is present in both the array table and the first search result then the first search result may be determined to be a duplicate result and may be discarded. In this example, the array table and search result may contain exactly the same data categories, e.g. the array table may comprise three categories, Address 1, Postcode and Contact Name, which are also present in the search result. In order to determine if the result is a duplicate the value of the data item in each of those categories may be compared and if the search result comprises values that match a value within the corresponding category in the array table then the search result may be determined to be duplicate.
- In the event that any data category in the array table does not match to the corresponding data category in the first search result then the first search result may be determined to be a new result (i.e. not a duplicate result) and may be added to the results set.
- Preferably, the processing means may be arranged to take each of the plurality of search results in turn and compare them to the array table in order to determine if each search result is a duplicate result.
- Conveniently, search results may be received or transformed into an XML format comprising one or more XML nodes, each of the one or more XML nodes corresponding to a data category.
- Conveniently, the stored data set may initially be empty and may be populated with data items common to the first two search results received at the input means. This conveniently allows the search engine to define appropriate data categories on a search-by-search basis.
- Alternatively, the stored data set may initially be empty but may comprise pre-defined data categories. In this option, the array table may effectively be predefined before the search is initiated.
- Preferably, the search engine may further comprise a configuration file defining a plurality of data categories, the one or more data categories in the stored data set being chosen from the plurality of data categories in the configuration file.
- According to a third aspect, the present invention provides a search engine for generating a search query and for processing received search results comprising: a search engine for generating an improved search query according to the first aspect of the present invention; and a search engine for outputting a results set to a user according to the second aspect of the present invention.
- The third aspect of the present invention conveniently provides a search engine that comprises the features of both the first and second aspects of the invention. It is noted that the third aspect of the invention may comprise preferred/optional features of the first and/or second aspects of the invention.
- According to a fourth aspect the present invention provides a method of mapping a data structure of a data source to a predetermined set of data labels, the method comprising: displaying a set of predetermined data labels; displaying the data structure of the data source, the data structure of the data source comprising a plurality of data field names; comparing the set of data labels to the data structure of the data source and identifying data field names that correspond to data labels in the set of predetermined data labels; storing the relationship identified in the comparing step.
- In the above aspects of the present invention the search engine may construct a SQL query for a database. In order that the search engine and database are able to interact with each other the fourth aspect of the present invention conveniently provides a method of mapping the data structure of a data source (e.g. a database) to a predetermined set of data labels (e.g. the data labels of the search engine).
- Conveniently, the relationships identified in the comparing step may be stored in an XML format. Preferably, the data source may be a database and the relationship may be stored with login information for the database.
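Storing the identified relationships in XML together with login information can be sketched as follows. All element and attribute names (`configuration`, `connection`, `mapping`, `field`, `label`, `source`) and the sample values are hypothetical; the layout of the actual configuration file is the one shown in FIG. 11.

```python
import xml.etree.ElementTree as ET


def build_config(connection, mapping):
    """Serialise login details and label-to-field relationships as XML."""
    root = ET.Element("configuration")
    conn = ET.SubElement(root, "connection")
    for key, value in connection.items():
        ET.SubElement(conn, key).text = value
    fields = ET.SubElement(root, "mapping")
    for label, db_field in mapping.items():
        # One entry per relationship: search engine label -> database field.
        ET.SubElement(fields, "field", {"label": label, "source": db_field})
    return ET.tostring(root, encoding="unicode")


config_xml = build_config(
    {"server": "db.example.local", "user": "search"},
    {"AUTHOR": "CUSTOMERNUMBER"},
)
```

The sketch stores the details in plain text; a real deployment would need to protect the login information, for example by encrypting the file.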
- The invention also extends to a method of generating a search query and a method of outputting a results set to a user.
- The present invention also extends to a computer program when embodied on a record medium/read-only memory/electrical carrier signal or stored in a computer memory, the computer program comprising program instructions for causing a computer to perform the process of the method of the present invention.
- In order that the invention may be more readily understood, reference will now be made, by way of example, to the accompanying drawings in which:
- FIG. 1a is a schematic of a desktop version of a search engine in accordance with an embodiment of the present invention;
- FIG. 1b is a schematic of a server based version of a search engine in accordance with an embodiment of the present invention;
- FIG. 2 is an alternative schematic of a search engine in accordance with an embodiment of the present invention;
- FIG. 3 is a flow chart showing how a search query is created and search results are processed on a desktop computer in accordance with an embodiment of the present invention;
- FIG. 4 is a flow chart showing how a search query is created and search results are processed on a server computer in accordance with an embodiment of the present invention;
- FIG. 5a illustrates how combinations of search terms used to construct a search query are created in accordance with an embodiment of the present invention;
- FIG. 5b shows an alternative to FIG. 5a;
- FIG. 6 shows an example of an SQL search query constructed from the combinations of FIG. 5;
- FIG. 7 shows how search results may be transformed into XML format;
- FIGS. 8, 9a and 10 show how duplicate search results are removed from the results received by the search engine in accordance with an embodiment of the present invention;
- FIG. 9b is a flow chart that shows an alternative technique (to that of FIG. 9a) for processing the search results in accordance with an embodiment of the present invention;
- FIG. 11 shows an example of a configuration file that may be used in accordance with an embodiment of the present invention to construct a SQL query;
- FIG. 12 is a screenshot showing how features of a database may be mapped to features of a search engine in accordance with an embodiment of the present invention.
- It is noted that throughout the Figures and description below, like numerals are used to denote like features.
- Turning to FIG. 1a, a desktop computer 1 is depicted. The computer 1 comprises a data store 3 (e.g. a hard drive) upon which are stored items of information that a user may wish to search through and an index file database 5 which contains data such as file names, metadata and content information relating to the items of interest in the data store.
- A search engine 7 in accordance with an embodiment of the present invention is also provided in the computer 1. The search engine has an input means 9 through which a user may input a number of search terms (for example via entry of text on a keyboard), a processing means 10 for processing received search requests into search queries or de-duplicating received search results (received from, for example, the database 5) and an output means 11 through which a set of search results may be output for display, for example, on a display screen 13 of the computer 1. Upon receiving a search request from a user the search engine 7 communicates with the index database 5, as described in more detail with reference to FIG. 3 below.
-
FIG. 1b shows a version of the present invention in which a search engine according to an embodiment of the present invention is located on a server computer 15. It is noted that like numerals are used to denote like features between FIGS. 1a and 1b.
- In FIG. 1b the server 15 is connected via a telecommunications network 17 (for example the Internet or a local area network) to two desktop computers case databases 19, 21).
- Each desktop data store file index database
- The server 15 comprises a search engine in accordance with an embodiment of the present invention, a local file index 23 and a local data store 25. In use, the server 15 may receive a search request from any connected computer resource FIG. 4 below, and sent to the file indexes databases
- Although the desktop version of the invention is shown connected to a file index database only, it is noted that the search engine could also access an external database in a manner similar to that shown in FIG. 1b. However, for the purposes of the following description it is assumed that a desktop version searches a file index only and a server version of the search engine searches local and remote file indexes as well as external data sources.
-
FIG. 2 shows an alternative schematic for a search engine in accordance with an embodiment of the present invention. As shown in FIG. 2 the search engine 7 is in communication with documents, databases and other applications. The search engine comprises a number of components such as: application connectors 29 for facilitating the interface between the search engine and the applications; secure file indexes 31; and modules for controlling user interaction 33 and access management 35. As noted above in relation to FIGS. 1a and 1b, the search engine may be embodied on a desktop or may be accessed via a web browser.
- It is noted in FIGS. 1a, 1b and 2 that the processing means 10 is shown as a single component, capable of both generating search queries and de-duplicating received search results. It is however to be appreciated that there may be more than one processing means, for example one to generate search queries and one to de-duplicate received results.
- It is further to be noted that references below to the search engine 7 generating search queries/search term combinations or de-duplicating results into a results set should be understood as meaning the processing means 10 (either single or multiple processing components) within the search engine 7 performing these functions.
-
FIG. 3 is a flow chart showing an overview of the search process at a desktop computer 1 comprising a search engine 7 in accordance with an embodiment of the present invention such as that shown in FIG. 1a.
- At Step 30, the search engine receives via the input means 9 a search request from a user. Such a search request would normally comprise a plurality of text-based search terms (e.g. BBC News Archive European space agency).
- Upon receipt of the search terms the search engine 7 creates a number of combinations of a subset of the search terms. For example, in the search request example above there are six separate terms. In Step 32, therefore, the search engine may create a number of groups of up to, for example, four words. The creation of such combinations of search terms increases the chances of returning a true set of results. The creation of combinations of search terms is described in greater detail in relation to FIG. 5 below.
- In Step 34, the combinations of search terms are combined into a single search query (e.g. in basic SQL format) which can then be sent to the file index database 5. An example of a typical SQL query is shown in FIG. 6.
-
Steps 30 to 34 (above dotted line 35 in FIG. 3) relate to the generation of an improved search query in accordance with the first aspect of the present invention. Steps line 35 in FIG. 3) relate to the outputting of a results set to the user in accordance with the second aspect of the present invention.
- At Step 36, the search engine 7 receives a plurality of search results from the file index 5. These results may be received in XML format or may be transformed into XML format by the search engine as described in relation to FIG. 7 below.
- At Step 38 the various elements (or nodes) of the XML results are compared against stored nodes in order to remove duplicate search results. This process is described in detail with reference to FIGS. 8 to 10 below. Unique search results are incorporated into a results set during Step 38 and are then output (at Step 40) to the user.
-
FIG. 4 is a flow chart showing a similar process to FIG. 3 except in this case the search engine 7 is located on a server 15 and may access data held in an external database FIG. 3 are referred to using the same reference numerals. It is noted that the file index referred to in FIG. 4 may either be a local index 23 on a server or may be a file index 5 held on a remotely located user computer.
- Upon receipt of the user input search terms the search engine in FIG. 4 creates search term combinations (Step 32), creates a search query (Step 34), receives search results (Step 36), de-duplicates the results (Step 38) and outputs a result set (Step 40) as per FIG. 3.
- However, the receipt of the search terms in Step 30 additionally causes, in Step 42, the search engine 7 to decrypt a database configuration file which stores details of the external database
- Following decryption of the configuration file a database reports SQL is created in Step 44 and sent to the database (an example SQL is shown in FIG. 6). The search engine then moves to Step 36 and the database results are combined with those from the file index and processed in the same way.
-
FIG. 5a shows the process of creating various search term combinations in more detail compared to Step 32 shown in FIGS. 3 and 4. Alongside each box of the process depicted in FIG. 5a is a worked example based on the previously mentioned search request of “BBC News Archive European Space agency”.
- As shown in FIGS. 3 and 4, the user initially, at Step 30, enters his search terms into a display window of the display 13 of his computer. The search engine then, in Step 46, places the search terms into a numbered array such that each search term is associated with its own numerical identifier (in this case “1” through “6”).
- A stored arrangement of search phrases for an input string of six search terms is then retrieved by the search engine in Step 48 and the current search terms are arranged as per the stored arrangement in Step 50.
- The output of Step 50 is nine separate combinations of search terms (nine different search strings) that can be placed into a search query (Step 34 in FIGS. 3 and 4).
- Optionally, in Step 52, the search engine 7 may also perform a synonym lookup and create further search combinations based on the lookup results. In this case search combinations Step 34 in FIGS. 3 and 4).
- It is noted that in addition to the process of creating search term combinations as described above, the system according to an embodiment of the present invention may also handle complex search query combinations, for example:
- BBC+News+Archive European space agency
- In this embodiment the search engine would not arrange the words “BBC News Archive” (and would treat them as a single search term) but would arrange the words “European”, “space” and “agency”. This variation of the invention, along with a further variation in which alternate terms are selected, is shown in FIG. 5b.
- FIG. 6 shows an example of an SQL query that may be constructed from the eleven search combinations depicted in FIG. 5a.
- It is noted that in accordance with embodiments of the present invention the search language used when entering search requests may be configurable based on the user's preferences. For instance, an example of the search language may be:
- SELECT TEXT=‘Text to Search’
- STARTDATE=‘A Date’
- ENDDATE=‘A Date’
- AUTHOR=‘Author Name’
- FILENAME=‘A Filename’
- In this example the search criteria (which are shown in italics) may be changed to suit a specific organisation's data structures, e.g. AUTHOR may be replaced with CUSTOMERNUMBER.
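- As a minimal sketch of the combination and query-building steps described above (Steps 46 to 50 and Step 34), the following Python may be considered. It is not the arrangement of FIG. 5a, which is not reproduced in the text: the sketch assumes contiguous groups of two to four words, and the table name `file_index`, the column name `content` and both function names are hypothetical.

```python
def term_combinations(terms, max_len=4, min_len=2):
    """Return contiguous word groups of min_len..max_len terms (assumed arrangement)."""
    # Step 46: number the terms so that each has its own identifier (1-based).
    numbered = dict(enumerate(terms, start=1))
    combos = []
    # Steps 48/50 stand-in: longest groups first, sliding one term at a time.
    for size in range(max_len, min_len - 1, -1):
        for start in range(1, len(terms) - size + 2):
            combos.append(" ".join(numbered[i] for i in range(start, start + size)))
    return combos


def build_query(combos, table="file_index", column="content"):
    """Combine all combinations into one parameterised SQL query (Step 34)."""
    where = " OR ".join(f"{column} LIKE ?" for _ in combos)
    sql = f"SELECT * FROM {table} WHERE {where}"
    params = [f"%{combo}%" for combo in combos]
    return sql, params


terms = "BBC News Archive European space agency".split()
combos = term_combinations(terms)
sql, params = build_query(combos)
```

With this assumed arrangement the six example terms yield twelve groups rather than the nine of FIG. 5a. The search strings are passed as query parameters rather than concatenated into the SQL text, which is a design choice of the sketch and is not stated in the description.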
- The output from the SQL query that is sent to the file index/external database is an SQL return.
FIG. 7 is an example of the format of such a return for two results. The return may be in an XML compatible format but, in the event that it is not in XML format, may be transformed into XML format in one of the following two ways: -
- 1) For each result in the output, the fieldname data may be normalised to an XML syntax compliant element name; or
- 2) A mapping file stored on the computer containing the relationships between fieldnames and XML elements may be consulted. The data output in each field may then be mapped to an XML element and added to the output. Each element may then be closed with a tag.
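- The two transformation options listed above may be sketched as follows. This is an illustrative sketch only: the element names `results` and `result`, the function names and the representation of each SQL return row as a Python dict are assumptions, and the mapping file of option 2 is represented here by an in-memory dict.

```python
import re
import xml.etree.ElementTree as ET


def to_element_name(fieldname):
    """Option 1: normalise a field name to an XML-compliant element name."""
    name = re.sub(r"[^A-Za-z0-9_.-]", "_", fieldname.strip())
    # An element name may not start with a digit, '.' or '-'.
    if not re.match(r"[A-Za-z_]", name):
        name = "_" + name
    return name


def rows_to_xml(rows, mapping=None):
    """Convert result rows (dicts) to XML; `mapping` plays the role of option 2."""
    root = ET.Element("results")
    for row in rows:
        result = ET.SubElement(root, "result")
        for field, value in row.items():
            if mapping and field in mapping:
                tag = mapping[field]          # option 2: consult the mapping
            else:
                tag = to_element_name(field)  # option 1: normalise the name
            ET.SubElement(result, tag).text = str(value)
    # Serialisation closes each element with its end tag.
    return ET.tostring(root, encoding="unicode")
```

For example, a field named “60 Days” is not a valid XML element name, so option 1 rewrites it as `_60_Days`, whereas a mapping entry could instead assign it any chosen element name.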
-
FIG. 8 shows three XML search results that have been received by a search engine in accordance with an embodiment of the present invention. Each XML result comprises a number of classes or nodes, each of which contains (or can contain) search data. It is noted that the XML results of FIG. 8 are an example of the general format shown in FIG. 7.
- Taking XML result 1 as an example, the nodes are “reference”, “Company_name”, “creditLimit”, “lastInvoice”, “outstanding”, “60 Days” and “90 Days”.
- Each of these nodes is associated with various search data, e.g. the “creditLimit” node contains the value “100,000” and the “outstanding” node contains the value “3000.00”.
- The process of de-duplication of search results by the search engine will now be described with reference to the flow chart of FIG. 9a and the array tables of FIG. 10.
- In Step 36 of FIG. 9a, XML search results are received. It is noted that the results may be transformed into XML format as described above in relation to FIG. 7.
- At Step 60, the first two XML results are selected and, at Step 62, a common class is identified. In this case it is noted that nodes “reference” and “Company_name” are common to the two results and the search engine takes these two nodes and creates two arrays holding the values of these common nodes.
- The search engine 7 then chooses, in Step 64, up to three further node identifiers at random from those contained in Result 1 and Result 2 and creates further arrays for these nodes. It is also noted that the search engine may hold a configuration file (as described in relation to FIGS. 11 and 12 below) that details the various node identifiers that are present in the file index/external database and the further node identifiers may be selected from this configuration file. In any case it is noted that the nodes that are present in Results
- The outcome of Steps FIG. 10. It can be seen in this example that the search engine has chosen the node “Address2” from XML Result 2, node “postcode” from XML Result 2 and node “creditLimit” from XML Result 1. It is noted that the individual search data contained in these nodes in Results
- It is noted that the number of nodes compared in the de-duplication process may be altered depending on the set up of the search engine or depending on the context of the search. For example, if emails are being searched then the system may be set up such that only a few items of meta-data are searched (e.g. “From”, “Date”, “Subject”, “Id” fields), or possibly only a single meta-data item (e.g. “Id”). By contrast, if an accounting application is being searched then more fields may be appropriate (e.g. “customer”, “supplier”, “outstanding amounts”, “current address”, “last order” etc.).
- The search engine then begins the process of checking each search result against the array table to determine if the result is a duplicate entry.
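- A minimal sketch of this check is given below. It assumes that each XML result has already been parsed into a Python dict of node name to value, that the list of compared nodes is fixed in advance, and that the array table starts empty; the function name and the example values for the first result are hypothetical.

```python
def deduplicate(results, nodes):
    """Yield the results that are not duplicates under the array-table check."""
    # One array (here a set) of previously seen values per compared node.
    table = {node: set() for node in nodes}
    for result in results:
        # A node is "true" when its value is already held in that node's array.
        # A missing node is read as None, so two results that both lack a node
        # match on it (an assumption of this sketch).
        flags = [result.get(node) in table[node] for node in nodes]
        if flags and all(flags):
            continue  # every compared node matched: discard as a duplicate
        # Not a duplicate: add its values to the array table and keep it.
        for node in nodes:
            table[node].add(result.get(node))
        yield result


results = [
    {"reference": "120", "Company_name": "Example Co"},
    {"reference": "120", "Company_name": "Example Co"},
    {"reference": "123", "Company_name": "Total Information Access Ltd"},
]
unique = list(deduplicate(results, ["reference", "Company_name"]))
```

Because each node is checked against its own array, a result whose values each match a different earlier result is also treated as a duplicate in this sketch; comparing more nodes reduces the chance of such a coincidence.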
- Therefore, in Step 68, the search engine takes XML result 2 and, in Step 70, checks its nodes against those contained in the array table 100. Where the search data held in a node matches the values in the array table, the search engine designates this as “true”. Where the search data held in a node does not match the values in the array table, the search engine designates this as “false”.
- The checking function of Step 70 is represented by the array table 102 in FIG. 10 in which it can be seen that the search data of XML result 2 has been matched to Arrays Arrays 3 to 5.
- Where the search engine determines that the XML result comprises matching search data for each node in the array table, the result is deemed to be a duplicate and, in Step 72, is discarded.
- Where the search engine determines that the XML result does not comprise matching search data for each node in the array table, the result is added to a Results set, in Step 74.
- Following Step 72/74, the search engine moves to the next search result (in Step 76) and then cycles back to Step 70 to check the search data held in the nodes of the next search result.
- When a search result is added to the Results set, the search data contained within Result 2 is added to the array table. This is shown in array table 104 in FIG. 10 in which the additional search data from Result 2 is shown in italic font and the search data from Step 62 is shown in bold font.
- As the process of FIG. 9a is repeated for Result 3, the array table 106 is created at Step 70. It can be seen that Result 3 is not a duplicate result as Arrays
- Since Result 3 is not a duplicate result its search data is added to the array table as previously described above with reference to Result 2. The resultant array table 108 can be seen in FIG. 10 where, in addition to the search data from earlier array table 104, the table now comprises the values “123” in the reference node and “Total Information Access Ltd” in the Company_name node.
-
FIG. 9b is a flow chart that shows an alternative technique (to that of FIG. 9a) for processing the search results in accordance with an embodiment of the present invention. In FIG. 9b the nodes contained within the array table are predefined which means that Steps 60 to 66 shown in FIG. 9a are not required.
- Turning to FIG. 9b, the XML search results are received in Step 36.
- In Step 78 the values contained within XML result 1 are added to the array table (in so far as there are common nodes). As XML result 1 is the first result it cannot be a duplicate and so in Step 80 is sent to the user.
- In Step 82 the next result in the received search results is selected and in Step 84 is checked against the values held in the array table. It is noted that this checking step is the same as described above in relation to FIG. 10.
- If all the arrays within the array table return a “true” designation, the result in question is deemed a duplicate and is discarded in Step 86. The search engine 7 then cycles round to Step 82 and selects the next result.
- If any array returns a “false” designation then the result is not a duplicate. In Step 88 the values of the current result are added to the array table and in Step 90 the result is forwarded to the user. The search engine then cycles round to Step 82 and the next result is selected.
- FIG. 9b therefore shows an embodiment where the array table has been predefined, e.g. by the user, and also shows an embodiment where results are returned to the user as they are processed.
-
FIGS. 11 and 12 show, respectively, an example configuration file that may be used by a search engine 7 in accordance with an embodiment of the present invention to construct the SQL query described above and a screenshot of a data mapping function that enables the configuration file to be created.
- The configuration file of FIG. 11 may be created by an administrator when the search engine is installed. As can be seen from the Figure, the configuration file is a simple XML file which holds various pieces of information, namely:
- 1) Database login details (see username 120 and password 122 details);
- 2) System access levels;
- 3) The mapping relationship between the databases that the search engine accesses and the XML elements used in the de-duplication process (see reference numeral 124);
- 4) Initial SQL 126 for the SQL query.
- It is noted that in any one configuration file there would be one or more of these database connections.
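- By way of illustration only, a configuration file of this general kind might be read as sketched below. The layout of FIG. 11 is not reproduced in the text, so every element name, attribute name and value in `CONFIG`, and both function names, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration holding one database connection: login details,
# the field mapping and the initial SQL.
CONFIG = """
<connections>
  <connection name="accounts">
    <login username="reader" password="secret"/>
    <mapping>
      <field database="CUST_REF" search="reference"/>
      <field database="CUST_NAME" search="Company_name"/>
    </mapping>
    <initialSql>SELECT CUST_REF, CUST_NAME FROM customers</initialSql>
  </connection>
</connections>
"""


def load_mappings(config_xml):
    """Return {connection name: {database field: search engine field}}."""
    root = ET.fromstring(config_xml)
    mappings = {}
    for conn in root.findall("connection"):
        mappings[conn.get("name")] = {
            field.get("database"): field.get("search")
            for field in conn.findall("mapping/field")
        }
    return mappings


def rename_fields(row, mapping):
    """Relabel a database row with the search engine field names."""
    return {mapping.get(name, name): value for name, value in row.items()}


mappings = load_mappings(CONFIG)
row = rename_fields({"CUST_REF": "123"}, mappings["accounts"])
```

Relabelling rows in this way would allow results from different databases to be compared on the same node names during de-duplication. The password is shown in plain text only to keep the sketch short; the description above indicates that the configuration file is decrypted before use.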
- The mapping relationship is generally shown in region 124 of the configuration file. It is noted that the database fields 128 are shown on the right and the search engine/configuration fields 130 are shown on the left.
- The mechanism by which the configuration file is constructed is shown in FIG. 12 which shows the data structure of the database being connected to the search engine, the search engine/configuration field names and the mapping relationship between the data structure and the field names.
- In order to set up the mapping within the configuration file, a user displays the data structure of the database in question in window area 140. Configuration field names are displayed in window area 142 and the user selects each field name in turn and then scans the data structure list to identify a corresponding field name. In this manner the relationship of the database to the search engine structure may be mapped and each association may be stored in window area 144.
- It is noted that further databases may be added by selecting button 146 and then browsing the computer system for the database in question. User specific SQL commands may be added via window area 148.
- It will be understood that the embodiments described above are given by way of example only and are not intended to limit the invention, the scope of which is defined in the appended claims. It will also be understood that the embodiments described may be used individually or in combination.
Claims (7)
1.-31. (canceled)
32. A method of mapping a data structure of a data source to a predetermined set of data labels, the method comprising:
displaying a set of predetermined data labels;
displaying the data structure of the data source, the data structure of the data source comprising a plurality of data field names;
comparing the set of data labels to the data structure of the data source and identifying data field names that correspond to data labels in the set of predetermined data labels; and
storing the relationship identified in the comparing step.
33. A method as claimed in claim 32 , wherein the relationships identified in the comparing step are stored in an XML format.
34. A method as claimed in claim 32 , wherein the data source is a database.
35. A method as claimed in claim 34 , wherein the relationship is stored with login information for the database.
36. A method as claimed in claim 33 , wherein the data source is a database.
37. A method as claimed in claim 36 , wherein the relationship is stored with login information for the database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/143,402 US20160283607A1 (en) | 2008-03-13 | 2016-04-29 | Search engine |
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0804695.5 | 2008-03-13 | ||
GB0804695A GB2458309A (en) | 2008-03-13 | 2008-03-13 | Search engine |
PCT/GB2009/050240 WO2009112862A2 (en) | 2008-03-13 | 2009-03-12 | Improved search engine |
US92238010A | 2010-11-16 | 2010-11-16 | |
US13/909,898 US9330178B2 (en) | 2008-03-13 | 2013-07-12 | Search engine |
US15/143,402 US20160283607A1 (en) | 2008-03-13 | 2016-04-29 | Search engine |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/909,898 Continuation US9330178B2 (en) | 2008-03-13 | 2013-07-12 | Search engine |
Publications (1)
Publication Number | Publication Date |
---|---|
US20160283607A1 true US20160283607A1 (en) | 2016-09-29 |
Family
ID=39328074
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/922,380 Active US8489573B2 (en) | 2008-03-13 | 2009-03-12 | Search engine |
US13/909,898 Active US9330178B2 (en) | 2008-03-13 | 2013-07-12 | Search engine |
US15/143,402 Abandoned US20160283607A1 (en) | 2008-03-13 | 2016-04-29 | Search engine |
Country Status (6)
Country | Link |
---|---|
US (3) | US8489573B2 (en) |
EP (1) | EP2277117A2 (en) |
CN (2) | CN102027471B (en) |
CA (1) | CA2755319C (en) |
GB (1) | GB2458309A (en) |
WO (1) | WO2009112862A2 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117909389A (en) * | 2024-03-19 | 2024-04-19 | 成都虚谷伟业科技有限公司 | SQL fuzzy query method, device and storage medium |
US12189626B1 (en) * | 2023-08-08 | 2025-01-07 | Rubrik, Inc. | Automatic query optimization |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8498982B1 (en) * | 2010-07-07 | 2013-07-30 | Openlogic, Inc. | Noise reduction for content matching analysis results for protectable content |
EP2463785A1 (en) * | 2010-12-13 | 2012-06-13 | Fujitsu Limited | Database and search-engine query system |
GB201111554D0 (en) | 2011-07-06 | 2011-08-24 | Business Partners Ltd | Search index |
US8972387B2 (en) * | 2011-07-28 | 2015-03-03 | International Business Machines Corporation | Smarter search |
US20130066861A1 (en) * | 2011-09-13 | 2013-03-14 | Chacha Search, Inc. | Method and system of management of search results |
US9058392B1 (en) * | 2012-03-22 | 2015-06-16 | Google Inc. | Client state result de-duping |
EP3042500B1 (en) * | 2013-09-06 | 2022-11-02 | RealNetworks, Inc. | Metadata-based file-identification systems and methods |
US9552378B2 (en) * | 2013-11-21 | 2017-01-24 | Adobe Systems Incorporated | Method and apparatus for saving search query as metadata with an image |
RU2014125471A (en) | 2014-06-24 | 2015-12-27 | Общество С Ограниченной Ответственностью "Яндекс" | SEARCH QUERY PROCESSING METHOD AND SERVER |
US10152488B2 (en) | 2015-05-13 | 2018-12-11 | Samsung Electronics Co., Ltd. | Static-analysis-assisted dynamic application crawling architecture |
CN105095399B (en) * | 2015-07-06 | 2019-06-28 | 百度在线网络技术(北京)有限公司 | Search result method for pushing and device |
US10015244B1 (en) | 2016-04-29 | 2018-07-03 | Rich Media Ventures, Llc | Self-publishing workflow |
US10083672B1 (en) | 2016-04-29 | 2018-09-25 | Rich Media Ventures, Llc | Automatic customization of e-books based on reader specifications |
CN106528590B (en) * | 2016-09-18 | 2023-04-07 | 海信视像科技股份有限公司 | Query method and device |
US10685057B1 (en) * | 2016-12-30 | 2020-06-16 | Shutterstock, Inc. | Style modification of images in search results |
US20200043479A1 (en) * | 2018-08-02 | 2020-02-06 | Soundhound, Inc. | Visually presenting information relevant to a natural language conversation |
US10762114B1 (en) * | 2018-10-26 | 2020-09-01 | X Mobile Co. | Ecosystem for providing responses to user queries entered via a conversational interface |
US11140042B2 (en) * | 2019-09-18 | 2021-10-05 | Servicenow, Inc. | Dictionary-based service mapping |
US11272007B2 (en) * | 2020-07-21 | 2022-03-08 | Servicenow, Inc. | Unified agent framework including push-based discovery and real-time diagnostics features |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050165804A1 (en) * | 2003-08-04 | 2005-07-28 | Todd Rothman | System and process for generating interactive learning packages |
US20070005654A1 (en) * | 2005-05-20 | 2007-01-04 | Avichai Schachar | Systems and methods for analyzing relationships between entities |
US20070192300A1 (en) * | 2006-02-16 | 2007-08-16 | Mobile Content Networks, Inc. | Method and system for determining relevant sources, querying and merging results from multiple content sources |
US20080155192A1 (en) * | 2006-12-26 | 2008-06-26 | Takayoshi Iitsuka | Storage system |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5987446A (en) * | 1996-11-12 | 1999-11-16 | U.S. West, Inc. | Searching large collections of text using multiple search engines concurrently |
WO2000010097A1 (en) * | 1998-08-14 | 2000-02-24 | Song Oliver Yuh Shen | Universal business management system and method |
US7266553B1 (en) * | 2002-07-01 | 2007-09-04 | Microsoft Corporation | Content data indexing |
US20040064447A1 (en) * | 2002-09-27 | 2004-04-01 | Simske Steven J. | System and method for management of synonymic searching |
US8055669B1 (en) * | 2003-03-03 | 2011-11-08 | Google Inc. | Search queries improved based on query semantic information |
US7885963B2 (en) * | 2003-03-24 | 2011-02-08 | Microsoft Corporation | Free text and attribute searching of electronic program guide (EPG) data |
US7840557B1 (en) * | 2004-05-12 | 2010-11-23 | Google Inc. | Search engine cache control |
JP4189369B2 (en) * | 2004-09-24 | 2008-12-03 | 株式会社東芝 | Structured document search apparatus and structured document search method |
US20060161520A1 (en) * | 2005-01-14 | 2006-07-20 | Microsoft Corporation | System and method for generating alternative search terms |
US20070124671A1 (en) * | 2005-11-29 | 2007-05-31 | Keith Hackworth | Field name abstraction for control of data labels |
ES2452735T3 (en) * | 2006-08-25 | 2014-04-02 | Motorola Mobility Llc | Method and system for classifying data using a self-organizing map |
SG140510A1 (en) | 2006-09-01 | 2008-03-28 | Yokogawa Electric Corp | System and method for database indexing, searching and data retrieval |
US8099415B2 (en) * | 2006-09-08 | 2012-01-17 | Simply Hired, Inc. | Method and apparatus for assessing similarity between online job listings |
US20080065633A1 (en) * | 2006-09-11 | 2008-03-13 | Simply Hired, Inc. | Job Search Engine and Methods of Use |
US7836085B2 (en) * | 2007-02-05 | 2010-11-16 | Google Inc. | Searching structured geographical data |
US8166045B1 (en) * | 2007-03-30 | 2012-04-24 | Google Inc. | Phrase extraction using subphrase scoring |
CN101075229A (en) * | 2007-06-09 | 2007-11-21 | 腾讯科技(深圳)有限公司 | Method and system for analyzing phrase semantic tendency |
-
2008
- 2008-03-13 GB GB0804695A patent/GB2458309A/en not_active Withdrawn
-
2009
- 2009-03-12 CN CN200980117385.7A patent/CN102027471B/en not_active Expired - Fee Related
- 2009-03-12 WO PCT/GB2009/050240 patent/WO2009112862A2/en active Application Filing
- 2009-03-12 CA CA2755319A patent/CA2755319C/en active Active
- 2009-03-12 EP EP09719838A patent/EP2277117A2/en not_active Withdrawn
- 2009-03-12 US US12/922,380 patent/US8489573B2/en active Active
- 2009-03-12 CN CN201410593426.2A patent/CN104361038B/en not_active Expired - Fee Related
-
2013
- 2013-07-12 US US13/909,898 patent/US9330178B2/en active Active
-
2016
- 2016-04-29 US US15/143,402 patent/US20160283607A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
WO2009112862A3 (en) | 2009-11-19 |
CN104361038A (en) | 2015-02-18 |
WO2009112862A9 (en) | 2010-11-18 |
EP2277117A2 (en) | 2011-01-26 |
CN104361038B (en) | 2018-06-05 |
US8489573B2 (en) | 2013-07-16 |
US20130290290A1 (en) | 2013-10-31 |
CN102027471A (en) | 2011-04-20 |
CN102027471B (en) | 2014-12-03 |
CA2755319C (en) | 2017-10-03 |
WO2009112862A2 (en) | 2009-09-17 |
GB2458309A (en) | 2009-09-16 |
US20110055191A1 (en) | 2011-03-03 |
GB0804695D0 (en) | 2008-04-16 |
CA2755319A1 (en) | 2009-09-17 |
US9330178B2 (en) | 2016-05-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: BUSINESS PARTNERS LIMITED, UNITED KINGDOM Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:BAIN, SIMON IAN;REEL/FRAME:038426/0662 Effective date: 20101210 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |