+

CN104123288B - A kind of data query method and device - Google Patents

A kind of data query method and device Download PDF

Info

Publication number
CN104123288B
CN104123288B CN201310146187.1A CN201310146187A CN104123288B CN 104123288 B CN104123288 B CN 104123288B CN 201310146187 A CN201310146187 A CN 201310146187A CN 104123288 B CN104123288 B CN 104123288B
Authority
CN
China
Prior art keywords
query
clause
metadata
query clause
clauses
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310146187.1A
Other languages
Chinese (zh)
Other versions
CN104123288A (en
Inventor
乔平
夏晨
傅跃兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Tmall Technology Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201310146187.1A priority Critical patent/CN104123288B/en
Publication of CN104123288A publication Critical patent/CN104123288A/en
Application granted granted Critical
Publication of CN104123288B publication Critical patent/CN104123288B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a kind of data query method and device, including query statement is decomposed into more than one inquiry clause;Traversal queries clause, inquiry clause and meta data match are replaced with the implementing result preserved in metadata;Otherwise inquiry clause and more new metadata are performed;Final implementing result is obtained into query result for main inquiry.The inquiry clause performed and its implementing result are managed by the application as metadata, for inquiry clause identical in current sessions or different sessions, inquiry clause is directly replaced with to the implementing result preserved in metadata, repeating for inquiry clause is avoided, so as to improve the query performance of total system to a greater extent.Meanwhile compared with the optimization of existing query statement level, the query optimization granularity more refinement of the application, the big data inquiry being more suitable in data warehouse applications, so as to further improve the query performance of total system.

Description

Data query method and device
Technical Field
The present application relates to data storage and analysis technologies, and in particular, to a method and an apparatus for querying data in a data warehouse for mass data with less reads and writes.
Background
In many application fields such as scientific research, computer simulation, internet application, electronic commerce, and the like, the amount of data is increasing at an extremely fast rate. The explosion of information in the big data era, especially the explosive growth of unstructured data, is deeply affecting the entire data storage and analysis field.
In order to meet emerging business requirements, traditional data processing modes are gradually abandoned, and new modes are tried instead to access, process and analyze various types of data including unstructured data. In the field of data storage and analysis, MapReduce is undoubtedly the most interesting new generation technology, and developers can develop programs capable of processing massive unstructured data in parallel across processor distributed clusters or independent computers by using a MapReduce programming framework. The MapReduce is a programming model and is used for parallel operation of large-scale data, Map represents mapping, and Reduce represents simplification.
In the big data era, the traditional Structured Query Language (SQL) has not been able to satisfy all business requirements, and although mapreduce is widely applied, the programming interface exposed by the mapreduce framework is still low-level, time-consuming to develop, and code is difficult to reuse. And then, turning to a program which can convert SQL written by a user into a corresponding MapReduce program, and executing the tasks by using a MapReduce framework to process mass data on a Hadoop Distributed File System (HDFS). The most widely used at present is Hive, which is contributed by Facebook. The Hive based on the MapReduce has strong advantages in the aspects of expansibility and fault tolerance. The Hive is a data warehouse tool based on Hadoop, can map the structured data file into a database table, provides a complete SQL query function, and can convert the SQL statement into a MapReduce task for operation.
In the enterprise data mining process, the mining tasks usually executed every day are hundreds of thousands, even tens of thousands, so that the problem of repeated calculation in different mining tasks is inevitable, such as: engineer a needs to screen a batch of data of condition W from table S1 and table S2 and then combine with table S3 to regenerate the data desired by engineer a. Meanwhile, engineer B also needs to select the same data from tables S1 and S2 under the same condition W, and combine the same data with table X to reproduce the data desired by engineer B. It is not usually known between these two engineers that there are some requirements that are partially identical to each other, so in a production task the same computation will be done twice and the same common query information is repeatedly accessed. Clearly, this reduces the query performance of the overall system.
Disclosure of Invention
In order to solve the technical problem, the application provides a data query method and a data query device, which can avoid repeated operation of public query information, thereby improving the query performance of the whole system.
To achieve the object of the present application, the present application provides a data query method, including: decomposing the query statement into more than one query clause;
traversing the query clause, matching the query clause with the metadata, replacing the query clause with an execution result stored in the metadata, and otherwise, executing the query clause and updating the metadata;
using the finally obtained execution result for main query to obtain a query result;
the query clauses comprise sub-queries and main queries, the sub-queries are executed before the main queries, and execution results of the sub-queries are used for the main queries.
The method also comprises the following steps: and acquiring the query statement, and verifying the validity and the correctness of the query statement.
The method further comprises the following steps: and carrying out normalization and query optimization on the query statement.
The traversal query clause specifically includes:
performing subsequent traversal on a query clause sub-tree formed by the query clauses, matching the triple information of the query clauses with the metadata,
if the triple information of the query clause exists in the metadata database, matching the triple information of the query clause with the metadata, and replacing the query clause with a corresponding execution result; resetting the expiration time of the metadata;
if the triple information of the query clause does not exist in the metadata base, the triple information of the query clause is not matched with the metadata, the query clause is executed, the execution result is persisted, and the corresponding triple information is added into the metadata; setting the expiration time of the metadata and the level of the query clause at the query clause subtree.
When the triple information of the query clause is matched with the metadata, the method further comprises the following steps: and adding one to the number of times of using the record corresponding to the execution result.
In decomposing the query statement into a plurality of query clauses, the method further comprises: optimizing the query clause.
The method further comprises the following steps: presetting an optimization strategy of the query clause;
the optimized query clause includes: and optimizing the query clauses obtained by decomposition according to the query clause optimization strategy, and eliminating unnecessary query clauses.
The query clause optimization strategy comprises the following steps:
deleting the query clause when the input set and the output set of the query clause are consistent; and/or the presence of a gas in the gas,
when the difference between the query clauses is in the output columns and the output column of the first query clause is a subset of the output column of the second query clause, executing the second query clause and merging the output result set of the second query clause into the first query clause; and/or the presence of a gas in the gas,
when the difference between the query clauses is the output columns and the output columns of the first query clause and the second query clause are complementary sets, combining the first query clause and the second query clause; and/or the presence of a gas in the gas,
when the input sets of the query clauses are the same and the filtering conditions are different, extracting the common parts of the query clauses with the same input sets and different filtering conditions to form a new query clause, executing the new query clause, and respectively merging the output result sets of the new query clauses into the query clauses with the same input sets and different filtering conditions; and/or the presence of a gas in the gas,
splitting the query clauses with existence or relationship into two or more query clauses.
The metadata is stored in a metadata database; the method also includes managing the metadata repository:
deleting the invalid query clause when the expiration time corresponding to the metadata expires, or a user defined function UDF for generating the query clause result data expires, or one of input data sets for generating the query clause result data is changed, or an output set of the query clause is forcibly recycled by a user;
and/or the presence of a gas in the gas,
and scanning the metadata according to the originally set interval time, and deleting the invalid query clause information.
The deleting the stale query clause comprises: deleting all query clauses that depend on the execution result of the query clause;
and if the execution result corresponding to the query clause is an intermediate result, deleting the intermediate result from the distributed file storage system.
The metadata includes: the query node tree query method comprises the steps of three-tuple information corresponding to each query clause, failure time of the query clause corresponding to the three-tuple information, the number of times of the query clause being used, the level of the query clause in the query node tree and query result identification.
The triplets include an input set, an output set, and an operational instruction set to complete an input-to-output conversion.
The application provides a data query device, which at least comprises a metadata database, a decomposition module, a processing module and an output module, wherein,
the metadata base is used for storing metadata and comprises triple information corresponding to all the query clauses, the failure time of the query clauses corresponding to the triple information, the using times of the query clauses, the level of the query clauses in the query node tree and query result identification information;
the decomposition module is used for decomposing the query statement into more than one query clause;
the processing module is used for traversing the query clause, matching the query clause with the metadata and replacing the query clause with an execution result stored in the metadata; otherwise, executing the query clause and updating the metadata;
the output module is used for using the finally obtained execution result in the main query to obtain a query result;
the query clauses comprise sub-queries and main queries, the sub-queries are executed before the main queries, and execution results of the sub-queries are used for the main queries.
The device also includes: and the acquisition module is used for acquiring the query statement and verifying the validity and the correctness of the query statement.
The obtaining module is further configured to perform normalization and query optimization on the query statement.
The processing module is specifically configured to match the triple information of the query clause with the metadata:
when the triple information of the query clause exists in the metadata base, replacing the query clause with an execution result; resetting the expiration time of the metadata;
when the triple information of the query clause does not exist in the metadata base, executing the current query clause, persisting the execution result, and adding corresponding triple information in the metadata; setting the expiration time of the metadata and the level of the query clause at the query clause subtree.
And the processing module is further used for adding one to the used times of the record corresponding to the execution result when the triple information of the query clause is matched with the metadata.
The output module is specifically configured to iterate execution results of the sub-queries in the query clause and use the iterated execution results in the main query to obtain a query result.
A query clause optimization strategy is preset in the decomposition module;
and the decomposition module is also used for optimizing the query clauses obtained by decomposition according to an optimization strategy.
The processing module is further configured to manage the metadata base and remove the invalid query clause:
deleting the invalid query clause when the expiration time corresponding to the metadata expires, or the UDF generating the query clause result data expires, or one of the input data sets generating the query clause result data is changed, or the output set of the query clause is forcibly recovered by a user;
and/or the presence of a gas in the gas,
scanning metadata according to the originally set interval time, and deleting invalid query clauses;
deleting all query clauses which depend on the execution result of the query clause; and if the execution result corresponding to the query clause is an intermediate result, deleting the intermediate result from the distributed file storage system.
Compared with the prior art, the method comprises the steps of decomposing the query statement into one or more query clauses; traversing the query clause, matching the query clause with the metadata, replacing the query clause with an execution result stored in the metadata, and otherwise, executing the query clause and updating the metadata; and using the finally obtained execution result for the main query to obtain a query result. The method and the device manage the executed query clauses and the execution results thereof as metadata, directly replace the query clauses with the execution results stored in the metadata for the same query clauses in the current session or different sessions, and avoid repeated execution of the query clauses, thereby improving the query performance of the whole system to the maximum extent. Meanwhile, the query statement is decomposed into a plurality of query clauses, and compared with the existing optimization mode of the query statement level, the query optimization granularity of the data query method is finer, and the data query method is more suitable for large data query in data warehouse application, so that the query performance of the whole system is further improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.
FIG. 1 is a flow chart of a data query method of the present application;
fig. 2 is a schematic diagram of a structure of the data query device according to the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
In one exemplary configuration of the present application, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include non-transitory computer readable media (transient media), such as modulated data signals and carrier waves.
The steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
A query clause in this context means that when a query is a condition of another query, the query is called a sub-query, wherein the outer query is called a main query. The sub-queries are executed prior to the main query, with the results of the execution of the sub-queries being used for the main query. The sub-queries and the main query are collectively referred to herein as query clauses. A query clause having a defined input set Pin and output set Pout, and an operation instruction set phi for performing input-to-output conversionoperatorAnd the three parts are formed and called query clause triplets as tau ═ Pin, Pout, phioperatorAnd the query clause is used for uniquely identifying the query clause. The source of an entry for a query clause is either another query clause or from the Table or Table Partition, denoted as Pin ═ dom | τ, Table, Partition }.
Fig. 1 is a flowchart of a data query method of the present application, as shown in fig. 1, which mainly includes:
step 100: the query statement is decomposed into more than one query clause.
In this step, before querying, the query statement is decomposed into one or more query clauses, and the query clauses obtained by decomposition are used to construct a determined query clause operator tree. Each query clause is composed of a determined input set, an output set and an operation operator tree, whereinThe operator tree is composed of a set of operation instruction set phioperatorForming; a common operation Operator instruction is shown in table 1:
operate Operator instruction Means of
TableScanOperator Scan table data
SelectOperator Selecting an output column
FilterOperator Filtering input data
FileSinkOperator Building result data and outputting the result data to a file
ReduceSinkOperator Creating a message to be sent to a Reducer end<Key,Value>To pair
JoinOperator Multi-source data JOIN
GroupByOperator GroupBy statement
LimitOperator Limit statement
UnionOperator Union statements
TABLE 1
In this step, the specific implementation of the method for decomposing the query statement into the query clause belongs to the conventional technical means of those skilled in the art, and the specific implementation thereof is not used to limit the scope of the present application. This step emphasizes that the query statement is decomposed into one or more query clauses prior to the query. Therefore, compared with the existing query mode positioned in a query statement layer, the query granularity of the data query method is finer, and the data query method is more suitable for large data query in data warehouse application.
Step 100 is preceded by: obtaining the query statement, verifying the validity and the correctness of the query statement, and further normalizing and optimizing the obtained query statement. The purpose of normalization and query optimization is normalization, that is, query statements implementing the same function are expressed in a common optimal manner according to known set formatting rules and optimization strategies. Common optimization strategies include column reduction, partition reduction, predicate push-down, and the like. The detailed implementation is a routine matter for those skilled in the art, and is not described herein nor used to limit the scope of the present application.
Step 101: traversing the query clause, matching the query clause with the metadata, and replacing the query clause with an execution result stored in the metadata; otherwise, the query clause is executed and the metadata is updated.
The metadata is stored in a preset metadata base, and comprises triple information corresponding to each query clause, and also comprises information such as the failure time of the query clause corresponding to the triple information, the number of times of using the query clause, the level of the query clause in a query node tree, query result identification and the like. The failure time is used for clearing metadata when the failure time is up, and recovering resources occupied by the query clauses which are not used for a long time; the number of used times is used for optimizing the execution priority of the query clause from the perspective of the whole system, the number of used times and the number (hierarchy) of query nodes can be used as the weight of priority calculation, for example, the query clauses with higher priorities share more calculation resources and priority in time when the number of used times is more, the priority is higher, and the like; the level of the query clause is the maximum value of the levels of all query clauses in the input set plus 1, and if the input set is a table or a table partition, the level of the query clause is 0; the query result identification can be understood as a table name, and if the query result identification is a query clause root node and the table name is specified by a user, the table name is used; otherwise, the table name of a temporary table is selected, which is not transparent to the user, and the user does not sense the existence of the temporary table. The data of the query clause result set can be located according to the table name.
The method specifically comprises the following steps: the query sub-sentence subtree determined in step 100 is traversed, e.g., in a subsequent order, matching the metadata with the triplet information of the query sub-sentence,
if the triple information of the current query clause exists in the metadata base, the triple information of the query clause is matched with the metadata, at the moment, the query clause is indicated to be executed before, and the data of the execution result is effective, then the query clause is directly replaced by the execution result stored in the metadata, namely the table related information of the corresponding record; resetting the expiration time of the metadata according to the existing session parameter settings (if the session parameter settings do not exist, a default value such as 3 days may be used), and further, the number of times the record is used may be increased by one;
if the triple information of the current query clause does not exist in the metadata base, the triple information of the query clause is not matched with the metadata, at the moment, the query clause is not executed before, then the current query clause is executed, the persistent execution result is to store the execution result in the distributed file storage system, and the corresponding triple information is added in the metadata; the expiry time of this metadata is set according to the existing session parameter settings (if no session parameter settings exist, a default value such as 3 days may be used), and the depth (level) of the query clause in the query clause subtree. It should be noted that, if the query clause is a sub-query, the name of the output data set is identified by using TMP _ as a prefix, and a specific name can be randomly generated to ensure that the name is globally unique; if the query clause is a main query, if the name of the defined output table exists, the name is used, and the failure time is set as never failure (the table created by the corresponding user is invalid only when the user deletes the table); otherwise, the setting method is consistent with the sub-query.
Step 102: and using the finally obtained execution result for the main query to obtain a query result.
After all the query clauses are traversed, the execution results of the sub-queries in the query clauses are iterated, namely the execution results of the query clause root node query clauses are used for main query, and finally the query results are obtained.
To sum up, the present application specifically includes: decomposing the query statement into one or more query clauses and outputting a query clause operator tree; traversing the query clause calculation tree, if the query clause is matched with the metadata, replacing the matched query clause with an execution result stored in the metadata; executing the unmatched query clauses and updating the execution result information to the metadata; and using the execution result of the query clause sub-tree root node in the main query to obtain a query result.
As can be seen from the matching process in step 101 of the present application, the present application manages the executed query clauses and the execution results thereof as metadata, and directly replaces the query clauses with the execution results stored in the metadata database for the same query clauses in the current session or different sessions, thereby avoiding repeated execution of the query clauses and further improving the query performance of the entire system to a greater extent. Meanwhile, the query statement is decomposed into a plurality of query clauses before query, and compared with the optimization of the existing query statement level, the query optimization granularity of the data query method is finer, and the data query method is more suitable for large data query in data warehouse application, so that the query performance of the whole system is further improved.
Further, in step 100 of the present application, when the query statement is decomposed into a plurality of query clauses, the method may further include: the query clause is optimized. Unnecessary query clauses are eliminated by further optimizing the query clauses, so that unnecessary queries are reduced, and the query performance of the whole system is further improved.
The optimizing the query clause specifically includes: and presetting a query clause optimization strategy, optimizing the query clauses obtained by decomposition according to the optimization strategy, and eliminating unnecessary query clauses. The optimization strategy may include, but is not limited to, the following:
1) the input set and the output set of the query clause are consistent (Pin ═ Pout), namely the query clause is directly equal to the input set, the result data set of the query clause has no help for optimization, and the result data set does not need to be synchronously added into a metadata base of the query clause, namely the query clause is deleted;
2) the only difference between query clauses is in the output columns, and the output column of one query clause is a subset of the output column of another query clause, such as:
query clause a) is: SELECT name FROM employee title ═ Teamlead';
query clause b) is: SELECT FROM element thread tier title ═ teamlead';
the query clause a completes query on the name field of the employee table, and records meeting the title field and having the value equal to the TeamLeader are taken out; and the query clause b completes the query of all fields of the employee table, and takes out the record of which the value of the field conforming to the title is equal to the TeamLeader.
That is, the data result set for query clause a) is a subset of query clause b); in this case, the optimization method is as follows: query clause b) is executed first, assuming the output result set is: TMP _20130318_012345, then, query clause a) can be optimized as: SELECT name FROM TMP _20130318_ 012345. This case is equivalent to incorporating the output result set of query clause b) into query clause a), thereby simplifying query clause a) while also deleting query clause b).
3) The query clauses differ only in the output columns, and the output column of one query clause is complementary to the other output column, such as:
query clause a) is: SELECT name FROM employee title ═ Teamlead';
the query statement b) is: SELECT addr FROM empty tier title ═ TeamLeader';
wherein, the query clause a) completes the query of the name field of the employee table, and takes out all records of which the value conforming to the title field is equal to the TeamLeader; and b) finishing the query of the addr field of the employee table, and taking out all records of which the values meeting the title field are equal to the TeamLeader.
In this case, the optimization method is as follows: and merging the two query clauses, wherein the merged query clause is a query clause c): SELECT name, addr FROM employee title ═ 11'.
4) The two query clause input sets are the same, and the filtering conditions are different, such as:
query clause a) is: SELECT name FROM employee title ═ Teamlead';
query clause b) is: SELECT addr FROM WHERE tier title ═ TeamLeader 'and name ═ Smich';
wherein, the query clause a) completes the query of the name field of the employee table, and takes out all records of which the value conforming to the title field is equal to the TeamLeader; a query clause b) completes the query of the addr field of the employee table, and takes out all records of which the value of the title field is equal to the TeamLeader and the value of the name field is equal to Smich;
in this case, the optimization method is as follows: extracting the common part of the query clause constitutes query clause c), such as:
query clause c) is: SELECT name, addr FROM employee name title ═ TeamLeader';
and c) finishing the query of the employee table name and addr fields, and taking out all records of which the values meeting the title fields are equal to the TeamLeader.
Assume that the output result set for query clause c) is: TMP _20130318_222222, then the query clause can be simplified according to the resulting output result set, specifically:
query clause a) may be optimized as: SELECT name FROM TMP _20130318_ 222222;
optimizing query clause b) as: SELECT addr FROM TMP _20130318_222222WHERE name is 'Smich';
5) one query clause with existence or relationship is divided into two or more query clauses, so that the hit rate of the common query clause can be improved. Such as for
Query clause a): SELECT FROM element WHERE title ═ TeamLeader 'or title ═ Manager';
wherein, the query clause a) completes the query of all fields of the employee table, and takes out all records of which the values meeting the title fields are equal to the TeamLeader or Manager.
The method can be split into:
query clause b): SELECT FROM element thread tier title ═ teamlead';
query clause c): SELECT FROM EMPLOYEE WHERE title'
Wherein, the query clause b) completes the query of all fields of the employee table, and takes out all records of which the values meeting the title fields are equal to the TeamLeader; and c) finishing the query of all fields of the employee table, and taking out all records of which the values of the fields conforming to the title are equal to the Manager.
The method of the present application further comprises: and managing the metadata base, namely scanning the metadata at intervals and clearing out the expired query clauses. Wherein, the interval time can be preset to days or hours;
in addition to the normal situation of clearing the corresponding query clause, such as the expiry of expiry time, there are several situations that can cause the result data of the query clause to expire, such as: a User Defined Function (UDF) generating result data of a certain query clause is expired; for another example: generating one of the input data sets of the result data of a certain query clause, wherein the data change occurs; the following steps are repeated: the output set of a certain query clause is forcibly recycled by the user. When the query clause is invalid, it means that all query clauses depending on the execution result of the query clause are simultaneously invalid; and if the execution result corresponding to the query clause is an intermediate result, deleting the intermediate result from the distributed file storage system at the same time.
The present application also provides a data query apparatus, as shown in fig. 2, including at least a metadata database, a decomposition module, a processing module, and an output module, wherein,
and the metadata base is used for storing metadata, and comprises triple information corresponding to each query clause, and also comprises information such as the failure time of the query clause corresponding to the triple information, the number of times of using the query clause, the level of the query clause in the query node tree, the query result identification and the like.
The decomposition module is used for decomposing the query statement into one or more query clauses;
the processing module is used for traversing the query clause, matching the query clause with the metadata and replacing the query clause with an execution result stored in the metadata; otherwise, executing the query clause and updating the metadata;
and the output module is used for using the finally obtained execution result in the main query to obtain a query result.
The data query device of the application further comprises: the acquisition module is used for acquiring the query statement and verifying the validity and the correctness of the query statement; and further, normalizing and optimizing the obtained query statement.
Wherein,
the processing module is specifically configured to match the triple information of the query clause with the metadata:
when the triple information of the query clause exists in the metadata base, replacing the query clause with an execution result, namely table related information of a corresponding record; resetting the expiration time of the metadata, further, adding one to the number of times the record is used;
when the triple information of the query clause does not exist in the metadata base, executing the current query clause, persisting the execution result, and adding corresponding triple information in the metadata; the expiration time of the metadata is set, and the depth (level) of the query clause in the query clause subtree.
And the output module is specifically used for iterating the execution result of the sub-query in the query clause, using the iteration result to carry out the main query, and finally obtaining a query result.
Besides, a query clause optimization strategy is preset in the decomposition module, and the decomposition module is further used for optimizing the query clauses obtained through decomposition according to the optimization strategy.
The processing module is also used for managing the metadata base and eliminating invalid query clauses:
deleting the invalid query clause when the expiration time corresponding to the metadata expires, or the UDF generating the query clause result data expires, or one of the input data sets generating the query clause result data is changed, or the output set of the query clause is forcibly recovered by a user;
and/or the presence of a gas in the gas,
scanning metadata according to the originally set interval time, and deleting invalid query clauses;
deleting all query clauses which depend on the execution result of the query clause; and if the execution result corresponding to the query clause is an intermediate result, deleting the intermediate result from the distributed file storage system.
According to the device, the executed query clauses and the execution results of the query clauses are managed as metadata, and the query clauses are directly replaced by the execution results stored in the metadata database for the same query clauses in the current session or different sessions, so that repeated execution of the query clauses is avoided, and the query performance of the whole system is improved to the maximum extent. Meanwhile, the query statement is decomposed into a plurality of query clauses before query, and compared with the optimization of the existing query statement level, the query optimization granularity of the data query method is finer, and the data query method is more suitable for large data query in data warehouse application, so that the query performance of the whole system is further improved.
In addition, unnecessary query clauses are removed through further optimization of the query clauses, and unnecessary queries are reduced, so that the query performance of the whole system is further improved.
It will be appreciated by those skilled in the art that the components of the apparatus and steps of the method provided in the embodiments of the present application described above may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented in program code executable by a computing device. Thus, they may be stored in a memory device for execution by a computing device, or they may be separately fabricated as individual integrated circuit modules, or multiple modules or steps thereof may be fabricated as a single integrated circuit module for implementation. Thus, the present application is not limited to any specific combination of hardware and software.
Although the embodiments disclosed in the present application are described above, the descriptions are only for the convenience of understanding the present application, and are not intended to limit the present application. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims (17)

1. A method for querying data, comprising:
presetting a query clause optimization strategy, decomposing a query clause into more than one query clause, and optimizing the decomposed query clauses according to the optimization strategy;
traversing the query clause, matching the query clause with the metadata, replacing the query clause with an execution result stored in the metadata, and otherwise, executing the query clause and updating the metadata;
using the finally obtained execution result for main query to obtain a query result;
wherein the query clause optimization strategy comprises:
deleting the query clause when the input set and the output set of the query clause are consistent; and/or the presence of a gas in the gas,
when the difference between the query clauses is in the output columns and the output column of the first query clause is a subset of the output column of the second query clause, executing the second query clause and merging the output result set of the second query clause into the first query clause; and/or the presence of a gas in the gas,
when the difference between the query clauses is the output columns and the output columns of the first query clause and the second query clause are complementary sets, combining the first query clause and the second query clause; and/or the presence of a gas in the gas,
when the input sets of the query clauses are the same and the filtering conditions are different, extracting the common parts of the query clauses with the same input sets and different filtering conditions to form a new query clause, executing the new query clause, and respectively merging the output result sets of the new query clauses into the query clauses with the same input sets and different filtering conditions; and/or the presence of a gas in the gas,
and splitting the query clause with the existence or relationship into more than two query clauses.
2. The data query method of claim 1, further comprising, before the method: and acquiring the query statement, and verifying the validity and the correctness of the query statement.
3. The data query method of claim 2, further comprising: and carrying out normalization and query optimization on the query statement.
4. The data query method according to any one of claims 1 to 3, wherein traversing the query clause specifically includes:
performing subsequent traversal on a query clause sub-tree formed by the query clauses, matching the triple information of the query clauses with the metadata,
if the triple information of the query clause exists in the metadata database, matching the triple information of the query clause with the metadata, and replacing the query clause with a corresponding execution result; resetting the expiration time of the metadata;
if the triple information of the query clause does not exist in the metadata base, the triple information of the query clause is not matched with the metadata, the query clause is executed, the execution result is persisted, and the corresponding triple information is added into the metadata; setting the expiration time of the metadata and the level of the query clause at the query clause subtree.
5. The data query method of claim 4, wherein when the triplet information of the query clause matches the metadata, the method further comprises: and adding one to the number of times of using the record corresponding to the execution result.
6. The data query method of claim 1, wherein the optimizing the decomposed query clause according to the query clause optimization strategy comprises: and eliminating unnecessary query clauses.
7. The data query method according to any one of claims 1 to 3, wherein the metadata is stored in a metadata database; the method also includes managing the metadata repository:
deleting the invalid query clause when the expiration time corresponding to the metadata expires, or a user defined function UDF for generating the query clause result data expires, or one of input data sets for generating the query clause result data is changed, or an output set of the query clause is forcibly recycled by a user;
and/or the presence of a gas in the gas,
and scanning the metadata according to the originally set interval time, and deleting the invalid query clause information.
8. The data query method of claim 7, wherein the deleting the stale query clause comprises: deleting all query clauses that depend on the execution result of the query clause;
and if the execution result corresponding to the query clause is an intermediate result, deleting the intermediate result from the distributed file storage system.
9. The data query method of claim 8, wherein the metadata comprises: the query node tree query method comprises the steps of three-tuple information corresponding to each query clause, failure time of the query clause corresponding to the three-tuple information, the number of times of the query clause being used, the level of the query clause in the query node tree and query result identification.
10. The data query method of claim 9, wherein the triples include an input set, an output set, and an operation instruction set to complete an input-to-output conversion.
11. A data query device is characterized by at least comprising a metadata database, a decomposition module, a processing module and an output module, wherein,
the metadata base is used for storing metadata and comprises triple information corresponding to all the query clauses, the failure time of the query clauses corresponding to the triple information, the using times of the query clauses, the level of the query clauses in the query node tree and query result identification information;
the decomposition module is used for presetting a query clause optimization strategy, decomposing the query clause into more than one query clause and optimizing the query clause obtained by decomposition according to the optimization strategy;
the processing module is used for traversing the query clause, matching the query clause with the metadata and replacing the query clause with an execution result stored in the metadata; otherwise, executing the query clause and updating the metadata;
the output module is used for using the finally obtained execution result in the main query to obtain a query result;
wherein the query clause optimization strategy comprises:
deleting the query clause when the input set and the output set of the query clause are consistent; and/or the presence of a gas in the gas,
when the difference between the query clauses is in the output columns and the output column of the first query clause is a subset of the output column of the second query clause, executing the second query clause and merging the output result set of the second query clause into the first query clause; and/or the presence of a gas in the gas,
when the difference between the query clauses is the output columns and the output columns of the first query clause and the second query clause are complementary sets, combining the first query clause and the second query clause; and/or the presence of a gas in the gas,
when the input sets of the query clauses are the same and the filtering conditions are different, extracting the common parts of the query clauses with the same input sets and different filtering conditions to form a new query clause, executing the new query clause, and respectively merging the output result sets of the new query clauses into the query clauses with the same input sets and different filtering conditions; and/or the presence of a gas in the gas,
and splitting the query clause with the existence or relationship into more than two query clauses.
12. The data query device of claim 11, further comprising: and the acquisition module is used for acquiring the query statement and verifying the validity and the correctness of the query statement.
13. The data query device of claim 12, wherein the obtaining module is further configured to perform normalization and query optimization on the query statement.
14. The data query device according to any one of claims 11 to 13, wherein the processing module is specifically configured to perform matching using the triplet information of the query clause and the metadata:
when the triple information of the query clause exists in the metadata base, replacing the query clause with an execution result; resetting the expiration time of the metadata;
when the triple information of the query clause does not exist in the metadata base, executing the current query clause, persisting the execution result, and adding corresponding triple information in the metadata; setting the expiration time of the metadata and the level of the query clause at the query clause subtree.
15. The apparatus according to claim 14, wherein the processing module is further configured to, when the triplet information of the query clause matches the metadata, increment a number of times that the record corresponding to the execution result is used by one.
16. The data query device according to any one of claims 11 to 13, wherein the output module is specifically configured to iterate execution results of sub-queries in the query clause and use the iterated execution results for main queries to obtain query results.
17. The data query device according to any one of claims 11 to 13, wherein the processing module is further configured to manage a metadata database, and clear an invalid query clause:
deleting the invalid query clause when the expiration time corresponding to the metadata expires, or the UDF generating the query clause result data expires, or one of the input data sets generating the query clause result data is changed, or the output set of the query clause is forcibly recovered by a user;
and/or the presence of a gas in the gas,
scanning metadata according to the originally set interval time, and deleting invalid query clauses;
deleting all query clauses which depend on the execution result of the query clause; and if the execution result corresponding to the query clause is an intermediate result, deleting the intermediate result from the distributed file storage system.
CN201310146187.1A 2013-04-24 2013-04-24 A kind of data query method and device Active CN104123288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310146187.1A CN104123288B (en) 2013-04-24 2013-04-24 A kind of data query method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310146187.1A CN104123288B (en) 2013-04-24 2013-04-24 A kind of data query method and device

Publications (2)

Publication Number Publication Date
CN104123288A CN104123288A (en) 2014-10-29
CN104123288B true CN104123288B (en) 2018-06-01

Family

ID=51768701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310146187.1A Active CN104123288B (en) 2013-04-24 2013-04-24 A kind of data query method and device

Country Status (1)

Country Link
CN (1) CN104123288B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462317A (en) * 2014-12-01 2015-03-25 苏州朗米尔照明科技有限公司 Data screening method based on tag
CN104866608B (en) * 2015-06-05 2018-01-09 中国人民大学 Enquiring and optimizing method based on join index in a kind of data warehouse
CN106325832A (en) * 2015-06-18 2017-01-11 阿里巴巴集团控股有限公司 Data generation method and device
CN106411826B (en) * 2015-08-03 2019-06-18 阿里巴巴集团控股有限公司 A kind of method and apparatus of data access
CN106598963B (en) * 2015-10-14 2021-08-10 五八同城信息技术有限公司 Query statement optimization method and device
US10678792B2 (en) * 2015-10-23 2020-06-09 Oracle International Corporation Parallel execution of queries with a recursive clause
CN106776639B (en) * 2015-11-24 2020-04-10 腾讯科技(深圳)有限公司 Data processing method and data processing device based on structured query language
CN107239710B (en) * 2016-03-29 2020-06-16 北京明略软件系统有限公司 Database permission implementation method and system
CN105930407B (en) * 2016-04-18 2019-05-17 北京思特奇信息技术股份有限公司 A kind of inter-library relation query method of distributed data base and system
CN106445645B (en) * 2016-09-06 2019-11-26 北京百度网讯科技有限公司 Method and apparatus for executing distributed computing task
CN108089915B (en) * 2016-11-22 2021-10-15 北京京东尚科信息技术有限公司 Method and system for business control processing based on message queue
CN106802927A (en) * 2016-12-22 2017-06-06 上海大汉三通数据通信有限公司 A kind of date storage method and querying method
WO2018218504A1 (en) * 2017-05-31 2018-12-06 华为技术有限公司 Method and device for data query
CN107315790B (en) * 2017-06-14 2021-07-06 腾讯科技(深圳)有限公司 Optimization method and device for non-relevant sub-queries
CN110019394A (en) * 2017-11-28 2019-07-16 阿里巴巴集团控股有限公司 Data query method and apparatus
CN108268589B (en) * 2017-12-05 2019-07-23 北京百度网讯科技有限公司 Aggregate query method, apparatus, computer equipment and the readable medium of time series data
CN108388606B (en) * 2018-02-06 2022-01-25 福建星瑞格软件有限公司 Method for checking base table field names in Sql sentences and computer equipment
CN110297710B (en) * 2018-03-23 2022-05-10 畅想科技有限公司 Common priority information for multiple resource arbitration
CN110727672A (en) * 2018-06-28 2020-01-24 北京京东尚科信息技术有限公司 Data mapping relation query method and device, electronic equipment and readable medium
CN109308300B (en) * 2018-09-27 2021-11-12 上海达梦数据库有限公司 Logic operation processing method and device, conversion plug-in and storage medium
CN110968615B (en) * 2018-09-30 2023-05-23 北京国双科技有限公司 Data query method and device
CN111125155B (en) * 2018-10-31 2023-10-27 北京国双科技有限公司 Access path-based data query method, device, storage medium and processor
CN109508339B (en) * 2018-11-12 2020-10-09 上海达梦数据库有限公司 Data query method and device, terminal equipment and storage medium
CN111488326B (en) * 2019-01-25 2023-04-07 阿里巴巴集团控股有限公司 MapReduce job execution method, node device and storage medium
CN111597214A (en) * 2019-02-21 2020-08-28 北京京东尚科信息技术有限公司 Method, apparatus, computer system and medium for data query
CN109947804B (en) * 2019-03-20 2021-04-06 上海达梦数据库有限公司 Data set query optimization method and device, server and storage medium
CN110008244A (en) * 2019-03-29 2019-07-12 国家计算机网络与信息安全管理中心 A kind of data query method and data query device
CN110109946B (en) * 2019-04-23 2021-09-14 北京小米移动软件有限公司 Question query method and device
CN110096489A (en) * 2019-04-30 2019-08-06 阿里巴巴集团控股有限公司 A kind of data query method, system, device and electronic equipment
CN113010539A (en) * 2019-12-20 2021-06-22 北京奇艺世纪科技有限公司 Data processing method and device
CN111159218B (en) * 2019-12-31 2023-10-31 中科曙光国际信息产业有限公司 Data processing method, device and readable storage medium
CN111694943B (en) * 2020-05-29 2021-03-23 刘韩 Accurate information query method and system
CN111949684B (en) * 2020-08-06 2023-08-11 上海达梦数据库有限公司 SQL sentence execution method, device, equipment and storage medium
CN114637759B (en) * 2020-12-16 2023-04-18 金篆信科有限责任公司 Data query method, electronic device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100413784B1 (en) * 1997-04-29 2004-02-14 삼성전자주식회사 Device for translating english-korean having clause unit separator and method for the same
CN101075255A (en) * 2007-06-08 2007-11-21 北京神舟航天软件技术有限公司 Method for removing SQL query statement constant condition
CN102045652A (en) * 2009-10-21 2011-05-04 深圳市彩讯科技有限公司 Garbage short message interception method based on characteristic similarity
CN102073719A (en) * 2011-01-10 2011-05-25 复旦大学 Interval coding-based geography markup language (GML) document index method
CN102789475A (en) * 2012-05-08 2012-11-21 华为技术有限公司 Data synchronization method and device for result set cache at server end of database
CN102799622A (en) * 2012-06-19 2012-11-28 北京大学 Distributed structured query language (SQL) query method based on MapReduce expansion framework

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4215418B2 (en) * 2001-08-24 2009-01-28 インターナショナル・ビジネス・マシーンズ・コーポレーション Word prediction method, speech recognition method, speech recognition apparatus and program using the method
US7461052B2 (en) * 2004-12-06 2008-12-02 International Business Machines Corporation Abstract query plan
CN102682047A (en) * 2011-10-18 2012-09-19 国网电力科学研究院 Mixed structured query language (SQL) injection protection method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100413784B1 (en) * 1997-04-29 2004-02-14 삼성전자주식회사 Device for translating english-korean having clause unit separator and method for the same
CN101075255A (en) * 2007-06-08 2007-11-21 北京神舟航天软件技术有限公司 Method for removing SQL query statement constant condition
CN102045652A (en) * 2009-10-21 2011-05-04 深圳市彩讯科技有限公司 Garbage short message interception method based on characteristic similarity
CN102073719A (en) * 2011-01-10 2011-05-25 复旦大学 Interval coding-based geography markup language (GML) document index method
CN102789475A (en) * 2012-05-08 2012-11-21 华为技术有限公司 Data synchronization method and device for result set cache at server end of database
CN102799622A (en) * 2012-06-19 2012-11-28 北京大学 Distributed structured query language (SQL) query method based on MapReduce expansion framework

Also Published As

Publication number Publication date
CN104123288A (en) 2014-10-29

Similar Documents

Publication Publication Date Title
CN104123288B (en) A kind of data query method and device
CN107122443B (en) A kind of distributed full-text search system and method based on Spark SQL
JP6617117B2 (en) Scalable analysis platform for semi-structured data
Khan et al. SQL Database with physical database tuning technique and NoSQL graph database comparisons
Li et al. An integration approach of hybrid databases based on SQL in cloud computing environment
RU2507574C2 (en) Page-by-page breakdown of hierarchical data
CN108052635A (en) A kind of heterogeneous data source unifies conjunctive query method
DE202012013469U1 (en) Data Processing Service
EP3867772B1 (en) Distributed join index for shared-nothing and log-structured databases
DE112011101200T5 (en) Column-oriented memory representations of data records
CN105630881A (en) Data storage method and query method for RDF (Resource Description Framework)
CN107291770B (en) A method and device for querying massive data in a distributed system
CN107491476B (en) Data model conversion and query analysis method suitable for various big data management systems
US20180150544A1 (en) Synchronized updates across multiple database partitions
Das et al. A study on big data integration with data warehouse
US20230418808A1 (en) Maintaining data separation for data consolidated from multiple data artifact instances
CN113946580B (en) Massive heterogeneous log data retrieval middleware
CN113468209A (en) High-speed memory database access method for power grid monitoring system
AGGOUNE et al. A method for transforming object-relational to document-oriented databases
Mittal et al. Efficient random data accessing in MapReduce
Hassan Storage structures in the era of big data: from data warehouse to lakehouse
Arputhamary et al. A review on big data integration
Pothuganti Big data analytics: Hadoop-Map reduce & NoSQL databases
Albahli et al. TripleFCA: FCA-based approach to enhance semantic web data management
Suganya et al. Efficient fragmentation and allocation in distributed databases

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211109

Address after: Room 507, floor 5, building 3, No. 969, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee after: ZHEJIANG TMALL TECHNOLOGY Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Patentee before: ALIBABA GROUP HOLDING Ltd.

TR01 Transfer of patent right
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载