CN104123288B

CN104123288B - A kind of data query method and device

Info

Publication number: CN104123288B
Application number: CN201310146187.1A
Authority: CN
Inventors: 乔平; 夏晨; 傅跃兵
Original assignee: Alibaba Group Holding Ltd
Current assignee: Zhejiang Tmall Technology Co Ltd
Priority date: 2013-04-24
Filing date: 2013-04-24
Publication date: 2018-06-01
Anticipated expiration: 2033-04-24
Also published as: CN104123288A

Abstract

This application discloses a kind of data query method and device, including query statement is decomposed into more than one inquiry clause；Traversal queries clause, inquiry clause and meta data match are replaced with the implementing result preserved in metadata；Otherwise inquiry clause and more new metadata are performed；Final implementing result is obtained into query result for main inquiry.The inquiry clause performed and its implementing result are managed by the application as metadata, for inquiry clause identical in current sessions or different sessions, inquiry clause is directly replaced with to the implementing result preserved in metadata, repeating for inquiry clause is avoided, so as to improve the query performance of total system to a greater extent.Meanwhile compared with the optimization of existing query statement level, the query optimization granularity more refinement of the application, the big data inquiry being more suitable in data warehouse applications, so as to further improve the query performance of total system.

Description

Data query method and device

Technical Field

The present application relates to data storage and analysis technologies, and in particular, to a method and an apparatus for querying data in a data warehouse for mass data with less reads and writes.

Background

In many application fields such as scientific research, computer simulation, internet application, electronic commerce, and the like, the amount of data is increasing at an extremely fast rate. The explosion of information in the big data era, especially the explosive growth of unstructured data, is deeply affecting the entire data storage and analysis field.

In order to meet emerging business requirements, traditional data processing modes are gradually abandoned, and new modes are tried instead to access, process and analyze various types of data including unstructured data. In the field of data storage and analysis, MapReduce is undoubtedly the most interesting new generation technology, and developers can develop programs capable of processing massive unstructured data in parallel across processor distributed clusters or independent computers by using a MapReduce programming framework. The MapReduce is a programming model and is used for parallel operation of large-scale data, Map represents mapping, and Reduce represents simplification.

In the big data era, the traditional Structured Query Language (SQL) has not been able to satisfy all business requirements, and although mapreduce is widely applied, the programming interface exposed by the mapreduce framework is still low-level, time-consuming to develop, and code is difficult to reuse. And then, turning to a program which can convert SQL written by a user into a corresponding MapReduce program, and executing the tasks by using a MapReduce framework to process mass data on a Hadoop Distributed File System (HDFS). The most widely used at present is Hive, which is contributed by Facebook. The Hive based on the MapReduce has strong advantages in the aspects of expansibility and fault tolerance. The Hive is a data warehouse tool based on Hadoop, can map the structured data file into a database table, provides a complete SQL query function, and can convert the SQL statement into a MapReduce task for operation.

In the enterprise data mining process, the mining tasks usually executed every day are hundreds of thousands, even tens of thousands, so that the problem of repeated calculation in different mining tasks is inevitable, such as: engineer a needs to screen a batch of data of condition W from table S1 and table S2 and then combine with table S3 to regenerate the data desired by engineer a. Meanwhile, engineer B also needs to select the same data from tables S1 and S2 under the same condition W, and combine the same data with table X to reproduce the data desired by engineer B. It is not usually known between these two engineers that there are some requirements that are partially identical to each other, so in a production task the same computation will be done twice and the same common query information is repeatedly accessed. Clearly, this reduces the query performance of the overall system.

Disclosure of Invention

In order to solve the technical problem, the application provides a data query method and a data query device, which can avoid repeated operation of public query information, thereby improving the query performance of the whole system.

To achieve the object of the present application, the present application provides a data query method, including: decomposing the query statement into more than one query clause;

traversing the query clause, matching the query clause with the metadata, replacing the query clause with an execution result stored in the metadata, and otherwise, executing the query clause and updating the metadata;

using the finally obtained execution result for main query to obtain a query result;

the query clauses comprise sub-queries and main queries, the sub-queries are executed before the main queries, and execution results of the sub-queries are used for the main queries.

The method also comprises the following steps: and acquiring the query statement, and verifying the validity and the correctness of the query statement.

The method further comprises the following steps: and carrying out normalization and query optimization on the query statement.

The traversal query clause specifically includes:

performing subsequent traversal on a query clause sub-tree formed by the query clauses, matching the triple information of the query clauses with the metadata,

if the triple information of the query clause exists in the metadata database, matching the triple information of the query clause with the metadata, and replacing the query clause with a corresponding execution result; resetting the expiration time of the metadata;

if the triple information of the query clause does not exist in the metadata base, the triple information of the query clause is not matched with the metadata, the query clause is executed, the execution result is persisted, and the corresponding triple information is added into the metadata; setting the expiration time of the metadata and the level of the query clause at the query clause subtree.

When the triple information of the query clause is matched with the metadata, the method further comprises the following steps: and adding one to the number of times of using the record corresponding to the execution result.

In decomposing the query statement into a plurality of query clauses, the method further comprises: optimizing the query clause.

The method further comprises the following steps: presetting an optimization strategy of the query clause;

the optimized query clause includes: and optimizing the query clauses obtained by decomposition according to the query clause optimization strategy, and eliminating unnecessary query clauses.

The query clause optimization strategy comprises the following steps:

deleting the query clause when the input set and the output set of the query clause are consistent; and/or the presence of a gas in the gas,

when the difference between the query clauses is in the output columns and the output column of the first query clause is a subset of the output column of the second query clause, executing the second query clause and merging the output result set of the second query clause into the first query clause; and/or the presence of a gas in the gas,

when the difference between the query clauses is the output columns and the output columns of the first query clause and the second query clause are complementary sets, combining the first query clause and the second query clause; and/or the presence of a gas in the gas,

when the input sets of the query clauses are the same and the filtering conditions are different, extracting the common parts of the query clauses with the same input sets and different filtering conditions to form a new query clause, executing the new query clause, and respectively merging the output result sets of the new query clauses into the query clauses with the same input sets and different filtering conditions; and/or the presence of a gas in the gas,

splitting the query clauses with existence or relationship into two or more query clauses.

The metadata is stored in a metadata database; the method also includes managing the metadata repository:

deleting the invalid query clause when the expiration time corresponding to the metadata expires, or a user defined function UDF for generating the query clause result data expires, or one of input data sets for generating the query clause result data is changed, or an output set of the query clause is forcibly recycled by a user;

and/or the presence of a gas in the gas,

and scanning the metadata according to the originally set interval time, and deleting the invalid query clause information.

The deleting the stale query clause comprises: deleting all query clauses that depend on the execution result of the query clause;

and if the execution result corresponding to the query clause is an intermediate result, deleting the intermediate result from the distributed file storage system.

The metadata includes: the query node tree query method comprises the steps of three-tuple information corresponding to each query clause, failure time of the query clause corresponding to the three-tuple information, the number of times of the query clause being used, the level of the query clause in the query node tree and query result identification.

The triplets include an input set, an output set, and an operational instruction set to complete an input-to-output conversion.

The application provides a data query device, which at least comprises a metadata database, a decomposition module, a processing module and an output module, wherein,

the metadata base is used for storing metadata and comprises triple information corresponding to all the query clauses, the failure time of the query clauses corresponding to the triple information, the using times of the query clauses, the level of the query clauses in the query node tree and query result identification information;

the decomposition module is used for decomposing the query statement into more than one query clause;

the processing module is used for traversing the query clause, matching the query clause with the metadata and replacing the query clause with an execution result stored in the metadata; otherwise, executing the query clause and updating the metadata;

the output module is used for using the finally obtained execution result in the main query to obtain a query result;

The device also includes: and the acquisition module is used for acquiring the query statement and verifying the validity and the correctness of the query statement.

The obtaining module is further configured to perform normalization and query optimization on the query statement.

The processing module is specifically configured to match the triple information of the query clause with the metadata:

when the triple information of the query clause exists in the metadata base, replacing the query clause with an execution result; resetting the expiration time of the metadata;

when the triple information of the query clause does not exist in the metadata base, executing the current query clause, persisting the execution result, and adding corresponding triple information in the metadata; setting the expiration time of the metadata and the level of the query clause at the query clause subtree.

And the processing module is further used for adding one to the used times of the record corresponding to the execution result when the triple information of the query clause is matched with the metadata.

The output module is specifically configured to iterate execution results of the sub-queries in the query clause and use the iterated execution results in the main query to obtain a query result.

A query clause optimization strategy is preset in the decomposition module;

and the decomposition module is also used for optimizing the query clauses obtained by decomposition according to an optimization strategy.

The processing module is further configured to manage the metadata base and remove the invalid query clause:

deleting the invalid query clause when the expiration time corresponding to the metadata expires, or the UDF generating the query clause result data expires, or one of the input data sets generating the query clause result data is changed, or the output set of the query clause is forcibly recovered by a user;

and/or the presence of a gas in the gas,

scanning metadata according to the originally set interval time, and deleting invalid query clauses;

deleting all query clauses which depend on the execution result of the query clause; and if the execution result corresponding to the query clause is an intermediate result, deleting the intermediate result from the distributed file storage system.

Compared with the prior art, the method comprises the steps of decomposing the query statement into one or more query clauses; traversing the query clause, matching the query clause with the metadata, replacing the query clause with an execution result stored in the metadata, and otherwise, executing the query clause and updating the metadata; and using the finally obtained execution result for the main query to obtain a query result. The method and the device manage the executed query clauses and the execution results thereof as metadata, directly replace the query clauses with the execution results stored in the metadata for the same query clauses in the current session or different sessions, and avoid repeated execution of the query clauses, thereby improving the query performance of the whole system to the maximum extent. Meanwhile, the query statement is decomposed into a plurality of query clauses, and compared with the existing optimization mode of the query statement level, the query optimization granularity of the data query method is finer, and the data query method is more suitable for large data query in data warehouse application, so that the query performance of the whole system is further improved.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.

FIG. 1 is a flow chart of a data query method of the present application;

fig. 2 is a schematic diagram of a structure of the data query device according to the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.

In one exemplary configuration of the present application, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include non-transitory computer readable media (transient media), such as modulated data signals and carrier waves.

The steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.

A query clause in this context means that when a query is a condition of another query, the query is called a sub-query, wherein the outer query is called a main query. The sub-queries are executed prior to the main query, with the results of the execution of the sub-queries being used for the main query. The sub-queries and the main query are collectively referred to herein as query clauses. A query clause having a defined input set Pin and output set Pout, and an operation instruction set phi for performing input-to-output conversion_operatorAnd the three parts are formed and called query clause triplets as tau ═ Pin, Pout, phi_operatorAnd the query clause is used for uniquely identifying the query clause. The source of an entry for a query clause is either another query clause or from the Table or Table Partition, denoted as Pin ═ dom | τ, Table, Partition }.

Fig. 1 is a flowchart of a data query method of the present application, as shown in fig. 1, which mainly includes:

step 100: the query statement is decomposed into more than one query clause.

In this step, before querying, the query statement is decomposed into one or more query clauses, and the query clauses obtained by decomposition are used to construct a determined query clause operator tree. Each query clause is composed of a determined input set, an output set and an operation operator tree, whereinThe operator tree is composed of a set of operation instruction set phi_operatorForming; a common operation Operator instruction is shown in table 1:

operate Operator instruction	Means of
		TableScanOperator	Scan table data
SelectOperator	Selecting an output column
		FilterOperator	Filtering input data
FileSinkOperator	Building result data and outputting the result data to a file
		ReduceSinkOperator	Creating a message to be sent to a Reducer end<Key,Value>To pair
JoinOperator	Multi-source data JOIN
		GroupByOperator	GroupBy statement
LimitOperator	Limit statement
		UnionOperator	Union statements
…

TABLE 1

In this step, the specific implementation of the method for decomposing the query statement into the query clause belongs to the conventional technical means of those skilled in the art, and the specific implementation thereof is not used to limit the scope of the present application. This step emphasizes that the query statement is decomposed into one or more query clauses prior to the query. Therefore, compared with the existing query mode positioned in a query statement layer, the query granularity of the data query method is finer, and the data query method is more suitable for large data query in data warehouse application.

Step 100 is preceded by: obtaining the query statement, verifying the validity and the correctness of the query statement, and further normalizing and optimizing the obtained query statement. The purpose of normalization and query optimization is normalization, that is, query statements implementing the same function are expressed in a common optimal manner according to known set formatting rules and optimization strategies. Common optimization strategies include column reduction, partition reduction, predicate push-down, and the like. The detailed implementation is a routine matter for those skilled in the art, and is not described herein nor used to limit the scope of the present application.

Step 101: traversing the query clause, matching the query clause with the metadata, and replacing the query clause with an execution result stored in the metadata; otherwise, the query clause is executed and the metadata is updated.

The metadata is stored in a preset metadata base, and comprises triple information corresponding to each query clause, and also comprises information such as the failure time of the query clause corresponding to the triple information, the number of times of using the query clause, the level of the query clause in a query node tree, query result identification and the like. The failure time is used for clearing metadata when the failure time is up, and recovering resources occupied by the query clauses which are not used for a long time; the number of used times is used for optimizing the execution priority of the query clause from the perspective of the whole system, the number of used times and the number (hierarchy) of query nodes can be used as the weight of priority calculation, for example, the query clauses with higher priorities share more calculation resources and priority in time when the number of used times is more, the priority is higher, and the like; the level of the query clause is the maximum value of the levels of all query clauses in the input set plus 1, and if the input set is a table or a table partition, the level of the query clause is 0; the query result identification can be understood as a table name, and if the query result identification is a query clause root node and the table name is specified by a user, the table name is used; otherwise, the table name of a temporary table is selected, which is not transparent to the user, and the user does not sense the existence of the temporary table. The data of the query clause result set can be located according to the table name.

The method specifically comprises the following steps: the query sub-sentence subtree determined in step 100 is traversed, e.g., in a subsequent order, matching the metadata with the triplet information of the query sub-sentence,

if the triple information of the current query clause exists in the metadata base, the triple information of the query clause is matched with the metadata, at the moment, the query clause is indicated to be executed before, and the data of the execution result is effective, then the query clause is directly replaced by the execution result stored in the metadata, namely the table related information of the corresponding record; resetting the expiration time of the metadata according to the existing session parameter settings (if the session parameter settings do not exist, a default value such as 3 days may be used), and further, the number of times the record is used may be increased by one;

if the triple information of the current query clause does not exist in the metadata base, the triple information of the query clause is not matched with the metadata, at the moment, the query clause is not executed before, then the current query clause is executed, the persistent execution result is to store the execution result in the distributed file storage system, and the corresponding triple information is added in the metadata; the expiry time of this metadata is set according to the existing session parameter settings (if no session parameter settings exist, a default value such as 3 days may be used), and the depth (level) of the query clause in the query clause subtree. It should be noted that, if the query clause is a sub-query, the name of the output data set is identified by using TMP _ as a prefix, and a specific name can be randomly generated to ensure that the name is globally unique; if the query clause is a main query, if the name of the defined output table exists, the name is used, and the failure time is set as never failure (the table created by the corresponding user is invalid only when the user deletes the table); otherwise, the setting method is consistent with the sub-query.

Step 102: and using the finally obtained execution result for the main query to obtain a query result.

After all the query clauses are traversed, the execution results of the sub-queries in the query clauses are iterated, namely the execution results of the query clause root node query clauses are used for main query, and finally the query results are obtained.

To sum up, the present application specifically includes: decomposing the query statement into one or more query clauses and outputting a query clause operator tree; traversing the query clause calculation tree, if the query clause is matched with the metadata, replacing the matched query clause with an execution result stored in the metadata; executing the unmatched query clauses and updating the execution result information to the metadata; and using the execution result of the query clause sub-tree root node in the main query to obtain a query result.

As can be seen from the matching process in step 101 of the present application, the present application manages the executed query clauses and the execution results thereof as metadata, and directly replaces the query clauses with the execution results stored in the metadata database for the same query clauses in the current session or different sessions, thereby avoiding repeated execution of the query clauses and further improving the query performance of the entire system to a greater extent. Meanwhile, the query statement is decomposed into a plurality of query clauses before query, and compared with the optimization of the existing query statement level, the query optimization granularity of the data query method is finer, and the data query method is more suitable for large data query in data warehouse application, so that the query performance of the whole system is further improved.

Further, in step 100 of the present application, when the query statement is decomposed into a plurality of query clauses, the method may further include: the query clause is optimized. Unnecessary query clauses are eliminated by further optimizing the query clauses, so that unnecessary queries are reduced, and the query performance of the whole system is further improved.

The optimizing the query clause specifically includes: and presetting a query clause optimization strategy, optimizing the query clauses obtained by decomposition according to the optimization strategy, and eliminating unnecessary query clauses. The optimization strategy may include, but is not limited to, the following:

1) the input set and the output set of the query clause are consistent (Pin ═ Pout), namely the query clause is directly equal to the input set, the result data set of the query clause has no help for optimization, and the result data set does not need to be synchronously added into a metadata base of the query clause, namely the query clause is deleted;

2) the only difference between query clauses is in the output columns, and the output column of one query clause is a subset of the output column of another query clause, such as:

query clause a) is: SELECT name FROM employee title ═ Teamlead';

query clause b) is: SELECT FROM element thread tier title ═ teamlead';

the query clause a completes query on the name field of the employee table, and records meeting the title field and having the value equal to the TeamLeader are taken out; and the query clause b completes the query of all fields of the employee table, and takes out the record of which the value of the field conforming to the title is equal to the TeamLeader.

That is, the data result set for query clause a) is a subset of query clause b); in this case, the optimization method is as follows: query clause b) is executed first, assuming the output result set is: TMP _20130318_012345, then, query clause a) can be optimized as: SELECT name FROM TMP _20130318_ 012345. This case is equivalent to incorporating the output result set of query clause b) into query clause a), thereby simplifying query clause a) while also deleting query clause b).

3) The query clauses differ only in the output columns, and the output column of one query clause is complementary to the other output column, such as:

query clause a) is: SELECT name FROM employee title ═ Teamlead';

the query statement b) is: SELECT addr FROM empty tier title ═ TeamLeader';

wherein, the query clause a) completes the query of the name field of the employee table, and takes out all records of which the value conforming to the title field is equal to the TeamLeader; and b) finishing the query of the addr field of the employee table, and taking out all records of which the values meeting the title field are equal to the TeamLeader.

In this case, the optimization method is as follows: and merging the two query clauses, wherein the merged query clause is a query clause c): SELECT name, addr FROM employee title ═ 11'.

4) The two query clause input sets are the same, and the filtering conditions are different, such as:

query clause a) is: SELECT name FROM employee title ═ Teamlead';

query clause b) is: SELECT addr FROM WHERE tier title ═ TeamLeader 'and name ═ Smich';

wherein, the query clause a) completes the query of the name field of the employee table, and takes out all records of which the value conforming to the title field is equal to the TeamLeader; a query clause b) completes the query of the addr field of the employee table, and takes out all records of which the value of the title field is equal to the TeamLeader and the value of the name field is equal to Smich;

in this case, the optimization method is as follows: extracting the common part of the query clause constitutes query clause c), such as:

query clause c) is: SELECT name, addr FROM employee name title ═ TeamLeader';

and c) finishing the query of the employee table name and addr fields, and taking out all records of which the values meeting the title fields are equal to the TeamLeader.

Assume that the output result set for query clause c) is: TMP _20130318_222222, then the query clause can be simplified according to the resulting output result set, specifically:

query clause a) may be optimized as: SELECT name FROM TMP _20130318_ 222222;

optimizing query clause b) as: SELECT addr FROM TMP _20130318_222222WHERE name is 'Smich';

5) one query clause with existence or relationship is divided into two or more query clauses, so that the hit rate of the common query clause can be improved. Such as for

Query clause a): SELECT FROM element WHERE title ═ TeamLeader 'or title ═ Manager';

wherein, the query clause a) completes the query of all fields of the employee table, and takes out all records of which the values meeting the title fields are equal to the TeamLeader or Manager.

The method can be split into:

query clause b): SELECT FROM element thread tier title ═ teamlead';

query clause c): SELECT FROM EMPLOYEE WHERE title'

Wherein, the query clause b) completes the query of all fields of the employee table, and takes out all records of which the values meeting the title fields are equal to the TeamLeader; and c) finishing the query of all fields of the employee table, and taking out all records of which the values of the fields conforming to the title are equal to the Manager.

The method of the present application further comprises: and managing the metadata base, namely scanning the metadata at intervals and clearing out the expired query clauses. Wherein, the interval time can be preset to days or hours;

in addition to the normal situation of clearing the corresponding query clause, such as the expiry of expiry time, there are several situations that can cause the result data of the query clause to expire, such as: a User Defined Function (UDF) generating result data of a certain query clause is expired; for another example: generating one of the input data sets of the result data of a certain query clause, wherein the data change occurs; the following steps are repeated: the output set of a certain query clause is forcibly recycled by the user. When the query clause is invalid, it means that all query clauses depending on the execution result of the query clause are simultaneously invalid; and if the execution result corresponding to the query clause is an intermediate result, deleting the intermediate result from the distributed file storage system at the same time.

The present application also provides a data query apparatus, as shown in fig. 2, including at least a metadata database, a decomposition module, a processing module, and an output module, wherein,

and the metadata base is used for storing metadata, and comprises triple information corresponding to each query clause, and also comprises information such as the failure time of the query clause corresponding to the triple information, the number of times of using the query clause, the level of the query clause in the query node tree, the query result identification and the like.

The decomposition module is used for decomposing the query statement into one or more query clauses;

and the output module is used for using the finally obtained execution result in the main query to obtain a query result.

The data query device of the application further comprises: the acquisition module is used for acquiring the query statement and verifying the validity and the correctness of the query statement; and further, normalizing and optimizing the obtained query statement.

Wherein,

when the triple information of the query clause exists in the metadata base, replacing the query clause with an execution result, namely table related information of a corresponding record; resetting the expiration time of the metadata, further, adding one to the number of times the record is used;

when the triple information of the query clause does not exist in the metadata base, executing the current query clause, persisting the execution result, and adding corresponding triple information in the metadata; the expiration time of the metadata is set, and the depth (level) of the query clause in the query clause subtree.

And the output module is specifically used for iterating the execution result of the sub-query in the query clause, using the iteration result to carry out the main query, and finally obtaining a query result.

Besides, a query clause optimization strategy is preset in the decomposition module, and the decomposition module is further used for optimizing the query clauses obtained through decomposition according to the optimization strategy.

The processing module is also used for managing the metadata base and eliminating invalid query clauses:

and/or the presence of a gas in the gas,

According to the device, the executed query clauses and the execution results of the query clauses are managed as metadata, and the query clauses are directly replaced by the execution results stored in the metadata database for the same query clauses in the current session or different sessions, so that repeated execution of the query clauses is avoided, and the query performance of the whole system is improved to the maximum extent. Meanwhile, the query statement is decomposed into a plurality of query clauses before query, and compared with the optimization of the existing query statement level, the query optimization granularity of the data query method is finer, and the data query method is more suitable for large data query in data warehouse application, so that the query performance of the whole system is further improved.

In addition, unnecessary query clauses are removed through further optimization of the query clauses, and unnecessary queries are reduced, so that the query performance of the whole system is further improved.

It will be appreciated by those skilled in the art that the components of the apparatus and steps of the method provided in the embodiments of the present application described above may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented in program code executable by a computing device. Thus, they may be stored in a memory device for execution by a computing device, or they may be separately fabricated as individual integrated circuit modules, or multiple modules or steps thereof may be fabricated as a single integrated circuit module for implementation. Thus, the present application is not limited to any specific combination of hardware and software.

Although the embodiments disclosed in the present application are described above, the descriptions are only for the convenience of understanding the present application, and are not intended to limit the present application. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims

1. A method for querying data, comprising:

presetting a query clause optimization strategy, decomposing a query clause into more than one query clause, and optimizing the decomposed query clauses according to the optimization strategy;

wherein the query clause optimization strategy comprises:

and splitting the query clause with the existence or relationship into more than two query clauses.

2. The data query method of claim 1, further comprising, before the method: and acquiring the query statement, and verifying the validity and the correctness of the query statement.

3. The data query method of claim 2, further comprising: and carrying out normalization and query optimization on the query statement.

4. The data query method according to any one of claims 1 to 3, wherein traversing the query clause specifically includes:

5. The data query method of claim 4, wherein when the triplet information of the query clause matches the metadata, the method further comprises: and adding one to the number of times of using the record corresponding to the execution result.

6. The data query method of claim 1, wherein the optimizing the decomposed query clause according to the query clause optimization strategy comprises: and eliminating unnecessary query clauses.

7. The data query method according to any one of claims 1 to 3, wherein the metadata is stored in a metadata database; the method also includes managing the metadata repository:

and/or the presence of a gas in the gas,

8. The data query method of claim 7, wherein the deleting the stale query clause comprises: deleting all query clauses that depend on the execution result of the query clause;

9. The data query method of claim 8, wherein the metadata comprises: the query node tree query method comprises the steps of three-tuple information corresponding to each query clause, failure time of the query clause corresponding to the three-tuple information, the number of times of the query clause being used, the level of the query clause in the query node tree and query result identification.

10. The data query method of claim 9, wherein the triples include an input set, an output set, and an operation instruction set to complete an input-to-output conversion.

11. A data query device is characterized by at least comprising a metadata database, a decomposition module, a processing module and an output module, wherein,

the decomposition module is used for presetting a query clause optimization strategy, decomposing the query clause into more than one query clause and optimizing the query clause obtained by decomposition according to the optimization strategy;

wherein the query clause optimization strategy comprises:

12. The data query device of claim 11, further comprising: and the acquisition module is used for acquiring the query statement and verifying the validity and the correctness of the query statement.

13. The data query device of claim 12, wherein the obtaining module is further configured to perform normalization and query optimization on the query statement.

14. The data query device according to any one of claims 11 to 13, wherein the processing module is specifically configured to perform matching using the triplet information of the query clause and the metadata:

15. The apparatus according to claim 14, wherein the processing module is further configured to, when the triplet information of the query clause matches the metadata, increment a number of times that the record corresponding to the execution result is used by one.

16. The data query device according to any one of claims 11 to 13, wherein the output module is specifically configured to iterate execution results of sub-queries in the query clause and use the iterated execution results for main queries to obtain query results.

17. The data query device according to any one of claims 11 to 13, wherein the processing module is further configured to manage a metadata database, and clear an invalid query clause:

and/or the presence of a gas in the gas,