CN116244333B

CN116244333B - Database query performance prediction method and system based on cost factor calibration

Info

Publication number: CN116244333B
Application number: CN202310124785.2A
Authority: CN
Inventors: 彭朝晖; 王一博; 张甲康; 陈倩
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2023-02-16
Filing date: 2023-02-16
Publication date: 2025-07-29
Anticipated expiration: 2043-02-16
Also published as: CN116244333A

Abstract

The invention belongs to the technical field of data processing, and provides a database query performance prediction method and system based on cost factor calibration, wherein the method comprises the steps of calibrating cost factors in a database query optimizer module in a statistical regression mode; extracting class characteristics of nodes of the execution plan by adopting a dictionary coding mode for the history execution plan executed in the optimizer, generating embedded characteristics for the class characteristics of the nodes of the execution plan through a characteristic processing neural network, splicing the embedded characteristics into an embedded characteristic tree according to an execution structure of the original execution plan, performing binary treeing on the embedded characteristic tree, training the convolutional neural network according to the embedded characteristic tree, and obtaining query performance by adopting the trained tree-shaped convolutional neural network for the execution plan to be processed. The more accurate and more effective query performance prediction effect is realized.

Description

Database query performance prediction method and system based on cost factor calibration

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a database query performance prediction method and system based on cost factor calibration.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

In recent years, much research effort has focused on both optimizing the traditional cost model of the database and using neural networks to learn and replace the traditional cost model. For the former, the computing power of the traditional cost model of the database can be effectively improved, the iterative update of the database optimizer is promoted, but the cost computing formula can not be obtained all the time, the existing formula can not accurately predict the query performance in many complex scenes, the focus of the existing cost formula is mostly the relative magnitude value of the cost between execution plans, no specific value unit exists, and for different hardware and execution environments, the default relative ratio between computing units is used, and the cost computing result can not reflect the execution performance of the query well. For the latter, the neural network can effectively fit the execution performance of the query, but the feature construction aspect is often low-density and sparse in information, so that the association and structure information between nodes in an execution plan cannot be well extracted by the features input by the network model, meanwhile, the knowledge of the existing cost model in the database is often ignored by the existing machine learning method, and the prediction accuracy of the query performance is greatly improved.

For the two current query performance prediction modes, the common advantages are that the used technologies are mature, but lack of fusion, and three problems of query performance prediction are not solved, namely firstly, the attention point of a cost model lacks accuracy in the aspect of query performance prediction because of comparing the relative values of planned costs, and a cost formula facing the aspect of query performance needs to be designed for supplement or correction. Secondly, many neural networks ignore partial order relations in an execution plan when constructing features, obtain sparse coding information for the execution nodes, and cannot capture correlation of the execution nodes and tree structure information of the execution plan well. Third, the existing neural network does not combine the cost model knowledge of the database to learn, and cannot effectively fuse the domain knowledge of the cost model of the database.

Disclosure of Invention

In order to solve at least one technical problem in the background art, the invention provides a database query performance prediction method and system based on cost factor calibration, which effectively fuses the domain knowledge of a cost model in a database query optimizer, establishes a feature processing neural network and an execution plan-oriented tree convolution neural network, effectively captures the relevance of execution nodes and tree structure information of an execution plan, and effectively improves the accuracy of a query performance prediction result. The method can solve key problems in query optimization and query scheduling.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

The first aspect of the invention provides a database query performance prediction method based on cost factor calibration, which comprises the following steps:

acquiring historical data of an execution plan corresponding to the regression query statement;

fitting the numerical value of the cost factor in the cost formula by adopting a statistical regression calibration method according to the historical data of the execution plan corresponding to the regression query statement to obtain a cost model of the database query optimizer;

Acquiring an original training data set, acquiring a training data feature set and a training data label set based on the original training data set, training a tree-shaped deep convolutional neural network model based on the training data feature set and the training data label set, and obtaining a deep calibration model;

and aiming at the execution plan to be predicted, combining a depth calibration model to obtain a predicted error time offset, obtaining cost time based on a database query optimizer cost model, and summing the predicted error time offset and the cost time to obtain the execution time of the query execution plan to be predicted.

A second aspect of the present invention provides a database query performance prediction system based on cost factor calibration, comprising:

The data acquisition module is used for acquiring historical data of the execution plan corresponding to the regression inquiry statement;

The cost model construction module is used for fitting the numerical value of the cost factor in the cost formula by adopting a statistical regression calibration method according to the historical data of the execution plan corresponding to the regression query statement to obtain a cost model of the database query optimizer;

the depth calibration model building module is used for acquiring an original training data set, acquiring a training data feature set and a training data label set based on the original training data set, training a tree-shaped depth convolutional neural network model based on the training data feature set and the training data label set, and acquiring a depth calibration model;

The query performance prediction module is used for obtaining a predicted error time offset by combining a depth calibration model aiming at an execution plan to be predicted, obtaining cost time based on a database query optimizer cost model, and summing the predicted error time offset and the cost time to obtain the execution time of the execution plan to be predicted.

A third aspect of the present invention provides a computer-readable storage medium.

A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of a method of predicting database query performance based on cost factor calibration as described above.

A fourth aspect of the invention provides a computer device.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in a cost factor calibration based database query performance prediction method as described above when the program is executed.

Compared with the prior art, the invention has the beneficial effects that:

1. the neural network prediction scheme used by the invention effectively merges the domain knowledge of the cost model in the database query optimizer, and can effectively improve the accuracy of the prediction result.

2. According to the invention, the cost factors are calibrated in a statistical regression mode, each cost factor is adjusted to be a numerical value which is suitable for the current hardware environment, and the numerical value represents the real execution time of the cost factor, so that the result calculated by the cost formula is more referential.

3. The invention adopts the characteristic processing neural network for the coding of the execution plan nodes, converts the characteristic vector with sparse information into the embedded vector with dense information, and effectively enriches the characteristic information of the execution nodes.

4. According to the invention, for the performance prediction model of the execution plan, a tree-shaped convolutional neural network facing the execution plan is established, the mutual relevance of the execution nodes and the tree-shaped structure information of the execution plan are effectively captured, the model prediction error time offset is used, and the query performance jointly predicted by the result of the joint database cost formula is more accurate.

Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

FIG. 1 is a flowchart of an overall prediction method for predicting database query performance according to an embodiment of the present invention;

FIG. 2 is a flowchart of an overall framework for implementing a method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a neural network for feature processing according to an embodiment of the present invention;

FIG. 4 is a binary tree representation of a query execution plan in accordance with an embodiment of the present invention;

fig. 5 is a schematic diagram of a tree convolutional neural network according to an embodiment of the present invention.

Detailed Description

The invention will be further described with reference to the drawings and examples.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

Example 1

As shown in fig. 1-2, the present embodiment provides a method for predicting database query performance based on cost factor calibration, including the following steps:

s1, acquiring historical data of an execution plan corresponding to a regression query statement;

s2, fitting the numerical value of the cost factor in the cost formula by adopting a statistical regression calibration method according to the historical data of the execution plan corresponding to the regression query statement to obtain a cost model of the database query optimizer;

S3, acquiring an original training data set, and acquiring a training data feature set and a training data label set based on the original training data set;

and S4, aiming at the execution plan to be predicted, combining a depth calibration model to obtain a predicted error time offset, obtaining cost time based on a database query optimizer cost model, and summing the predicted error time offset and the cost time to obtain the execution time of the query execution plan to be predicted.

The step S1 of obtaining the historical data of the execution plan corresponding to the regression query statement comprises the following steps:

and inputting and operating a group of regression query sentences into the database to obtain and store historical data of execution plans corresponding to the regression query sentences, wherein the historical data comprises information such as a base number, the number of scanning blocks, execution time and the like.

In this embodiment, the regression query includes the following five queries Q1 through Q5, wherein R, T is the relationship table present in the database and A, B is the index column on the R relationship table.

Q1:SELECT*FROM R

Q2:SELECT*FROM T

Q3:SELECT COUNT(*)FROM R

Q4:SELECT*FROM R where R.A<a

Q5:SELECT*FROM R where R.B<b

The historical data of the execution plan corresponding to the regression query statement comprises real execution time T1-T5 corresponding to Q1-Q5, R table base number n _R, R table occupied data page number p _R, T table base number n _T, T table occupied data page number p _T, COUNT function processing line number n _F, R.A index processed line number n _RA, R.A index occupied data page number p _RA, R.A < a filter operation processed line number n _FA, R.B index processed line number n _RB, R.B index occupied data page number p _RB, R.B < b filter operation processed line number n _FB.

And calibrating the cost model of the database query optimizer according to the real execution time of the regression query statement. And fitting the numerical value of the cost factor in the cost formula by adopting a statistical regression calibration method, and giving the actual representation meaning of the cost factor, wherein the execution time is taken as a measurement unit.

The cost formulas corresponding to the regression queries Q1 through Q5 are as follows:

Q1:Cost=p_R·c_s+n_R·c_t

Q2:Cost=p_T·c_s+n_T·c_t

Q3:Cost=p_R·c_s+n_R·c_t+n_F·co

Q4:Cost=p_RA·c_r+n_RA·c_t+n_RA·c_i+n_FA·c_o

Q5:Cost=p_RB·c_r+n_RB·c_t+n_RB·c_i+n_FB·c_o

The cost factors of the statistical regression calibration comprise an I/O cost factor c _s for scanning one data page sequentially, an I/O cost factor c _r for scanning one data page randomly, a CPU cost factor c _t for processing one row of data, a CPU cost factor c _i for processing one index item and a CPU cost factor c _o for processing each function or operator.

The cost calculation formula related to each query in the regression query group comprises cost factors of different types, the regression query with enough quantity is designed and ensured in the mode, the real execution time of each query and the operation quantity of each cost factor are executed and collected to serve as the true value result and coefficient of the cost calculation formula, and the numerical values of the different cost factors are obtained through statistical regression by different query combinations;

And carrying out Q1-Q5 cost formulas to obtain the numerical value of the cost factor through the obtained execution plan historical data, including information such as the base number, the number of scanning blocks, the execution time and the like.

S3, the original training data set acquisition method comprises the following steps:

A set of SQL query workloads are input and run into the database system, which generates a corresponding execution plan for each SQL, each execution plan comprising a plurality of operations and execution times, and after each SQL execution, the obtained execution plan is stored as an original training dataset.

In S3, feature encoding processing is performed on the execution plan of the original training data set, as shown in fig. 3 to 4.

Firstly, extracting category characteristics from an execution plan node, extracting database system table data and generating a dictionary table, and carrying out dictionary coding on the execution plan node according to the dictionary table;

The category characteristics contained in the dictionary comprise category characteristics such as node types, connection modes, scanning directions, relation table names, relation table aliases, index names, aggregation strategies and the like;

The node types comprise a scanning node type, a control node type, a materialized node type and a connection node type, wherein the scanning node type comprises a Seq Scan, an Index Scan, a Bitmap Index/Heap Scan and the like, the control node type comprises an application and the like, the materialized node type comprises a Materialize, sort, group, aggregation and the like, and the connection node type comprises a close Loop Join, a Hash Join, a Merge Join and the like.

And constructing dictionary table data according to the extracted metadata information, namely database system table data, wherein the numerical value of each item in the dictionary is one-hot code, for example, if the node type information of the extracted node from the execution plan is Seq Scan, the numerical value at the corresponding position of the Seq Scan is 1, and the numerical value at the rest positions of the node types is 0.

Based on an original training data set, sequentially extracting information of nodes in each execution plan in a subsequent traversing mode, coding the nodes according to category characteristic information in a dictionary table to obtain dictionary coding characteristics, and taking the running time of the nodes as a label corresponding to the characteristics;

Training the feature processing neural network based on dictionary coding features and corresponding labels to obtain a trained feature processing neural network;

The characteristic neural network is characterized in that a first layer is an input layer, a middle 6 layer is a hidden layer, a last layer is an output layer, the first two hidden layers are RNN layers, the third hidden layer is a flattening layer, the last three hidden layers are full-connection layers, activation functions used by the hidden layers and the output layer are Relu, the first two RNN hidden layers output 35 nodes, each node dimension is 128, the number of the nodes output by the flattening layer is 4480, the number of the nodes of the last three full-connection layers is 256, 128 and 64 in sequence, the loss function is a mean square error cost function, the obtained embedded vector is used as a characteristic to process the tail end hidden layers of the neural network, and the output layer is real time for executing a planned node operation.

And processing the neural network based on the original training data set and the trained features to obtain an embedded vector of each node of the original training data set execution plan.

The acquiring the training data feature set and the training data label set based on the original training data set comprises:

Executing the embedded vector of each node of the plan based on the original training data set, combining the embedded vectors according to the tree structure of the execution plan, and performing binary trepanning on the embedded vector tree to generate a feature vector binary tree of each plan, so as to obtain a training data feature set;

and (3) making a difference between the real execution time of each execution plan of the original training data set and the cost time calculated by the cost factor to obtain an error time offset, and obtaining a training data tag set.

S3, training the tree-shaped deep convolutional neural network model based on the training data feature set and the training data label set to obtain a deep calibration model;

As shown in FIG. 5, the model structure of the depth calibration model is that the first layer is an input layer, the middle 7 layers are hidden layers, the last layer is an output layer, the first three hidden layers are convolution layers, the fourth hidden layers are pooling layers, the last three hidden layers are full-connection layers, the activation functions used by the hidden layers and the output layers are Relu, the number of nodes of the first three convolution layers is 512, 256 and 128 in sequence, the number of nodes output by the pooling layers of the hidden layers is 128, the number of nodes of the last three full-connection layers is 128, 64 and 32 in sequence, the loss function is a mean square error cost function, and the obtained output layer is the error time offset of an execution plan node.

Example two

The embodiment provides a database query performance prediction system based on cost factor calibration, which comprises the following steps:

Example III

The present embodiment provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of a database query performance prediction method based on cost factor calibration as described above.

Example IV

The present embodiment provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps in a database query performance prediction method based on cost factor calibration as described above when the program is executed by the processor.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random-access Memory (Random Access Memory, RAM), or the like.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A database query performance prediction method based on cost factor calibration, characterized by comprising the following steps:

Obtain historical data of the execution plan corresponding to the regression query statement;

Based on the historical data of the execution plan corresponding to the regression query statement, the statistical regression calibration method is used to fit the value of the cost factor in the cost formula to obtain the database query optimizer cost model;

Obtaining an original training data set, and obtaining a training data feature set and a training data label set based on the original training data set; training a tree-shaped deep convolutional neural network model based on the training data feature set and the training data label set to obtain a deep calibration model;

For the execution plan to be predicted, the deep calibration model is combined to obtain the predicted error time offset. The cost time is obtained based on the database query optimizer cost model. The predicted error time offset and the cost time are summed to obtain the execution time of the query execution plan to be predicted.

The cost factors in the cost formula include the I/O cost factor of sequentially scanning a data page, the I/O cost factor of randomly scanning a data page, the CPU cost factor of processing a row of data, the CPU cost factor of processing an index item, and the CPU cost factor of processing each function or operator;

The step of obtaining a training data feature set and a training data label set based on the original training data set includes:

Based on the embedding vector of each node of the execution plan of the original training data set, the embedding vectors are combined according to the tree structure of the execution plan, and the embedding vector tree is binarized to generate a binary tree of feature vectors for each plan to obtain the training data feature set;

The actual execution time of each execution plan in the original training data set is subtracted from the cost time calculated by the cost factor to obtain the error time offset and the training data label set.

2. The database query performance prediction method based on cost factor calibration according to claim 1, wherein the cost factor values in the cost formula are fitted using a statistical regression calibration method based on historical data of execution plans corresponding to regression query statements, and the cost factor values in the cost formula are fitted using the following methods:

Obtain known quantities in the cost formula corresponding to all regression query statements based on historical data;

Substitute the known quantity into the corresponding cost formula, solve all the cost formulas together to obtain the value of the cost factor.

3. A database query performance prediction method based on cost factor calibration as described in claim 1 is characterized in that the cost calculation formula involved in each query in the regression query contains different types of cost factors, the actual execution time of each query and the number of operations of each cost factor are executed and collected, and the true value result and coefficient of the cost calculation formula are used as the true value result and coefficient of the cost calculation formula, and statistical regression is performed through different query combinations to obtain the numerical values of different cost factors.

4. The database query performance prediction method based on cost factor calibration according to claim 1, wherein the embedding vector of each node in the execution plan of the original training dataset is generated by:

Based on the original training data set, the node information in each execution plan is extracted in sequence through subsequent traversal. The nodes are encoded according to the category feature information in the dictionary table to obtain dictionary encoding features, and the node running time is used as the label corresponding to the feature. The feature processing neural network is trained based on the dictionary encoding features and the corresponding labels to obtain a trained feature processing neural network.

Based on the original training dataset and the trained feature processing neural network, the embedding vector of each node of the execution plan of the original training dataset is obtained.

5. A database query performance prediction method based on cost factor calibration as described in claim 3, characterized in that the historical data of the execution plan corresponding to the regression query statement includes: cardinality, number of scan blocks and execution time.

6. A database query performance prediction system based on cost factor calibration, characterized by comprising:

The data acquisition module is used to obtain historical data of the execution plan corresponding to the regression query statement;

A cost model construction module is used to fit the values of cost factors in a cost formula using a statistical regression calibration method based on historical data of execution plans corresponding to regression query statements to obtain a database query optimizer cost model; wherein the cost factors in the cost formula include the I/O cost factor of sequentially scanning a data page, the I/O cost factor of randomly scanning a data page, the CPU cost factor of processing a row of data, the CPU cost factor of processing an index item, and the CPU cost factor of processing each function or operator;

A deep calibration model construction module is used to obtain an original training data set, obtain a training data feature set and a training data label set based on the original training data set; and train a tree-shaped deep convolutional neural network model based on the training data feature set and the training data label set to obtain a deep calibration model.

Subtract the actual execution time of each execution plan in the original training dataset from the cost time calculated using the cost factor to obtain the error time offset and the training data label set.

The query performance prediction module is used to obtain the predicted error time offset for the execution plan to be predicted in combination with the deep calibration model, obtain the cost time based on the database query optimizer cost model, and sum the predicted error time offset and the cost time to obtain the execution time of the query execution plan to be predicted.

7. A computer-readable storage medium having a computer program stored thereon, wherein when the program is executed by a processor, the program implements the steps of the database query performance prediction method based on cost factor calibration according to any one of claims 1 to 5.

8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the processor implements the steps of a database query performance prediction method based on cost factor calibration as described in any one of claims 1 to 5.