CN116150190A - Database query optimization processing method and system based on tree QRNN
- Publication number
- CN116150190A (application CN202310120892.8A)
- Authority
- CN
- China
- Prior art keywords
- node
- tree
- neural network
- quasi
- execution plan
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24532—Query optimisation of parallel queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0463—Neocognitrons
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a database query optimization processing method and system based on a tree QRNN. The method includes: obtaining an input structured query statement; determining the execution plan corresponding to the structured query statement and obtaining the tree structure of the execution plan, where each node in the execution plan tree includes a table, an operation type, and/or a predicate condition; converting the information corresponding to each node in the execution plan tree into a feature vector; constructing and training a query optimizer model, where, during training, the query optimizer model takes historical execution plans as sample data and the actual query costs corresponding to the historical execution plans as target values; and inputting the feature vector of each node in the execution plan tree into the corresponding node of the tree-structured quasi-recurrent neural network of the query optimizer model to obtain the query cost output by a multi-layer perceptron, and determining the target execution plan based on the query cost.
Description
Technical Field
The present invention relates to the technical field of database query optimization, and in particular to a database query optimization processing method and system based on a tree QRNN.
Background
The statements in this section merely provide background information related to the present invention and do not necessarily constitute prior art.
Basic software is an important foundation and support for the development of the information industry and for informatization, and core basic software is increasingly valued by enterprises. The database is one of the important components of basic software; its application fields are very broad, and many companies rely on databases to store their data. The query optimizer within a database largely determines the database's execution performance. Cost estimation is crucial to the query optimizer: it guides the selection of execution plans and selects an efficient execution plan for each query statement. Moreover, with the advent of the big-data era, data volumes have grown enormously, so building a reasonable cost estimation model for the query optimizer that can continuously process newly added data is of great practical significance.
With the continuous development of databases in recent years, databases have begun to be combined with various emerging technologies such as artificial intelligence, blockchain, and confidential computing. The combination of databases and artificial intelligence is a win-win: both sides benefit from it. Owing to the heuristic nature of query optimization, there have been many attempts to apply deep learning to cost estimation in query optimizers. For example, some approaches model the tree structure of the execution plan heterogeneously and use a linear method to capture the feature information of the execution plan. However, such methods do not encode the predicates in the execution plan and rely on the cardinalities and costs estimated by the relational database, which ultimately causes prediction errors, and a linear method cannot accurately capture the information of complex execution plans. Another line of research models the nodes of the execution plan with the long short-term memory network (LSTM), a variant of the recurrent neural network, which solves the problems of vanishing and exploding gradients; however, such methods are computationally expensive, cannot be parallelized, and are difficult to train. In addition, some researchers have adopted the Transformer model to capture the information of the nodes in the execution plan and a tree-biased attention mechanism to capture the tree-structural features of the execution plan.
Although the above methods have all achieved certain results, none of them comprehensively considers the factors that affect cost estimation in the query optimizer; furthermore, most of them only consider the static-data case and ignore the impact of incremental data on the cost estimation model in real-world scenarios.
Summary of the Invention
To address the deficiencies of the prior art, the present invention provides a database query optimization processing method and system based on a tree QRNN. The cost is estimated from existing execution plans and statistical information, and an incremental learning method is also adopted to continuously learn from new data, so that the database can better select an execution plan to execute according to the cost estimate, thereby improving the performance of the database in executing query statements.
In a first aspect, the present invention provides a database query optimization processing method based on a tree QRNN.
The tree-QRNN-based database query optimization processing method includes:
obtaining an input structured query statement;
determining the execution plan corresponding to the structured query statement and obtaining the tree structure of the execution plan, where each node in the execution plan tree includes a table, an operation type, and/or a predicate condition; and converting the information corresponding to each node in the execution plan tree into a feature vector;
constructing and training a query optimizer model, where the query optimizer model includes a tree-structured quasi-recurrent neural network and a multi-layer perceptron; the tree-structured quasi-recurrent neural network has a tree structure whose nodes correspond one-to-one to the nodes of the execution plan tree; each node of the tree-structured quasi-recurrent neural network is implemented with a quasi-recurrent neural network, and the root node of its tree structure is connected to the multi-layer perceptron; during training, the query optimizer model takes historical execution plans as sample data and the actual query costs corresponding to the historical execution plans as target values;
inputting the feature vector of each node in the execution plan tree into the corresponding node of the tree-structured quasi-recurrent neural network of the query optimizer model to obtain the query cost output by the multi-layer perceptron, and determining the target execution plan based on the query cost.
In a second aspect, the present invention provides a database query optimization processing system based on a tree QRNN.
The tree-QRNN-based database query optimization processing system includes:
an acquisition module configured to obtain an input structured query statement;
a determination module configured to determine the execution plan corresponding to the structured query statement and obtain the tree structure of the execution plan, where each node in the execution plan tree includes a table, an operation type, and/or a predicate condition, and to convert the information corresponding to each node in the execution plan tree into a feature vector;
a processing module configured to construct and train a query optimizer model, where the query optimizer model includes a tree-structured quasi-recurrent neural network and a multi-layer perceptron; the tree-structured quasi-recurrent neural network has a tree structure whose nodes correspond one-to-one to the nodes of the execution plan tree; each node of the tree-structured quasi-recurrent neural network is implemented with a quasi-recurrent neural network, and the root node of its tree structure is connected to the multi-layer perceptron; during training, the query optimizer model takes historical execution plans as sample data and the actual query costs corresponding to the historical execution plans as target values;
an output module configured to input the feature vector of each node in the execution plan tree into the corresponding node of the tree-structured quasi-recurrent neural network of the query optimizer model to obtain the query cost output by the multi-layer perceptron, and to determine the target execution plan based on the query cost.
In a third aspect, the present invention also provides an electronic device, comprising:
a memory for non-transitorily storing computer-readable instructions; and
a processor for executing the computer-readable instructions,
wherein, when the computer-readable instructions are executed by the processor, the method described in the first aspect is performed.
In a fourth aspect, the present invention also provides a storage medium that non-transitorily stores computer-readable instructions, wherein, when the non-transitory computer-readable instructions are executed by a computer, the instructions perform the method described in the first aspect.
In a fifth aspect, the present invention also provides a computer program product, comprising a computer program, where the computer program, when run on one or more processors, implements the method described in the first aspect.
Compared with the prior art, the beneficial effects of the present invention are as follows:
First, the histogram and the most common values are extracted from the statistics of the traditional database, the values on the horizontal axis of the histogram are redistributed by linear interpolation, and the histogram and most-common-value information are then feature-encoded against the predicate information, so that the database statistics can be fully exploited.
Then, a tree-structured quasi-recurrent neural network is used to model the execution plan; the model supports parallel computation and can capture long-term dependencies.
Finally, the method adopts incremental learning to process incremental data and further iteratively update the model weights, making the model more accurate under dynamic data.
Brief Description of the Drawings
The accompanying drawings, which constitute a part of the present invention, are provided for a further understanding of the present invention; the schematic embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it.
Fig. 1 is a model framework diagram of Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the principle of the tree-structured quasi-recurrent neural network unit of Embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of the principle of the incremental learning part of Embodiment 1 of the present invention;
Fig. 4 is a schematic diagram of the histogram encoding of Embodiment 1 of the present invention;
Fig. 5 is a schematic diagram of the one-to-one correspondence between the two tree structures of Embodiment 1 of the present invention.
Detailed Description of Embodiments
It should be noted that the following detailed description is exemplary and is intended to provide a further explanation of the present invention. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention belongs.
It should be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. Furthermore, it should be understood that the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to such a process, method, product, or device.
In the case of no conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other.
All data in this embodiment are obtained in compliance with laws and regulations and with user consent, and the data are applied lawfully.
Existing deep-learning cost estimation methods based on execution plans do not fully consider the influencing factors in cost estimation, and do not increase the parallelism of the model, so as to shorten training, while ensuring accurate modeling. In addition, they do not consider the iterative updating of model weights under incremental data.
Embodiment 1
This embodiment provides a database query optimization processing method based on a tree QRNN.
As shown in Fig. 1, the tree-QRNN-based database query optimization processing method includes:
S101: obtaining an input structured query statement;
S102: determining the execution plan corresponding to the structured query statement and obtaining the tree structure of the execution plan, where each node in the execution plan tree includes a table, an operation type, and/or a predicate condition; and converting the information corresponding to each node in the execution plan tree into a feature vector;
S103: constructing and training a query optimizer model; the query optimizer model includes a tree-structured quasi-recurrent neural network and a multi-layer perceptron; the tree-structured quasi-recurrent neural network has a tree structure whose nodes correspond one-to-one to the nodes of the execution plan tree; each node of the tree-structured quasi-recurrent neural network is implemented with a quasi-recurrent neural network, and the root node of its tree structure is connected to the multi-layer perceptron; during training, the query optimizer model takes historical execution plans as sample data and the actual query costs corresponding to the historical execution plans as target values;
S104: inputting the feature vector of each node in the execution plan tree into the corresponding node of the tree-structured quasi-recurrent neural network of the query optimizer model to obtain the query cost output by the multi-layer perceptron, and determining the target execution plan based on the query cost.
It should be understood that in S101, an input structured query statement is obtained, where structured query language (SQL) is a database query and programming language used to access data and to query, update, and manage data stored in a relational database.
Further, S102, determining the execution plan corresponding to the structured query statement, specifically includes:
using EXPLAIN ANALYZE in the PostgreSQL database to obtain the execution plan of the structured query statement (SQL).
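Illustratively, a minimal Python sketch of retrieving the plan tree from PostgreSQL is given below; the connection string, table names, and query are illustrative assumptions rather than part of the embodiment. EXPLAIN (ANALYZE, FORMAT JSON) both executes the statement and returns the plan tree together with the actual execution times.

```python
import psycopg2

conn = psycopg2.connect("dbname=imdb user=postgres")  # assumed connection
with conn.cursor() as cur:
    # EXPLAIN (ANALYZE, FORMAT JSON) executes the query and returns the
    # plan tree, including the actual total time of every plan node.
    cur.execute("EXPLAIN (ANALYZE, FORMAT JSON) "
                "SELECT * FROM title t JOIN cast_info ci "
                "ON t.id = ci.movie_id WHERE t.production_year > 2006")
    plan = cur.fetchone()[0][0]["Plan"]  # root node of the plan tree

# Each node exposes the fields the method encodes: the operation type
# ("Node Type"), the table ("Relation Name", if any), predicate conditions
# ("Filter" / "Hash Cond", if any), and the child nodes ("Plans").
print(plan["Node Type"], plan.get("Actual Total Time"))
```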
It should be understood that, during query path planning, the PostgreSQL database expresses the different execution schemes of a query request by building different paths; after generating a number of qualifying paths, the path with the lowest cost is selected from among them, converted into an execution plan, and passed to the executor for execution.
Further, each node in the execution plan tree includes a table, an operation type, and/or a predicate condition; specifically, every node in the execution plan tree includes a table and an operation type, while each node may or may not include a predicate condition.
Further, each node in the execution plan tree refers to a node on an execution path of the execution plan.
Further, obtaining the tree structure of the execution plan includes:
if there is a parent-child relationship between the current node and an adjacent node, a connection exists between the current node and that node;
if there is no parent-child relationship between the current node and an adjacent node, no connection exists between the current node and the adjacent node.
Further, S102, converting the information corresponding to each node in the execution plan tree into a feature vector, specifically includes:
S102-1: feature-encoding the table, the operation type, and the predicate condition in each node of the execution plan tree;
S102-2: feature-encoding the histogram and the most common values in the statistics of the queried database;
S102-3: concatenating all feature encoding results to obtain the feature vector.
Preferably, the operation type refers to the operation type of the current node in the execution plan tree and includes several scan and join types such as sequential scan (Seq Scan), index scan (Index Scan), and hash join (Hash Join).
Further, the predicate condition refers to a filter condition or a join condition used at the current node of the execution plan tree. For example, if the current node performs a sequential scan (Seq Scan) on table A, its predicate condition is the filter condition "A.id > 12".
It should be understood that pg_stats in PostgreSQL stores the statistics collected by the PostgreSQL database about all tables in the database, including the histograms and the most common values. PostgreSQL first selects the values with the highest frequencies from a table and stores them separately as the "most common values"; all remaining values are then summarized statistically in the form of an equal-frequency (equi-depth) histogram.
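Illustratively, the histogram boundaries and the most common values can be read directly from pg_stats; a minimal sketch follows, in which the table and column names are assumptions.

```python
import psycopg2

conn = psycopg2.connect("dbname=imdb user=postgres")  # assumed connection
with conn.cursor() as cur:
    cur.execute(
        "SELECT histogram_bounds, most_common_vals, most_common_freqs "
        "FROM pg_stats WHERE tablename = 'title' AND attname = 'production_year'"
    )
    histogram_bounds, most_common_vals, most_common_freqs = cur.fetchone()

# pg_stats exposes these columns as anyarray, which the driver returns as
# text such as '{2000,2005,2007,2019,2020,2023}'; strip and split to parse.
bounds = [int(v) for v in histogram_bounds.strip("{}").split(",")]
```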
Further, S102-1, feature-encoding the table, the operation type, and the predicate condition in each node of the execution plan tree, specifically includes:
performing one-hot encoding on the two types of information, table and operation type, yielding E_T and E_O;
when encoding an atomic predicate condition among the predicate conditions, combining one-hot encoding with word2vec encoding: the column and the operator are one-hot encoded, the value is encoded with a word2vec model, and the two are concatenated as the vector representation E_A of the atomic predicate;
when encoding a compound predicate condition among the predicate conditions, processing the atomic predicate conditions it contains with max-min pooling: the disjunction "OR" uses max pooling and the conjunction "AND" uses min pooling, and the final output is taken as the feature embedding vector E_P of the predicate condition.
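Illustratively, the predicate encoding described above can be sketched as follows; the column and operator vocabularies and the pretrained word2vec lookup w2v are assumptions, not part of the embodiment.

```python
import numpy as np

COLUMNS = ["title.id", "title.production_year", "cast_info.movie_id"]  # assumed schema
OPERATORS = ["=", "<", ">", "<=", ">=", "!="]

def one_hot(item, vocab):
    v = np.zeros(len(vocab), dtype=np.float32)
    v[vocab.index(item)] = 1.0
    return v

def encode_atomic(column, operator, value, w2v):
    # E_A: one-hot(column) ++ one-hot(operator) ++ word2vec(value)
    return np.concatenate([one_hot(column, COLUMNS),
                           one_hot(operator, OPERATORS),
                           w2v(str(value))])

def encode_predicate(pred, w2v):
    # pred is a nested tuple: ("atom", col, op, val) or ("AND"|"OR", [children]).
    if pred[0] == "atom":
        return encode_atomic(pred[1], pred[2], pred[3], w2v)
    children = np.stack([encode_predicate(c, w2v) for c in pred[1]])
    # Disjunction "OR" uses max pooling; conjunction "AND" uses min pooling.
    return children.max(axis=0) if pred[0] == "OR" else children.min(axis=0)
```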
Further, S102-2, feature-encoding the histogram and the most common values in the statistics of the queried database, specifically includes:
S102-21: regrouping the histograms of all columns into the same number N of histogram buckets, approximating the new histogram boundaries on the horizontal axis by linear interpolation (the linear interpolation assumes a uniform distribution within each bucket), and then encoding the histogram with a vector of size N, where each element corresponds to one bucket and is marked with the satisfiability of the predicate;
S102-22: encoding the most-common-value feature by marking the most common value with the satisfiability of the predicate.
Illustratively, for S102-21: as shown in Fig. 4, suppose there is a histogram with 5 buckets whose boundary set on the horizontal axis is {2000, 2005, 2007, 2019, 2020, 2023}. For the predicate condition year > 2006, the query region covers half of the second bucket and all of the third, fourth, and fifth buckets, so the histogram is encoded as E_H = [0, 0.5, 1, 1, 1].
Illustratively, for S102-22: if the existing most common value is 2012, it satisfies the predicate condition year > 2006, so the most common value is encoded as E_M = [1].
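Illustratively, the histogram and most-common-value encodings for a ">" predicate can be sketched as follows; the function names are assumptions, and the example reproduces the E_H and E_M values computed above.

```python
import numpy as np

def encode_histogram_gt(bounds, threshold):
    # E_H for a predicate "col > threshold" over an equi-depth histogram
    # with boundary list `bounds` (len(bounds) == n_buckets + 1); assumes a
    # uniform distribution inside each bucket (linear interpolation).
    enc = np.zeros(len(bounds) - 1, dtype=np.float32)
    for i, (lo, hi) in enumerate(zip(bounds[:-1], bounds[1:])):
        if threshold <= lo:
            enc[i] = 1.0                               # fully satisfied bucket
        elif threshold < hi:
            enc[i] = (hi - threshold) / (hi - lo)      # partial coverage
    return enc

def encode_mcv_gt(mcv, threshold):
    # E_M: mark the most common value with the predicate's satisfiability.
    return np.array([1.0 if mcv > threshold else 0.0], dtype=np.float32)

print(encode_histogram_gt([2000, 2005, 2007, 2019, 2020, 2023], 2006))
# -> [0.  0.5 1.  1.  1. ]
print(encode_mcv_gt(2012, 2006))  # -> [1.]
```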
Further, S102-3, concatenating all feature encoding results to obtain the feature vector, specifically includes: processing all feature vectors with a Linear layer and a ReLU activation layer and concatenating them to obtain the feature vector representation E.
It should be understood that feature extraction and encoding mainly consists in, on the premise of fully considering the influencing factors of cost estimation, extracting the information of each factor that affects cost estimation by category and feature-encoding it with the corresponding method; finally, all feature vectors are concatenated as the input to the next model stage, fully paving the way for the model to learn the relationship between the execution plan and the database statistics on the one hand and the cost estimate on the other.
Further, S103 also specifies that the tree structure of the tree-structured quasi-recurrent neural network (QRNN) includes several nodes.
Further, the query optimizer model is used to learn the relationship between the features in the execution plan nodes and the database statistics and the query cost.
Further, the multi-layer perceptron (MLP) takes the output value of the root node of the tree-structured QRNN as its input value and estimates the total cost of the execution plan corresponding to the structured query statement.
Further, as shown in Fig. 5, the tree-structured quasi-recurrent neural network satisfies the following:
if the current node and an adjacent node are connected in the tree structure of the execution plan, a connection also exists between the current node and the adjacent node in the tree structure of the query optimizer model corresponding to the execution plan tree;
if the current node and an adjacent node are not connected in the tree structure of the execution plan, no connection exists between the current node and the adjacent node in the tree structure of the query optimizer model corresponding to the execution plan tree.
Further, each node of the tree-structured quasi-recurrent neural network is implemented with a quasi-recurrent neural network, as follows:
All nodes of the tree-structured quasi-recurrent neural network are divided, according to their positions, into leaf nodes, intermediate nodes, and a root node; the output of the root node is connected to the input of the multi-layer perceptron; the inputs of the root node are connected to the output of the left intermediate node and the output of the right intermediate node, respectively; the inputs of the left intermediate node are connected to the output of the left leaf node and the output of the right leaf node, respectively.
For a leaf node, the input value c_l of the quasi-recurrent neural network corresponding to the leaf node equals 0, and the input value c_r likewise equals 0; the quasi-recurrent neural network of the leaf node averages the input values c_l and c_r to obtain the memory cell c_{t-1}; the quasi-recurrent neural network of the leaf node then processes the memory cell c_{t-1} and the input sequence X to obtain the memory cell c_t of the leaf node; the input sequence X refers to the feature vector of the execution plan node corresponding to the current quasi-recurrent neural network node.
For an intermediate node, the input value c_l of the corresponding quasi-recurrent neural network equals the memory cell c_t of the left leaf node connected to the intermediate node, and the input value c_r equals the memory cell c_t of the right leaf node connected to the intermediate node; the quasi-recurrent neural network of the intermediate node averages the input values c_l and c_r to obtain the memory cell c_{t-1}, and then processes the memory cell c_{t-1} and the input sequence X to obtain the memory cell c_t of the intermediate node.
For the root node, the input value c_l of the corresponding quasi-recurrent neural network equals the memory cell c_t of the left intermediate node connected to the root node, and the input value c_r equals the memory cell c_t of the right intermediate node connected to the root node; the quasi-recurrent neural network of the root node averages the input values c_l and c_r to obtain the memory cell c_{t-1}, and then processes the memory cell c_{t-1} and the input sequence X to obtain the memory cell c_t of the root node.
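Illustratively, this bottom-up flow can be sketched as a recursive traversal; the node object with children and feature_vector attributes is an assumption, and qrnn_cell refers to a single-step QRNN cell such as the one sketched after the fo-pooling formulas below.

```python
import torch

def eval_tree(node, qrnn_cell, hidden_dim):
    # Bottom-up evaluation of the tree QRNN: a node first averages the
    # memory cells of its children (zeros for a leaf), then runs its own
    # QRNN cell on the averaged cell c_{t-1} and the node's feature vector X.
    if node.children:
        child_cells = [eval_tree(c, qrnn_cell, hidden_dim) for c in node.children]
        c_prev = torch.stack(child_cells).mean(dim=0)   # c_{t-1}
    else:
        c_prev = torch.zeros(hidden_dim)                # c_l = c_r = 0 for leaves
    c_t, _h_t = qrnn_cell(node.feature_vector, c_prev)  # memory cell c_t
    return c_t
```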
Further, there may be multiple intermediate nodes.
Further, the processing by which the leaf node's quasi-recurrent neural network obtains the leaf node's memory cell c_t from the memory cell c_{t-1} and the input sequence X, the processing by which the intermediate node's quasi-recurrent neural network obtains the intermediate node's memory cell c_t, and the processing by which the root node's quasi-recurrent neural network obtains the root node's memory cell c_t are all the same.
Further, the quasi-recurrent neural network of the leaf node processes the memory cell c_{t-1} and the input sequence X to obtain the memory cell c_t of the leaf node specifically as follows:
For the input sequence X, Z, F, and O are obtained through three convolutional layers each followed by a nonlinearity:
Z = tanh(W_z * X)
F = σ(W_f * X)
O = σ(W_o * X)
where * is the convolution operation; Z is the "input gate" output obtained with the additional kernel bank W_z; F is the "forget gate" output obtained with the additional kernel bank W_f; and O is the "output gate" output obtained with the additional kernel bank W_o;
To preserve the causality of the model, i.e., past tokens are used to predict the future, masked convolutions are adopted here: the input sequence is padded on the left with kernel_size − 1 zeros, so that only the sequence_length − kernel_size + 1 past tokens can predict a given token; kernel_size denotes the kernel size of the convolutional hidden layer and sequence_length is the sequence length.
When kernel_size = 2, the formulas become:
z_t = tanh(W_z^1 x_{t-1} + W_z^2 x_t)
f_t = σ(W_f^1 x_{t-1} + W_f^2 x_t)
o_t = σ(W_o^1 x_{t-1} + W_o^2 x_t)
In the case of sequential data, the pooling component chooses to use the "forget gate" f_t and the "output gate" o_t; this pooling operation is called fo-pooling, with the following formulas:
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ z_t
h_t = o_t ⊙ c_t
where the "forget gate" f_t determines how much of the input sequence at the current time step the quasi-recurrent neural network unit retains; the memory cell c_t determines how much of the sequence information transmitted from the previous time step the unit retains; and the "output gate" o_t determines the output of the quasi-recurrent neural network unit.
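Illustratively, a minimal single-step PyTorch sketch of the gates and fo-pooling applied at one tree node is given below; reducing the kernel_size = 2 masked convolution to a linear map of the node's own feature vector is a simplifying assumption for a single-vector input.

```python
import torch
import torch.nn as nn

class QRNNCell(nn.Module):
    # A minimal single-step sketch of the QRNN gates and fo-pooling used at
    # each tree node; dimensions and initialization are assumptions.
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.w_z = nn.Linear(input_dim, hidden_dim)
        self.w_f = nn.Linear(input_dim, hidden_dim)
        self.w_o = nn.Linear(input_dim, hidden_dim)

    def forward(self, x, c_prev):
        z = torch.tanh(self.w_z(x))      # "input gate" output Z
        f = torch.sigmoid(self.w_f(x))   # "forget gate" output F
        o = torch.sigmoid(self.w_o(x))   # "output gate" output O
        c = f * c_prev + (1 - f) * z     # fo-pooling: memory cell c_t
        h = o * c                        # hidden output h_t
        return c, h
```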
It should be understood that the quasi-recurrent neural network QRNN, as shown in Fig. 2, receives data information from the left and right nodes and learns the feature vector representation that combines the information in each node of the execution plan tree with the database statistics; in the construction of the overall model structure, it conforms to the data characteristics of the tree-structured execution plan and to the temporal order of the data. The specific process is as follows:
(1) Learning execution plan node information and database statistics based on the quasi-recurrent neural network QRNN.
To conform to the temporal order of the data in the tree-structured execution plan, the quasi-recurrent neural network QRNN receives the memory cell c_l of the left node and the memory cell c_r of the right node and, through an averaging step, obtains the memory cell c_{t-1} containing the information of its entire subtree.
So that the whole model can capture long-term dependencies and support parallel computation, the method selects the quasi-recurrent neural network (QRNN) as the main component. The QRNN architecture has two components, corresponding respectively to the convolution component and the pooling (fo-pooling) component of a convolutional neural network (CNN).
For the input sequence X, Z, F, and O are obtained through three convolutional layers each followed by a nonlinearity:
Z = tanh(W_z * X)
F = σ(W_f * X)
O = σ(W_o * X)
where * is the convolution operation; Z is the "input gate" output obtained with the additional kernel bank W_z; F is the "forget gate" output obtained with the additional kernel bank W_f; and O is the "output gate" output obtained with the additional kernel bank W_o.
To preserve the causality of the model (i.e., only past tokens may be used to predict future results), a concept called masked convolutions is used. That is, the input sequence is padded on the left with kernel_size − 1 zeros. Therefore, only the sequence_length − kernel_size + 1 past tokens can predict a given token.
When kernel_size = 2, the above formulas can be expressed as:
z_t = tanh(W_z^1 x_{t-1} + W_z^2 x_t)
f_t = σ(W_f^1 x_{t-1} + W_f^2 x_t)
o_t = σ(W_o^1 x_{t-1} + W_o^2 x_t)
In the case of sequential data, the pooling component chooses to use the "forget gate" f_t and the "output gate" o_t; this pooling operation is called fo-pooling, with the following formulas:
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ z_t
h_t = o_t ⊙ c_t
where the "forget gate" f_t determines how much of the input sequence at the current time step the quasi-recurrent neural network unit retains; the memory cell c_t determines how much of the sequence information transmitted from the previous time step the unit retains; and the "output gate" o_t determines the output of the quasi-recurrent neural network unit.
Through the above operations, the flow of data information from the leaf nodes of the execution plan to the top node of the execution plan can be learned, and a reasonable tree-structured model can be constructed.
It should be understood that a multi-layer perceptron is used to estimate the overall cost of the execution plan:
The vector C output at the very top of the tree model, which contains the information of the entire execution plan tree, is used as the input of the multi-layer perceptron (MLP) to estimate the cost of the execution plan. The execution plan data set is first split into a training set and a test set. On the training set, the actual cost in the execution plan is taken as the ground-truth value and the output of the multi-layer perceptron as the estimated cost, and the loss is then computed with the first loss function Loss described below.
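Illustratively, the MLP cost head and one training step can be sketched as follows; the layer sizes are assumptions, and since the formula of the first loss function is not reproduced in the text, mean squared error is used here purely as a stand-in.

```python
import torch
import torch.nn as nn

hidden_dim = 128  # assumed size of the root vector C

# Cost head: maps the root vector C to a scalar estimated_cost.
mlp = nn.Sequential(
    nn.Linear(hidden_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
loss_fn = nn.MSELoss()  # placeholder for the unreproduced first loss
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)

def training_step(root_vector_c, cost):
    estimated_cost = mlp(root_vector_c).squeeze(-1)
    loss = loss_fn(estimated_cost, cost)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```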
In the embodiment of the present invention, the query optimizer model is obtained by training with historical execution plans as sample data and the actual query costs corresponding to the historical execution plans as target values, including:
S103-1: constructing a training set, where the training set consists of historical execution plans whose actual query costs are known;
S103-2: inputting the training set into the query optimizer model and training the model; when the value of the model's first loss function no longer decreases, stopping the training to obtain the preliminarily trained query optimizer model.
In the embodiment of the present invention, the first loss function refers to a loss Loss computed between estimated_cost and cost,
where estimated_cost is the result of the cost prediction and cost is the target cost value.
In the embodiment of the present invention, the method further includes:
S103-3: when the training set is updated, obtaining the updated training set, inputting the updated training set into the preliminarily trained query optimizer model, and training the model again; when the value of the model's second loss function no longer decreases, stopping the training to obtain the finally trained query optimizer model; the second loss function constrains the network weights with the Fisher information matrix and L2 regularization.
In the embodiment of the present invention, the second loss function refers to:
L(θ) = L_B(θ) + Σ_i (λ/2) · F_i · (θ_i − θ*_{A,i})²
where L_B(θ) denotes the loss function used to directly train the old training set before the EWC algorithm is added, i.e., the loss function mentioned in the tree-structured quasi-recurrent neural network part; F_i denotes the Fisher information matrix; θ_i denotes the network weights when training the new data set B; θ*_{A,i} denotes the optimal weights obtained after training on the old training set A; λ is the L2 regularization coefficient; and L(θ) is the second loss function.
Here, (x, y)_i is (estimated_cost, cost)_i from the old task A, and Loss(θ | (x, y)_i) is the loss function Loss described above; for each parameter θ, the gradient is computed, all gradients are accumulated, and the sum is finally divided by the number of samples to obtain the Fisher information matrix entry of the corresponding parameter.
It should be understood that S103-3 is the incremental learning part for dynamic data, which mainly adopts the Elastic Weight Consolidation (EWC) algorithm. As shown in Fig. 3, the algorithm measures how important the network weights are to the old task and derives an importance matrix, namely the Fisher information matrix. To ensure that the weights in the network that are important to the old task change little when training on the newly added training set is completed, the algorithm, when training the newly added data set, adds an L2 regularization term combined with the Fisher information matrix to constrain the network weights that are important to the old task.
Fisher information is a metric of the amount of information. If we model the distribution of a random variable x with parameter θ, then the Fisher information measures the amount of information about θ carried by x. That is, on the distribution, if the function near some point θ is very steep, the Fisher information of x is high and a good estimate can be made from a small number of points; if the function near some point θ is flat, the Fisher information of x is low and many points must be sampled to make a good estimate. This suggests starting from the variance of the random variable: for a true parameter θ, the Fisher information of the score function s can be defined as the variance of the score function:
I(θ) = Var(s(θ)), with s(θ) = ∂ log f(x; θ) / ∂θ
Given the above definition, next, when training on the old training set, the square of the gradient is computed for each network weight to obtain the Fisher information matrix entry of each parameter:
F_i = (1/N) Σ_{j=1}^{N} (∂ Loss(θ | (x, y)_j) / ∂θ_i)²
Specifically, samples can be fed into the model one by one, the loss function computed, and the gradients computed automatically by the neural network framework. For each parameter, all gradients are accumulated and finally divided by the number of samples to obtain the Fisher information matrix entry of the corresponding parameter.
The network weights are constrained with L2 regularization as follows.
A model using the EWC algorithm maintains the current parameters, the old-task parameters, and the Fisher information matrix.
When a new task is trained, data are fed into the model to produce cost predictions, and the model adds a new loss term to the original loss function; this loss term is computed from the model's Fisher information matrix and the old-task parameters. Assuming the old task is A, the formula of the loss function when training the new task B with EWC is as follows:
L(θ) = L_B(θ) + Σ_i (λ/2) · F_i · (θ_i − θ*_{A,i})²
where L_B(θ) denotes the loss function used to directly train the old training set before the EWC algorithm is added, i.e., the loss function mentioned in the tree-structured quasi-recurrent neural network part; F_i denotes the Fisher information matrix; θ_i denotes the network weights when training the new data set B; θ*_{A,i} denotes the optimal weights obtained after training on the old training set A; and λ is the L2 regularization coefficient.
After training on the new data set is completed, the current parameters of the model have been updated; at this point, the old-task parameters are updated to the current parameters for use in the next round of training. Meanwhile, the model uses the current parameters to make predictions on part or all of the data, computes the loss value with the original loss function, computes the gradients, and uses these computed gradients to replace the model's Fisher information matrix.
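Illustratively, the diagonal Fisher estimate and the EWC-regularized loss can be sketched as follows; model stands for any cost model with standard PyTorch parameters (here, the tree QRNN plus the MLP head), and the helper names are assumptions.

```python
import torch

def fisher_diagonal(model, loss_fn, samples):
    # Diagonal Fisher estimate over the old task A: accumulate the squared
    # gradient of the loss for every parameter, then divide by the sample count.
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in samples:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
    return {n: f / len(samples) for n, f in fisher.items()}

def ewc_loss(model, base_loss, fisher, old_params, lam):
    # L(theta) = L_B(theta) + sum_i (lam / 2) * F_i * (theta_i - theta*_A,i)^2
    penalty = sum((fisher[n] * (p - old_params[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return base_loss + (lam / 2.0) * penalty
```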
Embodiment 2
This embodiment provides a database query optimization processing system based on a tree QRNN.
The tree-QRNN-based database query optimization processing system includes:
an acquisition module configured to obtain an input structured query statement;
a determination module configured to determine the execution plan corresponding to the structured query statement and obtain the tree structure of the execution plan, where each node in the execution plan tree includes a table, an operation type, and/or a predicate condition, and to convert the information corresponding to each node in the execution plan tree into a feature vector;
a processing module configured to construct and train a query optimizer model, where the query optimizer model includes a tree-structured quasi-recurrent neural network and a multi-layer perceptron; the tree-structured quasi-recurrent neural network has a tree structure whose nodes correspond one-to-one to the nodes of the execution plan tree; each node of the tree-structured quasi-recurrent neural network is implemented with a quasi-recurrent neural network, and the root node of its tree structure is connected to the multi-layer perceptron; during training, the query optimizer model takes historical execution plans as sample data and the actual query costs corresponding to the historical execution plans as target values;
an output module configured to input the feature vector of each node in the execution plan tree into the corresponding node of the tree-structured quasi-recurrent neural network of the query optimizer model to obtain the query cost output by the multi-layer perceptron, and to determine the target execution plan based on the query cost.
It should be noted here that the above acquisition module, determination module, processing module, and output module correspond to steps S101 to S104 in Embodiment 1; the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to the content disclosed in Embodiment 1. It should be noted that, as part of the system, the above modules may be executed in a computer system such as a set of computer-executable instructions.
The descriptions of the embodiments above each have their own emphasis; for parts not described in detail in a given embodiment, reference may be made to the relevant descriptions of the other embodiments.
The proposed system may be implemented in other ways. For example, the system embodiments described above are merely illustrative; the division into the above modules is only a division by logical function, and in actual implementation there may be other ways of division; for example, multiple modules may be combined or integrated into another system, or some features may be omitted or not executed.
Embodiment 3
This embodiment also provides an electronic device, including one or more processors, one or more memories, and one or more computer programs, where the processors are connected to the memories, the above one or more computer programs are stored in the memories, and, when the electronic device runs, the processors execute the one or more computer programs stored in the memories so that the electronic device performs the method described in Embodiment 1.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), an off-the-shelf field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include a read-only memory and a random access memory and provide instructions and data to the processor; a part of the memory may also include a non-volatile random access memory. For example, the memory may also store information on the device type.
In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software.
The method of Embodiment 1 may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules within the processor. The software modules may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered as going beyond the scope of the present invention.
Embodiment 4
This embodiment also provides a computer-readable storage medium for storing computer instructions, where the computer instructions, when executed by a processor, complete the method described in Embodiment 1.
The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, various modifications and changes to the present invention are possible. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310120892.8A CN116150190A (en) | 2023-02-15 | 2023-02-15 | Database query optimization processing method and system based on tree QRNN |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310120892.8A CN116150190A (en) | 2023-02-15 | 2023-02-15 | Database query optimization processing method and system based on tree QRNN |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116150190A (en) | 2023-05-23 |
Family
ID=86353958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310120892.8A Pending CN116150190A (en) | 2023-02-15 | 2023-02-15 | Database query optimization processing method and system based on tree QRNN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116150190A (en) |
2023
- 2023-02-15: CN application CN202310120892.8A, publication CN116150190A (en), legal status: Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2997797A1 (en) * | 2017-09-11 | 2019-03-11 | Tata Consultancy Services Limited | Bilstm-siamese network based classifier for identifying target class of queries and providing responses thereof |
CN109447152A (en) * | 2018-10-29 | 2019-03-08 | 中国石油大学(华东) | A Fault Prediction Method Based on Monte Carlo Tree Search and Neural Network |
CN109933602A (en) * | 2019-02-28 | 2019-06-25 | 武汉大学 | Method and device for converting natural language and structured query language |
CN113515539A (en) * | 2021-06-02 | 2021-10-19 | 清华大学 | A method of querying data in a database |
CN115033597A (en) * | 2022-06-29 | 2022-09-09 | 上海汇付支付有限公司 | Method and system for deep learning to participate in HTAP database to perform SQL optimization |
Non-Patent Citations (2)
Title |
---|
SHAYAN HASSANTABAR et al.: "CURIOUS: Efficient Neural Architecture Search Based on a Performance Predictor and Evolutionary Search", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 11, 1 February 2022 (2022-02-01) *
李苍 (LI Cang): "Research on SQL Injection Detection Technology Based on Naive Bayes and LSTM Recurrent Neural Networks", China Outstanding Master's Graduate Theses Collection, Information Science and Technology Volume, 15 January 2019 (2019-01-15) *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117874071A (en) * | 2023-12-28 | 2024-04-12 | 电子科技大学长三角研究院(衢州) | Database query optimizer cost model based on machine learning |
CN118469618A (en) * | 2024-07-11 | 2024-08-09 | 四川大学 | A component cost estimation method based on long short-term memory network |
CN118469618B (en) * | 2024-07-11 | 2024-09-24 | 四川大学 | A component cost estimation method based on long short-term memory network |
CN118656395A (en) * | 2024-08-19 | 2024-09-17 | 腾讯科技(深圳)有限公司 | A query processing method, device, equipment and readable storage medium |
CN118656395B (en) * | 2024-08-19 | 2024-11-15 | 腾讯科技(深圳)有限公司 | Query processing method, device, equipment and readable storage medium |
CN119415973A (en) * | 2025-01-03 | 2025-02-11 | 腾讯科技(深圳)有限公司 | Data processing methods, devices, products and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116150190A (en) | Database query optimization processing method and system based on tree QRNN | |
Yuan et al. | Automatic view generation with deep learning and reinforcement learning | |
CN111581454B (en) | Parallel query performance prediction system and method based on depth map compression algorithm | |
Marcus et al. | Bao: Learning to steer query optimizers | |
US8874548B2 (en) | Predicting query execution time | |
CN115617830A (en) | Data query optimization processing method and device based on machine learning | |
CN112434024B (en) | Relational database-oriented data dictionary generation method, device, equipment and medium | |
CN111444220A (en) | Cross-platform SQL query optimization method combining rule driving and data driving | |
CN113111058B (en) | Database processing method and device | |
CN116244333B (en) | Database query performance prediction method and system based on cost factor calibration | |
US20210097044A1 (en) | Systems and methods for designing data structures and synthesizing costs | |
CN116937559A (en) | Power system load forecasting system and method based on recurrent neural network and tensor decomposition | |
CN112883066B (en) | Method for estimating multi-dimensional range query cardinality on database | |
Li et al. | A resource-aware deep cost model for big data query processing | |
CN119719424A (en) | Data lineage analysis method, device, terminal equipment and computer program product | |
CN119886143A (en) | Method and system for aligning time sequence knowledge graph entities of enterprise | |
CN103150626A (en) | Program dependence graph-based BPEL (Business Process Execution Language) process consistency measurement method | |
CN113220820B (en) | Efficient SPARQL query response method, device and equipment based on graph | |
CN119539145A (en) | A method and system for predicting carbon emissions in the power industry based on XGBoost algorithm | |
CN119294579A (en) | A node optimization method and device for a traffic network | |
CN116861373A (en) | Query selectivity estimation method, system, terminal equipment and storage medium | |
CN117113075A (en) | Query optimization field cost estimation method based on graph attention network | |
CN115034514A (en) | Small sample time sequence prediction method and system based on meta-learning | |
Liu et al. | MSP: Learned Query Performance Prediction Using MetaInfo and Structure of Plans | |
Yan et al. | QCFE: an efficient feature engineering for query cost estimation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||