
CN115033597B - Method and system for optimizing SQL execution in HTAP database using deep learning - Google Patents


Info

Publication number: CN115033597B (granted publication of application CN202210752471.2A)
Authority: CN (China)
Legal status: Active (granted)
Other versions: CN115033597A (application publication, Chinese (zh))
Prior art keywords: execution, sql, execution plan, deep learning, learning model
Inventors: 周晔, 穆海洁, 霍通
Original and current assignee: Shanghai Huifu Payment Co ltd
Application filed by Shanghai Huifu Payment Co ltd
Priority: CN202210752471.2A

Classifications

    • G06F16/2453: Query optimisation (under G06F16/245 Query processing; G06F16/24 Querying; G06F16/20 Information retrieval of structured data, e.g. relational data; G06F Electric digital data processing; G Physics)
    • G06F16/2445: Data retrieval commands; View definitions (under G06F16/2433 Query languages; G06F16/242 Query formulation)
    • G06N3/08: Learning methods (under G06N3/02 Neural networks; G06N3/00 Computing arrangements based on biological models; G06N Computing arrangements based on specific computational models)


Abstract


The present invention discloses a method and system in which deep learning participates in SQL optimization for an HTAP database, avoiding manual optimization of the search engine, improving query speed, and reducing memory occupation, thereby improving machine performance. The technical solution is as follows: an SQL parser receives SQL execution statements and checks whether they meet the specification; the execution plan module generates an execution plan through a deep learning model based on the SQL statements that passed the parser's check; the engine executes the generated plan and sends each SQL execution result, together with detailed information about the SQL execution process, to the deep learning model's training set; the deep learning model learns and trains on that set and feeds the result back to the execution plan module as a best optimization scheme; after receiving the feedback, the execution plan module automatically optimizes the execution plan and applies it to the next SQL query.

Description

Method and system for deep learning to participate in SQL optimization of HTAP database
Technical Field
The invention relates to database technology, and in particular to a method and a system in which deep learning participates in SQL optimization for an HTAP database.
Background
Before the rise of the internet, enterprise data volumes were generally small, especially for core business data, and a single database could usually hold them. Storage at that time did not require a complex architecture: all online requests (OLTP, Online Transactional Processing) and background analysis (OLAP, Online Analytical Processing) ran on the same database instance.
With the development of the internet, enterprise business data volumes have grown continuously, and the capacity limits of a single database restrict its use in massive-data scenarios. In practice, OLTP and OLAP are therefore technically separated to meet their differing demands, and in many enterprise architectures the two types of task are handled by different teams. As big data from the Internet of Things is applied ever more deeply, massive sensor data must be updated and queried in real time, performance requirements on the database grow ever higher, and new problems arise.
There is usually a delay of minutes or even hours between OLAP and OLTP systems, so consistency between the OLAP and OLTP databases cannot be ensured, making it difficult to satisfy business scenarios with strict real-time analysis requirements. Furthermore, enterprises must maintain different databases to support the two types of task, which is costly to manage and maintain.
Thus, a database capable of uniformly supporting both transactional and analytical workloads is a requirement of many enterprises. In this context, HTAP (Hybrid Transactional/Analytical Processing), proposed by Gartner, is desirable. Based on an innovative compute-storage framework, an HTAP database can simultaneously support a business system and OLAP scenarios on a single copy of the data, avoiding the large amount of data movement between online and offline databases in traditional architectures. In addition, HTAP is based on a distributed architecture, supports elastic scaling, can expand throughput or storage as needed, and can easily handle high-concurrency and massive-data scenarios.
Currently, few databases implement HTAP; the main ones are PingCAP's TiDB, Alibaba Cloud's HybridDB for MySQL, Baidu's BaikalDB, and the like.
In big data scenarios, beyond HTAP and cloud-native technology (to ensure dynamic scalability), search engines face increasingly high requirements for optimized SQL execution. Not only is speed required, but machine performance, such as memory and I/O, must also be guaranteed. Especially in OLAP cases, the data to be pulled is very large, often at the TB level, and memory is frequently insufficient.
In summary, in existing schemes, under HTAP and large-data-volume scenarios, memory occupation is too high, the search engine must be optimized manually, the time and space complexity of successive queries show no improvement, and automatic optimization is impossible.
Disclosure of Invention
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
The invention aims to solve the above problems by providing a method and a system in which deep learning participates in SQL optimization for an HTAP database, using a deep learning model to realize automatic optimization of SQL queries, avoiding manual optimization of the search engine, improving query speed, and reducing memory occupation, thereby improving machine performance.
The technical scheme of the invention is as follows: the invention discloses a method in which a deep learning model participates in SQL optimization in an HTAP database, comprising the following steps:
Step 1: an SQL parser receives an SQL execution statement and checks whether it meets the specification;
Step 2: the execution plan module generates an execution plan through a deep learning model, based on the SQL statement that passed the parser's specification check;
Step 3: the engine executes the execution plan generated in step 2 and sends the SQL execution result and detailed information about the SQL execution process to the deep learning model's training set;
Step 4: the deep learning model learns and trains on the training set, then feeds the result of learning and training back to the execution plan module in the form of a best optimization scheme; after receiving the feedback, the execution plan module automatically optimizes the execution plan and applies it to the next SQL query.
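The four-step loop above can be sketched in a few lines of Python. This is an illustrative toy, not the patent's implementation: all class names, the fixed plan costs, and the "pick the cheapest observed plan" feedback rule are assumptions standing in for the deep learning model.

```python
# Hypothetical sketch of the parse -> plan -> execute -> feed-back loop.
class SQLParser:
    def check(self, sql: str) -> bool:
        # Step 1: stand-in specification check (a real parser validates grammar).
        return sql.strip().upper().startswith(("SELECT", "INSERT", "UPDATE", "DELETE"))

class DeepModel:
    def __init__(self):
        self.training_set = []        # execution results + process details
        self.best_plan_hint = None

    def train_and_feedback(self):
        # Step 4 (simplified): treat the cheapest observed plan as the
        # "best optimization scheme" to feed back to the planner.
        if self.training_set:
            self.best_plan_hint = min(self.training_set,
                                      key=lambda r: r["cost"])["plan"]

class PlanModule:
    def __init__(self, model: DeepModel):
        self.model = model

    def generate(self, sql: str) -> str:
        # Step 2: prefer the model's feedback if any exists.
        return self.model.best_plan_hint or "row_scan"

def execute(sql: str, plan: str, model: DeepModel) -> None:
    # Step 3: execute, then send result + process details to the training set.
    cost = 10.0 if plan == "row_scan" else 3.0     # made-up costs
    model.training_set.append({"sql": sql, "plan": plan, "cost": cost})

model = DeepModel()
planner = PlanModule(model)
parser = SQLParser()

sql = "SELECT id FROM t"
assert parser.check(sql)
execute(sql, planner.generate(sql), model)          # first run: default plan
execute(sql, "column_scan", model)                  # a second observed plan
model.train_and_feedback()
print(planner.generate(sql))                        # next query uses feedback
```

After the feedback step, the planner chooses the cheaper `column_scan` plan for the next query, mirroring how the execution plan module applies the model's feedback.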
According to one embodiment of the method for deep learning to participate in the SQL optimization of the HTAP database, the execution plan generated in step 2 is divided into a row-based plan and a column-based plan, and the engines in step 3 are divided into a row-based execution engine to execute the row-based plan and a vectorized execution engine to execute the column-based plan.
According to an embodiment of the method in which deep learning participates in SQL optimization for the HTAP database, generating the execution plan based on the deep learning model in step 2 further comprises:
The execution plan module uses the input layer, hidden layer, and output layer of a recurrent neural network structure: the input layer takes the SQL execution results and detailed information about the SQL execution process, the hidden layer performs the cost calculation, and the output layer outputs the SQL statement and an optimal execution plan.
According to one embodiment of the method in which deep learning participates in SQL optimization for the HTAP database, in the hidden layer's cost calculation, cost estimation is performed on the SQL execution results and the detailed execution-process information; the loss-function parameters are continuously adjusted during loop iteration to reduce the cost, and when the reduced cost is better than the expected value of the cost estimate, the iteration's optimization result is built into the defined execution-plan data structure and fed back to the execution plan module.
According to an embodiment of the method for performing SQL optimization by the deep learning participation HTAP database of the present invention, the step 4 further comprises:
the method comprises the steps that interaction agreements are preset between an execution plan module and a deep learning model, and data structures of the execution plan are defined in advance;
The deep learning model learns and trains the training set data sent by the engine in the step 3, quantifies and designs the feedbacks aiming at the feedbacks, and finally feeds back the results of model learning and training to the execution plan module, so that the execution plan module directly modifies the data structure of the corresponding execution plan in the abstract grammar tree in the memory to optimize after receiving the feedback of the deep learning model, and ensures that the next query can be applied to the optimized feedback.
The invention also discloses a system for deep learning to participate in the SQL optimization of the HTAP database, which comprises:
the SQL parser, configured to receive an SQL execution statement and check whether it meets the specification;
the execution plan module, configured to generate an execution plan through the deep learning model based on the SQL statements that passed the parser's specification check;
The engine module is used for executing the generated execution plan and sending the SQL execution result and the detailed information of the SQL execution process to the training set of the deep learning model;
and the deep learning model, configured to learn and train on the training set and feed the result of learning and training back to the execution plan module as a best optimization scheme, so that after receiving the feedback the execution plan module automatically optimizes the execution plan and applies it to the next SQL query.
According to one embodiment of the system for deep learning participation in SQL optimization of an HTAP database of the present invention, the execution plans generated by the execution plan module are divided into row-based plans and column-based plans, and the engine module is divided into a row-based execution engine to execute the row-based plans and a vectorized execution engine to execute the column-based plans.
According to an embodiment of the system in which deep learning participates in SQL optimization for the HTAP database, generation of the execution plan by the execution plan module based on the deep learning model further comprises:
The execution plan module uses the input layer, hidden layer, and output layer of a recurrent neural network structure: the input layer takes the SQL execution results and detailed information about the SQL execution process, the hidden layer performs the cost calculation, and the output layer outputs the SQL statement and an optimal execution plan.
According to an embodiment of the system in which deep learning participates in SQL optimization for the HTAP database, in the hidden layer's cost calculation the execution plan module performs cost estimation on the SQL execution results and the detailed execution-process information; the loss-function parameters are continuously adjusted during loop iteration to reduce the cost, and when the reduced cost is better than the expected value of the cost estimate, the iteration's optimization result is built into the defined execution-plan data structure and fed back to the execution plan module.
According to an embodiment of the system for deep learning participation HTAP databases to perform SQL optimization of the present invention, the deep learning model is further configured to:
An interaction protocol is agreed in advance between the execution plan module and the deep learning model, and the data structures of the execution plan are defined in advance;
The deep learning model learns and trains on the training-set data sent by the engine module, quantifies and designs the points to feed back, and finally feeds the result of learning and training back to the execution plan module, so that after receiving the feedback the execution plan module directly modifies the data structure of the corresponding execution plan in the in-memory abstract syntax tree to optimize it, ensuring that the next query benefits from the optimized feedback.
Compared with the prior art, the scheme has the following beneficial effects: for an HTAP database, after each SQL execution the result and the execution process are sent to the deep learning model for training and optimization-scheme prediction; the prediction result is fed back to the execution plan module, which automatically optimizes the generated plan according to the feedback and applies it to the next query. Once the feedback takes effect, the whole SQL execution flow becomes more efficient and occupies less memory.
Drawings
The above features and advantages of the present invention will be better understood after reading the detailed description of embodiments of the present disclosure in conjunction with the following drawings. In the drawings, the components are not necessarily to scale and components having similar related features or characteristics may have the same or similar reference numerals.
FIG. 1 illustrates a flow chart of one embodiment of a method of the present invention for deep learning participation in an HTAP database to perform SQL optimization.
FIG. 2 illustrates a schematic diagram of one embodiment of a system of the present invention for deep learning participation in an HTAP database to perform SQL optimization.
Detailed Description
The invention is described in detail below with reference to the drawings and the specific embodiments. It is noted that the aspects described below in connection with the drawings and the specific embodiments are merely exemplary and should not be construed as limiting the scope of the invention in any way.
FIG. 1 illustrates a flow of one embodiment of a method of the present invention for deep learning participation in an HTAP database to perform SQL optimization. Referring to fig. 1, the following is a detailed description of the implementation steps of the method for performing SQL optimization for the deep learning participation HTAP database of the present embodiment.
Step 1: the SQL parser receives the SQL execution statement and checks whether it complies with the specification; this typically includes checking whether the SQL format and syntax are correct.
Step 2: the execution plan module generates an execution plan through the deep learning model based on the SQL statement that passed the parser's specification check.
As shown in FIG. 1, the execution plan module first generates a logical-layer plan and then a physical-layer plan from it. Since the HTAP database mixes row and column storage, the physical-layer plan is divided into a row-based plan and a column-based plan.
The specific process of the execution plan module generating an execution plan based on the deep learning model is as follows.
The execution plan module uses a recurrent neural network structure (comprising an input layer, a hidden layer, and an output layer): the input layer takes the SQL execution results and detailed information about the SQL execution process, the hidden layer performs the cost calculation, and the output layer outputs the SQL statement and an optimal execution plan.
In the hidden layer's cost calculation, cost estimation is performed on the SQL execution results and the detailed execution-process information; the loss-function parameters are continuously adjusted during loop iteration (the iteration proceeds for a preset number of rounds) to reduce the cost, and when the reduced cost is better than the expected value of the cost estimate, the iteration's optimization result is built into the defined execution-plan data structure and fed back to the execution plan module.
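A minimal sketch of this hidden-layer cost loop, under stated assumptions: the recurrent update rule, the mapping from hidden state to cost, and the derivative-free parameter adjustment are all illustrative stand-ins for the patent's unspecified network and loss function.

```python
# Toy recurrent cost calculation: the input layer feeds execution details as a
# feature sequence, the hidden layer accumulates a cost estimate, and the loop
# adjusts a parameter until the cost beats the expected value.
import math

def hidden_cost(features, w):
    # Recurrent accumulation: h_t = tanh(w * x_t + h_{t-1})
    h = 0.0
    for x in features:
        h = math.tanh(w * x + h)
    return abs(h) * 100.0            # map hidden state to a cost figure

def optimize(features, expected_cost, iters=100):
    w, step = 1.0, 0.5
    best = hidden_cost(features, w)
    for _ in range(iters):                       # preset iteration count
        for cand in (w - step, w + step):        # adjust parameter to cut cost
            c = hidden_cost(features, cand)
            if c < best:
                best, w = c, cand
        step *= 0.9
        if best < expected_cost:                 # better than the expectation:
            break                                # hand the result back
    return best

best = optimize([0.4, 0.9, 0.2], expected_cost=20.0)
print(best < 20.0)
```

The early exit when `best < expected_cost` corresponds to the point where the iteration's result is packed into the execution-plan data structure and fed back to the plan module.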
The cost estimation is divided into two stages: a cost-estimation analysis stage and a learning-optimization stage.
The cost-estimation analysis stage covers a number of execution cases, such as full-table-scan estimation, ordinary-index estimation, and multi-table join queries. The deep learning model makes an overall estimate across these different cases to obtain an expected value of the cost, which covers I/O cost (number of I/O operations), CPU computation cost, memory occupation cost, and so on.
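The cost components named here (I/O count, CPU computation, memory occupation) can be illustrated with back-of-the-envelope formulas. The per-case formulas and weights below are assumptions for illustration only; the patent names the components but not their computation.

```python
# Illustrative cost-estimation figures for two of the execution cases above.
def estimate_cost(rows, row_bytes, case, page_bytes=8192):
    if case == "full_table_scan":
        io = rows * row_bytes // page_bytes + 1     # pages read from disk
        cpu = rows                                   # one predicate check per row
    elif case == "index_lookup":
        io = 4                                       # root -> leaf + data page
        cpu = 40                                     # comparisons along the path
    else:
        raise ValueError(case)
    memory = min(rows * row_bytes, 16 * page_bytes)  # buffer pages pinned
    # Weighted sum stands in for the model's overall expected cost value.
    return {"io": io, "cpu": cpu, "memory": memory,
            "total": io * 1.0 + cpu * 0.01 + memory * 0.0001}

scan = estimate_cost(100_000, 100, "full_table_scan")
probe = estimate_cost(100_000, 100, "index_lookup")
print(scan["total"] > probe["total"])   # a point lookup should come out cheaper
```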
In the subsequent learning-optimization stage, the weight parameters of the deep learning model are continuously optimized with the recurrent neural network according to the model's training-set data and compared against the expected values from the cost-estimation analysis stage; if the learning result is better than those expected values, it is fed back to the execution plan module after a certain number of learning rounds, and iterative optimization continues.
The learning-optimization stage is decoupled from SQL execution: each SQL execution result and its detailed process information are sent to the deep learning model's training set asynchronously, and the recurrent neural network feeds back to the execution plan module after a certain number of optimization rounds.
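The asynchronous decoupling described here is essentially a producer-consumer handoff: the engine pushes execution records onto a queue and never waits on training. A minimal sketch with Python's standard library (names and record fields are illustrative):

```python
# Engine side produces execution records; a background trainer drains them,
# so SQL execution is decoupled from model training.
import queue
import threading

training_set = queue.Queue()
collected = []                      # stand-in for the model's training data

def trainer():
    while True:
        record = training_set.get()
        if record is None:          # sentinel: stop the trainer
            break
        collected.append(record)    # stand-in for one training step

t = threading.Thread(target=trainer, daemon=True)
t.start()

for i in range(3):                  # engine side: fire-and-forget sends
    training_set.put({"sql_id": i, "elapsed_ms": 5 * i, "plan": "row_scan"})
training_set.put(None)              # signal shutdown
t.join()
print(len(collected))
```

Because `put` returns immediately, each SQL execution finishes without blocking on the trainer, matching the decoupling the text describes.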
The generated execution plan generally falls into the following cases:
1. Useless-plan matching: sometimes there is only one way to execute the query. For example, a heap table can only be read through a table scan. To avoid wasting time optimizing such queries, SQL Server maintains a list of these plans to choose from; if the optimization stage finds a plan in the list that matches the query, a similar plan is generated without any optimization.
2. Multistage optimization: for complex queries, the number of alternative processing strategies to analyze can be large, and evaluating every choice can take a long time. The optimization stage therefore does not analyze all possible processing strategies but divides them into several configurations, each containing different index and join techniques.
Index variants consider different index characteristics: single-column indexes, compound indexes, index column order, index density, and so on. Likewise, join variants consider the different join techniques available in the engine: nested-loop joins, merge joins, and hash matches.
Learning optimization considers statistics on the columns referenced in the WHERE clause to evaluate the effectiveness of the index and join strategies; these statistics are used to evaluate configuration overhead across the optimization stages, including factors such as the CPU, memory usage, and disk I/O required to run the query. After each optimization stage, the cost of the processing strategy is evaluated; if the cost is economical enough, the loop through further optimization stages stops and the optimization process exits, otherwise the loop continues in order to determine a cost-effective processing strategy.
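The staged loop with an early exit once a strategy is "economical enough" can be sketched directly. The configurations and their costs below are made up for illustration; only the control flow mirrors the text.

```python
# Evaluate one configuration per optimization stage; stop as soon as a
# strategy's cost falls under the "economical enough" threshold.
def choose_strategy(configurations, good_enough):
    best = None
    for config in configurations:            # each stage = one configuration
        cost = config["cost"]                # stand-in for CPU+memory+I/O eval
        if best is None or cost < best["cost"]:
            best = config
        if best["cost"] <= good_enough:      # sufficiently economical:
            break                            # exit the optimization loop early
    return best

configs = [
    {"name": "nested_loop+single_col_index", "cost": 120},
    {"name": "merge_join+compound_index",    "cost": 35},
    {"name": "hash_match+covering_index",    "cost": 20},
]
print(choose_strategy(configs, good_enough=50)["name"])
```

With a threshold of 50, the loop exits at the second stage without ever costing out the third configuration, which is exactly the time-saving behavior multistage optimization is after.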
Step 3: the engine executes the execution plan generated in step 2. The engine comprises a row-based execution engine (Row Based Execution Engine, which executes the row-based plan) and a vectorized execution engine (Vectorized Execution Engine, which executes the column-based plan), and it sends the SQL execution result and detailed information about the SQL execution process to the deep learning model's training set.
During execution, the engine places data (intermediate data and caches generated during SQL execution, such as data pages, index pages, and the redo log) into the row buffer pool (Row Buffer Pool) and the column index buffer pool (Column Index Buffer Pool) in the storage engine. After each SQL execution finishes, the SQL execution result and the detailed execution-process information are sent to the deep learning model's training set.
The row-based execution engine mainly handles transaction-related execution plans. It selects operators of different modules according to the execution plan (such as full-table-scan and nested-loop-join operators) and determines which tables must be operated on and which APIs (application program interfaces) must be called during execution.
Full-table scan in the row-based execution engine scans the table specified by the execution plan; the expression framework generates the corresponding instructions and calls the corresponding processing APIs until the required data is found.
Nested-loop join in the row-based execution engine joins an outer data set to an inner data set: for each row of the outer data set, the database matches all rows in the inner data set that satisfy the predicate conditions. If an index is available on the inner data set or inner table, the database uses it to locate the rowid and fetch the data.
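A minimal nested-loop join in this spirit, with a dict keyed on the join column standing in for the inner-side index and rowid lookup (the table shapes and names are illustrative):

```python
# For every outer row, match inner rows on the join key; the dict "index"
# avoids rescanning the inner data set for each outer row.
def nested_loop_join(outer, inner, key):
    index = {}                                  # inner-side "index" on the key
    for row in inner:
        index.setdefault(row[key], []).append(row)
    result = []
    for o in outer:                             # each row of the outer data set
        for i in index.get(o[key], []):         # index lookup instead of a scan
            result.append({**o, **i})
    return result

orders = [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 7}]
custs  = [{"cust": 1, "name": "a"}, {"cust": 3, "name": "c"}]
rows = nested_loop_join(orders, custs, "cust")
print(rows)   # only cust 1 appears in both sides
```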
The vectorized execution engine mainly handles execution plans related to data analysis; most such requests are queries and analyses, and because the data storage is multidimensional, analysis is very convenient. The traditional transactional model is a two-dimensional data model that is inconvenient to operate on and analyze when multiple sets of tables are involved. In addition, the expression framework generates dedicated SIMD instructions (single instruction, multiple data), and the corresponding column-index access APIs are called, improving overall analysis efficiency.
Full-table scan in the vectorized execution engine: because the scan runs on a columnar database, only the needed column data has to be extracted for analysis, greatly reducing the data volume of the scan.
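The column-pruning point can be shown with a toy column-oriented table: the scan touches only the arrays it asks for. Table contents and sizes are made up for illustration.

```python
# Column-oriented storage: one array per column, so a scan can pull only the
# columns the query needs and skip wide columns entirely.
table = {
    "id":     list(range(1000)),
    "amount": [i % 50 for i in range(1000)],
    "note":   ["x" * 100] * 1000,            # wide column an OLAP query can skip
}

def column_scan(table, needed):
    return {c: table[c] for c in needed}     # read only the requested columns

pruned = column_scan(table, ["amount"])
print(len(pruned), "of", len(table), "columns read")
```

In a row store, every row carries the wide `note` payload through the scan; here it is never touched.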
Nested-loop join in the vectorized execution engine: following the execution plan and making full use of the columnar database's characteristics, for each row of the outer data set only the needed columns are associated, and the corresponding column data is reassembled.
Hash join in the vectorized execution engine is mainly divided into two phases: a build phase and a probe phase. The build phase selects one table (typically the smaller one, to reduce the time and space needed for the hash table) and applies a hash function to the join attribute of each tuple to obtain a hash value, thereby building a hash table. The probe phase scans each row of the other table, computes the hash value of its join attribute, and compares it against the hash table built earlier; if the rows fall into the same bucket and the join predicate is satisfied, they are joined into a new table. When memory is large enough, the whole hash table is built in memory and written to disk after the join completes, though this process also incurs many I/O operations.
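The two phases map directly onto code: build a hash table on the smaller table's join attribute, then probe it with the larger table, checking the join predicate inside each bucket. Table names and contents are illustrative.

```python
# Build/probe hash join: hash the smaller side, then probe with the other.
def hash_join(small, large, key):
    hash_table = {}
    for row in small:                        # build phase: smaller table
        hash_table.setdefault(hash(row[key]), []).append(row)
    joined = []
    for row in large:                        # probe phase: scan the other table
        for cand in hash_table.get(hash(row[key]), []):
            if cand[key] == row[key]:        # join predicate checked in-bucket
                joined.append({**cand, **row})
    return joined

dim  = [{"k": 1, "label": "a"}, {"k": 2, "label": "b"}]
fact = [{"k": 2, "v": 10}, {"k": 2, "v": 20}, {"k": 9, "v": 5}]
print(len(hash_join(dim, fact, "k")))
```

Building on `dim` rather than `fact` follows the rule in the text: hashing the smaller side keeps the hash table cheap to build and more likely to fit in memory.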
The aggregation grouping of the vectorized execution engine is commonly GROUP BY; combined with ALL, CUBE, and ROLLUP, it is a common OLAP (online analysis) capability and is very convenient to use. This series of OLAP functions supports many scenarios such as database reporting, statistical analysis, and warehouse processing, and can play an even larger role in combination with window functions.
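A rough Python analogue of GROUP BY ROLLUP mentioned above: aggregate at every prefix of the grouping columns plus a grand total, as OLAP reports do. The sales data is made up for illustration.

```python
# ROLLUP-style aggregation: totals for (region, year), (region,), and ().
def rollup(rows, group_cols, value_col):
    out = {}
    for depth in range(len(group_cols), -1, -1):   # full groups .. grand total
        for r in rows:
            key = tuple(r[c] for c in group_cols[:depth])
            out[key] = out.get(key, 0) + r[value_col]
    return out

sales = [
    {"region": "east", "year": 2023, "amt": 5},
    {"region": "east", "year": 2024, "amt": 7},
    {"region": "west", "year": 2024, "amt": 3},
]
totals = rollup(sales, ["region", "year"], "amt")
print(totals[()])           # grand total
print(totals[("east",)])    # per-region subtotal
```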
Step 4: the deep learning model learns and trains on the training set, then feeds the result (the best optimization scheme) back to the execution plan module; after receiving the feedback, the execution plan module automatically optimizes the execution plan and applies it to the next SQL query.
An interaction protocol is agreed in advance between the execution plan module and the deep learning model, and the data structures of the execution plan are defined in advance, including the abstract syntax tree, the access type (type), the table-join matching condition (ref), and so on.
The deep learning model learns and trains on the training-set data sent by the engine in step 3, quantifies and designs the points to feed back (such as the height estimate of the abstract syntax tree and the access type), and finally feeds the result of learning and training back to the execution plan module. After receiving the feedback, the execution plan module directly modifies the data structure of the corresponding execution plan in the in-memory abstract syntax tree to optimize it, ensures that the next query benefits from the optimized feedback, and promptly persists the optimization information for developers to analyze.
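Applying feedback by rewriting plan nodes in the in-memory tree can be sketched as follows. The field names `type` and `ref` mirror the text; the tree shape, the table-keyed feedback format, and the class names are assumptions for illustration.

```python
# Walk the in-memory plan tree and rewrite the matching node's access type
# and join condition according to the model's feedback, before the next query.
class PlanNode:
    def __init__(self, table, access_type, ref=None, children=()):
        self.table, self.type, self.ref = table, access_type, ref
        self.children = list(children)

def apply_feedback(node, feedback):
    if node.table in feedback:               # feedback keyed by table name
        hint = feedback[node.table]
        node.type = hint.get("type", node.type)
        node.ref = hint.get("ref", node.ref)
    for child in node.children:
        apply_feedback(child, feedback)

root = PlanNode("orders", "ALL",             # "ALL" = full table scan
                children=[PlanNode("customers", "ALL")])
apply_feedback(root, {"customers": {"type": "ref", "ref": "orders.cust_id"}})
print(root.children[0].type)                 # child switched to an index access
```

Because the mutation happens in place on the live tree, the very next query that reuses this plan picks up the optimized access path, matching the "applied to the next SQL query" behavior.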
FIG. 2 illustrates the principles of one embodiment of a system of the present invention for deep learning participation in an HTAP database to perform SQL optimization. Referring to fig. 2, the system of the present embodiment includes an SQL parser, an execution plan module, an engine module, and a deep learning model.
The SQL parser is configured to receive the SQL execution statement and check whether it meets the specification, typically including whether the SQL format and syntax are correct.
The execution plan module is configured to generate an execution plan through the deep learning model based on the SQL statements that passed the parser's specification check.
The execution plan module generates a logical-layer plan and then a physical-layer plan from it. Since the HTAP database mixes row and column storage, the physical-layer plan is divided into a row-based plan and a column-based plan.
The specific process of the execution plan module generating an execution plan based on the deep learning model is as follows.
The execution plan module utilizes a cyclic neural network structure (the cyclic neural network comprises an input layer, a hidden layer and an output layer), the input layer inputs SQL execution results and detailed information of SQL execution processes, the hidden layer makes cost calculation, and the output layer outputs SQL sentences and an optimal execution plan.
In the cost calculation of the hidden layer, cost estimation is carried out aiming at the SQL execution result and the detailed information of the SQL execution process, loss function parameters are continuously adjusted in the process of loop iteration (the iteration process is advanced by preset iteration times) so as to reduce the cost, and when the reduced cost is superior to an expected value of the cost estimation, an optimization result of the loop iteration is constructed into a data structure of a defined execution plan and fed back to an execution plan module.
The cost estimation is divided into two stages: a cost estimation analysis stage and a learning optimization stage.
The cost estimation analysis stage covers several execution cases such as full-table scan estimation, ordinary index estimation, and multi-table join queries. The deep learning model makes an overall estimate over these different execution cases to obtain an expected value of the cost estimation, comprising the I/O cost (number of I/O operations), the CPU computation cost, the memory occupation cost, and so on.
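A minimal sketch of such an expected value is a weighted combination of I/O, CPU, and memory components, evaluated per execution case. The weights and the per-case formulas below are illustrative assumptions, not values from the patent.

```python
# Hypothetical cost weights; a real optimizer would calibrate these.
IO_WEIGHT, CPU_WEIGHT, MEM_WEIGHT = 1.0, 0.2, 0.1

def expected_cost(io_ops, cpu_units, mem_pages):
    """Combine I/O count, CPU computation and memory occupation into one expected value."""
    return IO_WEIGHT * io_ops + CPU_WEIGHT * cpu_units + MEM_WEIGHT * mem_pages

def full_table_scan_cost(n_rows, rows_per_page=100):
    pages = -(-n_rows // rows_per_page)          # ceil division: pages to read
    return expected_cost(io_ops=pages, cpu_units=n_rows, mem_pages=pages)

def index_lookup_cost(tree_height, matching_rows):
    ios = tree_height + matching_rows            # descend the index, then fetch rows
    return expected_cost(io_ops=ios, cpu_units=matching_rows, mem_pages=tree_height)

scan = full_table_scan_cost(10_000)
lookup = index_lookup_cost(tree_height=3, matching_rows=50)
```

With these made-up weights the index lookup comes out far cheaper than the full-table scan, which is the kind of comparison the analysis stage produces as its expected value.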
In the subsequent learning optimization stage, the weight parameters of the deep learning model are continuously optimized with the recurrent neural network according to the model's training set data and compared against the expected values from the cost estimation analysis stage. If, after a certain number of learning iterations, the learning optimization result is better than the expected value of the cost estimation analysis, the result is fed back to the execution plan module, and iterative optimization continues.
The learning optimization stage is decoupled from SQL execution: each SQL execution result and the detailed information of the SQL execution process are sent asynchronously to the training set of the deep learning model, and the recurrent neural network feeds back to the execution plan module after a certain number of optimization rounds.
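The asynchronous, decoupled hand-off described above amounts to a producer-consumer pattern: SQL execution pushes results onto a queue without waiting, and a trainer drains the queue into the training set. The names (`on_sql_executed`, `trainer`) and the payload shape are illustrative assumptions.

```python
import queue
import threading

training_queue = queue.Queue()
training_set = []

def on_sql_executed(result, details):
    """Called after each SQL execution; hands data off without blocking execution."""
    training_queue.put({"result": result, "details": details})

def trainer(n_expected):
    """Trainer side: drain the queue into the deep learning model's training set."""
    for _ in range(n_expected):
        training_set.append(training_queue.get())

t = threading.Thread(target=trainer, args=(3,))
t.start()
for i in range(3):
    on_sql_executed(result=f"rows={i}", details={"io": i})
t.join()
```

Because the execution path only enqueues, slow model training never stalls SQL execution, which is the point of the decoupling.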
The generated execution plan generally falls into the following cases:
3. Trivial plan matching - sometimes there is only one way to execute the query. For example, a heap table can only be accessed through a table scan. To avoid wasting time optimizing such queries, SQL SERVER maintains a list of trivial plans to select from; if the optimization stage finds a plan in this list that matches the query, a similar plan is generated without any optimization.
4. Multi-stage optimization - for complex queries, the number of alternative processing strategies to analyze can be large, and evaluating each one can take a long time. The optimization stage therefore does not analyze all possible processing strategies, but divides them into several configurations, each containing different index and join techniques.
Index variations consider different index characteristics: single-column indexes, composite indexes, index column order, index density, and so on. Similarly, join variations consider the different join techniques available in the engine: nested loop joins, merge joins, and hash matches.
Learning optimization considers the statistics of the columns referenced in the WHERE clause to evaluate the effectiveness of the index and join strategies; these statistics are used to evaluate the overhead of each configuration across multiple optimization stages, covering many factors such as the CPU, memory usage, and disk I/O required to execute the query. After each optimization stage, the cost of the processing strategy is evaluated: if the cost is economical enough, further looping through the optimization stages stops and the optimization process exits; otherwise, the loop continues until a cost-effective processing strategy is determined.
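The stage-by-stage loop with an early exit once a plan is "economical enough" can be sketched as follows. The configurations, their costs, and the threshold are made-up illustrative numbers; only the control flow mirrors the description above.

```python
def choose_plan(configurations, cheap_enough=100.0):
    """Evaluate index/join configurations stage by stage; exit early when one
    is sufficiently economical, otherwise exhaust all optimization stages."""
    best = None
    for stage, (name, cost) in enumerate(configurations):
        if best is None or cost < best[1]:
            best = (name, cost)
        if best[1] <= cheap_enough:          # cost economical enough: stop looping
            return best, stage + 1
    return best, len(configurations)         # had to run every optimization stage

configs = [("full_scan+nested_loop", 500.0),
           ("index+merge_join", 80.0),
           ("index+hash_match", 60.0)]
best, stages_run = choose_plan(configs)
```

Here the loop stops after the second stage because the merge-join configuration already beats the threshold, so the (slightly cheaper) third configuration is never evaluated; that trade of plan quality for optimization time is exactly the rationale for multi-stage optimization.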
The engine module is used for executing the generated execution plan and sending the SQL execution result and the detailed information of the SQL execution process to the training set of the deep learning model.
The engines include a Row Based Execution Engine (a row-based execution engine for executing row-based plans) and a Vectorized Execution Engine (a vectorized execution engine for executing column-based plans); both send each SQL execution result and the detailed information of the SQL execution process to the training set of the deep learning model.
During engine execution, data (intermediate data and data caches generated during SQL execution, such as data pages, index pages, and the redo log) are placed into the Row Buffer Pool and the Column Index Buffer Pool in the storage engine. After each SQL execution finishes, the SQL execution result and the detailed information of the SQL execution process are sent to the training set of the deep learning model.
The row-based execution engine mainly processes execution plans related to transactions. It selects the operators of different modules (such as full table scan and nested loop join operators) according to the execution plan, and determines which tables need to be operated on and which APIs need to be called during execution.
The full table scan in the row-based execution engine scans a table specified by the execution plan; the expression framework generates the corresponding instructions and calls the corresponding processing APIs until the needed data is found.
The nested loop join in the row-based execution engine joins an outer dataset to an inner dataset: for each row of the outer dataset, the database matches all rows in the inner dataset that satisfy the predicate conditions. If an index is available on the inner dataset or inner table, the database uses it to locate the rowid and acquire the data.
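A minimal sketch of this nested loop join, with an optional index standing in for "using an index to locate the rowid"; the row schemas and names are illustrative.

```python
def nested_loop_join(outer, inner, key, index=None):
    """For each outer row, find all matching inner rows on `key`.
    If an index on the inner table is supplied, use it instead of scanning."""
    result = []
    for o in outer:
        if index is not None:                            # index available on inner table
            matches = index.get(o[key], [])
        else:                                            # otherwise scan the inner dataset
            matches = [i for i in inner if i[key] == o[key]]
        for m in matches:
            result.append({**o, **m})
    return result

outer = [{"id": 1, "city": "SH"}, {"id": 2, "city": "BJ"}]
inner = [{"id": 1, "amount": 10}, {"id": 1, "amount": 20}, {"id": 3, "amount": 5}]
joined = nested_loop_join(outer, inner, key="id")
```

Passing `index={1: [...], 3: [...]}` would replace the inner scan with a direct lookup, which is the win the text describes when an index exists on the inner table.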
The vectorized execution engine mainly processes execution plans related to data analysis, where most requests concern queries and analysis; because the data storage is multidimensional, data analysis is very convenient. The conventional transactional model is a two-dimensional data model, which is inconvenient to operate on and analyze when multiple sets of tables must be handled. In addition, dedicated SIMD (single instruction, multiple data) instructions are generated through the expression framework, and the corresponding column index access APIs are called, improving overall analysis efficiency.
The full-table scan of the vectorized execution engine benefits from the columnar database: only the needed column data is extracted for analysis, which greatly reduces the data volume of the full-table scan.
The nested loop join of the vectorized execution engine, according to the execution plan, fully exploits the characteristics of the columnar database: for each row of the outer dataset, only the needed columns are associated, and the corresponding column data is recombined.
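The columnar behaviour in the two paragraphs above can be sketched with a table stored as one array per column: a "full-table scan" reads only the requested columns, and the result rows are recombined from just those columns. The table layout and names are illustrative.

```python
# Column-oriented storage: one array per column.
table = {
    "order_id": [1, 2, 3],
    "amount":   [10.0, 25.0, 7.5],
    "comment":  ["a", "b", "c"],   # never touched by the query below
}

def scan_columns(tbl, columns):
    """Columnar full-table scan: read only the requested columns,
    shrinking the data volume actually scanned."""
    return {c: tbl[c] for c in columns}

projected = scan_columns(table, ["order_id", "amount"])
# Recombine the needed column data into result rows.
rows = list(zip(projected["order_id"], projected["amount"]))
```

The `comment` column is never read, which is the reduction in scanned data volume the text attributes to the column store.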
The hash join of the vectorized execution engine is mainly divided into two phases: a build phase and a probe phase. In the build phase, one table is selected (typically the smaller one, to reduce the time and space for creating the hash table) and a hash function is applied to the join attribute of each tuple to obtain a hash value, thereby creating the hash table. In the probe phase, each row of the other table is scanned, the hash value of its join attribute is computed and compared with the hash table built in the build phase; if the rows fall into the same bucket and the join predicate is satisfied, they are joined into a new table. When memory is large enough, the hash table is built entirely in memory and is written to disk after the join operation completes; this process, however, also incurs many I/O operations.
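The two-phase hash join above can be sketched as follows; the dict stands in for the hash table and buckets, and the table contents are illustrative.

```python
def hash_join(build_side, probe_side, key):
    """Two-phase hash join: build a hash table on the (smaller) build side,
    then probe it with each row of the other table."""
    buckets = {}
    for row in build_side:                       # build phase: hash the join attribute
        buckets.setdefault(row[key], []).append(row)
    result = []
    for row in probe_side:                       # probe phase: scan the other table
        for match in buckets.get(row[key], []):  # same bucket -> join predicate holds here
            result.append({**match, **row})
    return result

small = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
large = [{"uid": 2, "spend": 9}, {"uid": 2, "spend": 4}, {"uid": 5, "spend": 1}]
joined = hash_join(small, large, key="uid")
```

Choosing the smaller table as the build side keeps the hash table small, which is why the text recommends it; in this sketch equality on `key` doubles as the join predicate.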
The aggregate grouping of the vectorized execution engine is commonly GROUP BY; GROUP BY combined with ALL, CUBE, and ROLLUP is a common OLAP (online analytical processing) facility and is very convenient to use. This series of OLAP functions supports many scenarios such as database reporting, statistical analysis, and warehouse processing, and can play an even larger role when combined with window functions.
The deep learning model learns and trains based on the training set, and then feeds the results of model learning and training back to the execution plan module in the form of an optimal optimization scheme, so that after receiving the feedback of the deep learning model, the execution plan module automatically optimizes the execution plan and applies it to the next SQL query.
An interaction protocol is preset between the execution plan module and the deep learning model, and the data structures of the execution plan are defined in advance, including the abstract syntax tree, the access type (type), the join matching conditions of tables (ref), and so on.
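A minimal sketch of such a pre-agreed plan data structure, with the three fields named above (abstract syntax tree, access type `type`, join matching condition `ref`); any field names and values beyond those are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class AstNode:
    """One node of the execution plan's abstract syntax tree."""
    operator: str                      # e.g. "table_scan", "hash_join" (illustrative)
    children: list = field(default_factory=list)

@dataclass
class ExecutionPlan:
    """The structure exchanged between the execution plan module and the model."""
    ast: AstNode                       # abstract syntax tree of the plan
    access_type: str                   # the access type ("type"), e.g. "ALL", "ref"
    ref: str                           # join matching condition of the table ("ref")

plan = ExecutionPlan(
    ast=AstNode("hash_join", [AstNode("table_scan"), AstNode("index_scan")]),
    access_type="ref",
    ref="orders.user_id = users.id",
)
```

Because both sides agree on this structure in advance, the model's feedback can directly target a node of the AST and the execution plan module can modify it in place in memory, as described below in the text.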
The deep learning model learns and trains on the training set data sent by the engine module, quantifies and designs the feedback points (such as the height estimation of the abstract syntax tree, the access type, etc.), and finally feeds the results of model learning and training back to the execution plan module. After receiving the feedback of the deep learning model, the execution plan module directly modifies the data structure of the corresponding execution plan in the abstract syntax tree in memory to perform the optimization, ensures that the next query can apply the optimized feedback, and writes the optimization information to disk in time for analysis by developers.
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts shown and described herein, or with acts not shown and described herein, as would be understood and appreciated by those skilled in the art.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over, as one or more instructions or code, a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A method for deep learning to participate in SQL optimization of an HTAP database, the method comprising:
step 1, an SQL parser receives an SQL execution statement and checks whether the SQL execution statement meets the specification;
Step 2, the execution plan module generates an execution plan through a deep learning model based on the SQL statements that have been checked by the SQL parser and conform to the specification;
step 3, the engine executes the execution plan generated in the step 2 and sends the SQL execution result and the detailed information of the SQL execution process to the training set of the deep learning model;
Step 4, the deep learning model carries out learning and training based on a training set, and then the results of model learning and training are fed back to the execution plan module in the form of an optimal optimization scheme; after the execution plan module receives the feedback of the deep learning model, the execution plan is automatically optimized and applied to the next SQL query;
wherein, the generating an execution plan based on the deep learning model by the execution plan module in step 2 further comprises:
the execution plan module utilizes an input layer, a hidden layer and an output layer of a recurrent neural network structure: the input layer inputs the SQL execution result and the detailed information of the SQL execution process, the hidden layer performs the cost calculation, and the output layer outputs the SQL statement and the optimal execution plan; in the cost calculation of the hidden layer, cost estimation is carried out on the SQL execution result and the detailed information of the SQL execution process, loss function parameters are continuously adjusted during iteration to reduce the cost, and when the reduced cost is better than an expected value of the cost estimation, the optimization result of the iteration is constructed into the data structure of the predefined execution plan and fed back to the execution plan module;
Wherein step 4 further comprises:
the method comprises the steps that interaction agreements are preset between an execution plan module and a deep learning model, and data structures of the execution plan are defined in advance;
the deep learning model learns and trains on the training set data sent by the engine in step 3, quantifies and designs the feedback points, and finally feeds the results of model learning and training back to the execution plan module, so that after receiving the feedback of the deep learning model, the execution plan module directly modifies the data structure of the corresponding execution plan in the abstract syntax tree in memory to perform the optimization and ensures that the next query can apply the optimized feedback.
2. The method of claim 1, wherein the execution plan generated in step 2 is divided into a row-based plan and a column-based plan, and the engine in step 3 is divided into a row-based execution engine to execute the row-based plan and a vectorized execution engine to execute the column-based plan.
3. A system for deep learning to participate in an HTAP database to perform SQL optimization, the system comprising:
the SQL parser is used for receiving the SQL execution statement and checking whether the SQL execution statement accords with the specification;
the execution plan module is used for generating an execution plan through the deep learning model based on SQL statements that have been checked by the SQL parser and conform to the specification;
The engine module is used for executing the generated execution plan and sending the SQL execution result and the detailed information of the SQL execution process to the training set of the deep learning model;
The deep learning model is used for learning and training based on the training set, and the results of model learning and training are fed back to the execution plan module in the form of an optimal optimization scheme, so that after receiving the feedback of the deep learning model, the execution plan module automatically optimizes the execution plan and applies it to the next SQL query;
wherein generating the execution plan based on the deep learning model of the execution plan module further comprises:
The execution plan module utilizes an input layer, a hidden layer and an output layer of a recurrent neural network structure: the input layer inputs the SQL execution result and the detailed information of the SQL execution process, the hidden layer performs the cost calculation, and the output layer outputs the SQL statement and the optimal execution plan; in the cost calculation of the hidden layer, cost estimation is carried out on the SQL execution result and the detailed information of the SQL execution process of the execution plan module, loss function parameters are continuously adjusted during iteration to reduce the cost, and when the reduced cost is better than an expected value of the cost estimation, the optimization result of the iteration is constructed into the data structure of the predefined execution plan and fed back to the execution plan module;
wherein the deep learning model is further configured to:
the method comprises the steps that interaction agreements are preset between an execution plan module and a deep learning model, and data structures of the execution plan are defined in advance;
The deep learning model learns and trains on the training set data sent by the engine module, quantifies and designs the feedback points, and finally feeds the results of model learning and training back to the execution plan module, so that after receiving the feedback of the deep learning model, the execution plan module directly modifies the data structure of the corresponding execution plan in the abstract syntax tree in memory to perform the optimization and ensures that the next query can apply the optimized feedback.
4. The system for deep learning participation HTAP database execution SQL optimization of claim 3, wherein the execution plans generated by the execution plan module are divided into a row-based plan and a column-based plan, and the engine module is divided into a row-based execution engine to execute the row-based plan and a vectorized execution engine to execute the column-based plan.
CN202210752471.2A 2022-06-29 2022-06-29 Method and system for optimizing SQL execution in HTAP database using deep learning Active CN115033597B (en)

Publications (2)

Publication Number Publication Date
CN115033597A CN115033597A (en) 2022-09-09
CN115033597B true CN115033597B (en) 2025-02-18





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant