CN111723112B - Data task execution method, device, electronic device and storage medium

Info

Publication number: CN111723112B
Application number: CN202010529804.6A
Authority: CN
Inventors: 黄琼峰; 桂祖宏
Original assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Priority date: 2020-06-11
Filing date: 2020-06-11
Publication date: 2023-07-07
Anticipated expiration: 2040-06-11
Also published as: CN111723112A

Abstract

Embodiments of the invention provide a data task execution method, device, electronic device and storage medium. The method comprises: determining the data volume of a target data query task and the resource consumption cost of the target data query task under each preset computing engine; determining, according to the data volume and the resource consumption cost, a target computing engine for executing the target data query task; and executing the target data query task with the target computing engine. By using the data volume of the target data query task and its resource consumption cost under each preset computing engine to select a matching computing engine, the method, device, electronic device and storage medium prevent an unmatched computing engine from affecting execution of the data query task, and improve the execution efficiency and stability of the data query task.

Description

Data task execution method, device, electronic device and storage medium

Technical field

The present invention relates to the technical field of data processing, and in particular to a data task execution method, device, electronic device and storage medium.

Background

Big data task processing is mainly divided into batch task processing and real-time task processing. Because of its processing time, batch task processing is also called Hive offline task processing. Real-time task processing achieves faster response times for data processing and can be implemented in different ways; it includes Storm stream computing and Spark real-time interactive computing.

Both batch task processing and real-time task processing can be referred to as computing engines, and different computing engines suit different application scenarios. Offline task processing is suitable for data that is very large in volume and rarely updated, but its disadvantages are obvious: slow response, poor interactivity, and a single programming model. Real-time task processing responds quickly, but most real-time computing engines depend on memory, so they tend to have high hardware requirements and limited tolerance for large data volumes.

Existing data platforms support a mix of batch and real-time computing engines. For example, a single data platform can simultaneously support real-time and offline computing engines such as Hive, Spark, Storm and Kylin, each with its own application scenarios. Consequently, a specific computing engine is manually specified before a data query task is executed. The task is then bound to that one computing engine, and the engine cannot be selected dynamically, so an unsuitable computing engine may affect the running of the data query task and reduce its execution efficiency and stability.

Summary of the invention

In view of the problems in the prior art, embodiments of the present invention provide a data task execution method, device, electronic device and storage medium.

In a first aspect, an embodiment of the present invention provides a data task execution method, including:

Determining the data volume of a target data query task, and the resource consumption cost of the target data query task under each preset computing engine;

Determining, according to the data volume and the resource consumption cost, a target computing engine for executing the target data query task, and executing the target data query task with the target computing engine.

Further, the target data query task is obtained by parsing a target query statement and consists of multiple subtasks. Correspondingly, determining the resource consumption cost of the target data query task under each preset computing engine includes:

Determining the IO read/write data volume, CPU resources and memory resources required by each subtask of the target data query task under each computing engine, as well as the IO bandwidth, CPU resources and memory resources currently available on the data platform;

Determining the resource consumption cost of the target data query task under each computing engine according to the IO read/write data volume, CPU resources and memory resources required by each subtask under each computing engine, and the IO bandwidth, CPU resources and memory resources currently available on the data platform.

Further, determining the resource consumption cost of the target data query task under each computing engine according to the IO read/write data volume, CPU resources and memory resources required by each subtask under each computing engine, and the IO bandwidth, CPU resources and memory resources currently available on the data platform, includes:

Determining, for each computing engine, the sum over the subtasks of the ratio of the IO read/write data volume to the IO bandwidth currently available on the data platform, the sum of the ratio of the CPU resources to the CPU resources currently available on the data platform, and the sum of the ratio of the memory resources to the memory resources currently available on the data platform;

Determining the resource consumption cost of the target data query task under each computing engine according to these sums.

Further, determining the target computing engine for executing the target data query task according to the data volume and the resource consumption cost includes:

When the data volume is determined to exceed a first threshold, determining the Hive computing engine as the target computing engine;

When the data volume is determined not to exceed the first threshold, determining the computing engine with the lowest resource consumption cost among the computing engines as the target computing engine.

Further, before determining the target computing engine for executing the target data query task according to the data volume and the resource consumption cost, the method further includes:

Obtaining an execution cache table corresponding to the target data query task;

Executing the target data query task with the computing engine pre-stored in the execution cache table.

Further, the execution cache table includes a computing engine execution table and data cache validity time information. Correspondingly, executing the target data query task with the computing engine pre-stored in the execution cache table includes:

After determining from the data cache validity time information that the target data query task is cached, obtaining the pre-stored computing engine by parsing the computing engine execution table, and executing the target data query task with it.

In a second aspect, an embodiment of the present invention provides a data task execution device, including:

A determining module, configured to determine the data volume of a target data query task, and the resource consumption cost of the target data query task under each preset computing engine;

An execution module, configured to determine, according to the data volume and the resource consumption cost, a target computing engine for executing the target data query task, and to execute the target data query task with the target computing engine.

In a third aspect, an embodiment of the present invention provides a data task execution system, including:

A task parser, configured to determine a target data query task;

A task compiler, configured to determine the data volume of the target data query task, and the resource consumption cost of the target data query task under each preset computing engine;

A computing engine selector, configured to determine, according to the data volume and the resource consumption cost, a target computing engine for executing the target data query task, and to execute the target data query task with the target computing engine;

A metadata repository, configured to store the data required for executing the target data query task.

In a fourth aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above data task execution method when executing the program.

In a fifth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the steps of the above data task execution method are implemented when the computer program is executed by a processor.

The data task execution method, device, electronic device and storage medium provided by the embodiments of the present invention use the data volume of the target data query task and its resource consumption cost under each preset computing engine to select a matching computing engine for executing the data query task. This prevents an unmatched computing engine from affecting execution of the data query task and improves the execution efficiency and stability of the data query task.

Description of drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of an embodiment of the data task execution method of the present invention;

FIG. 2 is a schematic diagram of the complete flow of the data task execution method of the present invention;

FIG. 3 is a structural diagram of an embodiment of the data task execution device of the present invention;

FIG. 4 is a structural diagram of an embodiment of the data task execution system of the present invention;

FIG. 5 is a structural diagram of an embodiment of the electronic device of the present invention.

Detailed description

To make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

FIG. 1 shows a data task execution method provided by an embodiment of the present invention, including:

S11. Determining the data volume of a target data query task, and the resource consumption cost of the target data query task under each preset computing engine;

S12. Determining, according to the data volume and the resource consumption cost, a target computing engine for executing the target data query task, and executing the target data query task with the target computing engine.

Regarding steps S11 and S12, it should be noted that, in the embodiments of the present invention, the target data query task is obtained by parsing and converting a data query statement entered through a system interface. The data query statement may be an SQL statement, a SPARQL statement, a DQL statement, and so on, so the data query task obtained by parsing and converting such a statement may be an SQL task, a SPARQL task, a DQL task, and so on.

The method of this embodiment is explained below using an SQL task as an example:

The SQL statement is parsed and converted into a task tree. The task tree has multiple tree nodes, and each tree node corresponds to one subtask. While the SQL statement is parsed, some preliminary analysis is performed, including syntax checking; once the syntax check passes, the target SQL task is determined to be valid.
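The task tree described above can be pictured with a small data structure. The following is a minimal sketch only: the class name `TaskNode`, the `operation` labels and the hand-built example tree are illustrative assumptions, and the source does not specify how the parser represents nodes or in what order subtasks are visited.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class TaskNode:
    """One node of the task tree; each node corresponds to one subtask."""
    operation: str  # hypothetical label such as "scan_a" or "join"
    children: List["TaskNode"] = field(default_factory=list)

    def subtasks(self) -> Iterator["TaskNode"]:
        """Yield every subtask in the tree, children before their parent."""
        for child in self.children:
            yield from child.subtasks()
        yield self


# Hand-built tree standing in for a parsed query that joins two tables
# and then aggregates the result (no real SQL parser is involved here).
tree = TaskNode("aggregate", [
    TaskNode("join", [TaskNode("scan_a"), TaskNode("scan_b")]),
])
order = [node.operation for node in tree.subtasks()]
```

Walking the tree this way gives a flat list of subtasks, which is the shape the later per-subtask resource estimates operate on.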

Each subtask of the SQL task needs to process underlying data when it is executed, and the data volume can be computed from that underlying data. The data volume of the target SQL task is therefore obtained by calculation over the task.

In the embodiments of the present invention, the underlying data corresponding to the SQL task may be stored in a metadata repository. The metadata repository also holds the names of the data tables related to the SQL task, the columns and partitions of those tables and their attributes, the table attributes, the directories where the table data is located, and so on.

In the embodiments of the present invention, the computing engines include Hive (offline), Spark (real-time), Storm (streaming) and Kylin. The resource consumption cost of the target SQL task differs from one computing engine to another, so the resource consumption cost of the target SQL task is calculated for each computing engine.

Next, according to the data volume obtained above and the resource consumption cost under each computing engine, the target computing engine for executing the target SQL task is determined through a preset computing engine selection strategy, and the target SQL task is executed with the target computing engine.

It should be noted that the computing engine selection strategy uses the data volume and the resource consumption cost under each computing engine as filtering conditions to pick one of the computing engines as the target computing engine for executing the target SQL task.

The data task execution method provided by this embodiment of the present invention uses the data volume of the target data query task and its resource consumption cost under each preset computing engine to select a matching computing engine for executing the data query task. This prevents an unmatched computing engine from affecting execution of the data query task and improves the execution efficiency and stability of the data query task.

A further embodiment of the above method explains how the resource consumption cost of the target SQL task under each preset computing engine is determined:

S111. Determining the IO read/write data volume, CPU resources and memory resources required by each subtask of the target SQL task under each computing engine, as well as the IO bandwidth, CPU resources and memory resources currently available on the data platform;

S112. Determining the resource consumption cost of the target SQL task under each computing engine according to the IO read/write data volume, CPU resources and memory resources required by each subtask under each computing engine, and the IO bandwidth, CPU resources and memory resources currently available on the data platform.

Regarding steps S111 and S112, it should be noted that, as mentioned in the above embodiment, the target SQL task contains multiple subtasks. Analyzing each subtask yields the IO read/write data volume, CPU resources and memory resources it requires, together with the IO bandwidth, CPU resources and memory resources currently available on the data platform. Here, the data platform is the computing environment required to execute data query tasks.

For each computing engine, the following are determined: the sum over the subtasks of the ratio of the IO read/write data volume to the IO bandwidth currently available on the data platform, the sum of the ratio of the CPU resources to the CPU resources currently available on the data platform, and the sum of the ratio of the memory resources to the memory resources currently available on the data platform. The resource consumption cost of the target data query task under each computing engine is then determined from these sums.
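The three per-resource sums can be sketched in a few lines. This is a hedged illustration, not the patent's implementation: the `Resources` type, the function name `resource_cost`, the sample numbers, and in particular the choice to add the three sums with equal weight are all assumptions, since the text only says the cost is determined "from these sums".

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Resources:
    io: float   # IO read/write volume (subtask) or IO bandwidth (platform)
    cpu: float  # CPU resources required or available
    mem: float  # memory resources required or available


def resource_cost(subtasks: Iterable[Resources], available: Resources) -> float:
    """Sum each subtask's demand as a fraction of what the platform has free."""
    io_sum = cpu_sum = mem_sum = 0.0
    for need in subtasks:
        io_sum += need.io / available.io
        cpu_sum += need.cpu / available.cpu
        mem_sum += need.mem / available.mem
    # Assumption: the three sums are combined by plain addition.
    return io_sum + cpu_sum + mem_sum


# Hypothetical figures: the same two subtasks estimated under two engines.
available = Resources(io=100.0, cpu=10.0, mem=50.0)
hive_cost = resource_cost([Resources(20, 1, 5), Resources(30, 1, 5)], available)
spark_cost = resource_cost([Resources(10, 2, 20), Resources(10, 2, 20)], available)
```

With these made-up inputs the memory-heavy estimate comes out more expensive, which matches the intuition in the text that memory-dependent engines cost more when memory is scarce.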

In a further embodiment of the above method, a resource consumption cost formula is used to determine the resource consumption cost of the target SQL task under each computing engine.

The resource consumption cost formula uses the following symbols:

C_{compute-engine} is the resource consumption cost;

IO_i is the IO read/write data volume required by the i-th subtask of the target SQL task under the computing engine, and IO_{usable} is the IO bandwidth currently available on the data platform;

CPU_i is the CPU resources required by the i-th subtask of the target SQL task under the computing engine, and CPU_{usable} is the CPU resources currently available on the data platform;

Mem_i is the memory resources required by the i-th subtask of the target SQL task under the computing engine, and Mem_{usable} is the memory resources currently available on the data platform.
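The formula itself does not survive in this text. One form that is consistent with the symbol definitions and with the "sum of ratios" description is sketched below; the equal-weight addition of the three sums and the symbol n (the number of subtasks) are assumptions, not something the surviving text confirms.

```latex
C_{compute\text{-}engine}
  = \sum_{i=1}^{n} \frac{IO_i}{IO_{usable}}
  + \sum_{i=1}^{n} \frac{CPU_i}{CPU_{usable}}
  + \sum_{i=1}^{n} \frac{Mem_i}{Mem_{usable}}
```

Each term is dimensionless, so the three sums can be compared and combined even though IO, CPU and memory are measured in different units.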

It should be noted that the IO read/write data volume, CPU resources and memory resources required by each subtask may differ from one computing engine to another. The resource consumption cost of the target SQL task therefore differs across computing engines.

A further embodiment of the above method explains how the target computing engine for executing the target SQL task is determined according to the data volume, the resource consumption cost and the preset computing engine selection strategy. Because the data volume of the target SQL task and its resource consumption cost under each computing engine are both numeric values, the computing engine selection strategy works by evaluating these values to determine the matching computing engine as the target computing engine of the target SQL task. Details are as follows:

The computing engine selection strategy includes:

When the data volume of the SQL task is determined to exceed the first threshold, the Hive computing engine is determined as the target computing engine.

When the data volume of the SQL task is determined not to exceed the first threshold, the computing engine with the lowest resource consumption cost among the computing engines is determined as the target computing engine.

It should be noted that the Hive computing engine is suited to offline computing scenarios with large data volumes, where task execution time is relatively long. Computing engines such as Spark and Storm are suited to real-time computing scenarios: their data processing time is short, but they demand substantial computing resources, and when computing resources are insufficient the SQL task fails outright.

Here, the first threshold is the boundary value for judging whether an SQL task is a large-data-volume task. For example, tasks whose data volume exceeds 10T use the Hive computing engine.
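The two-branch selection strategy above can be written as a short function. This is a sketch under stated assumptions: the name `select_engine`, the engine labels, and reading the "10T" example as 10 × 1024⁴ bytes are all illustrative choices; the source does not say how the threshold is configured or how ties between equal costs are broken (here the first engine in the mapping wins).

```python
from typing import Dict

# Assumption: "10T" in the example is read as 10 * 1024**4 bytes.
FIRST_THRESHOLD_BYTES = 10 * 1024**4


def select_engine(data_volume_bytes: int, costs: Dict[str, float]) -> str:
    """Pick Hive for large-volume tasks, else the cheapest engine by cost."""
    if data_volume_bytes > FIRST_THRESHOLD_BYTES:
        return "hive"
    # min() over the dict keys returns the engine with the lowest cost.
    return min(costs, key=costs.get)


# Hypothetical cost estimates for one task.
example_costs = {"hive": 0.9, "spark": 0.4}
large = select_engine(11 * 1024**4, example_costs)
small = select_engine(1024**3, example_costs)
```

Note that the volume check runs first, so a very large task goes to Hive even when another engine's estimated cost is lower.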

A further embodiment of the above method explains how the computing engine for executing the target SQL task is selected before the target computing engine is determined according to the data volume of the SQL task and its resource consumption cost under each computing engine:

The execution cache table corresponding to the target SQL task is obtained.

As mentioned in the above embodiment, the underlying data corresponding to the SQL task may be stored in the metadata repository, which also holds the names of the data tables related to the SQL task, the columns and partitions of those tables and their attributes, the table attributes, the directories where the table data is located, and so on. The metadata repository additionally holds the execution cache table of the SQL task, which records the computing engine used by an SQL task within the cache validity time.

After the data in the execution cache table is determined to be valid, the target SQL task is executed with the computing engine pre-stored in the execution cache table.

In a further embodiment of the above method, the execution cache table includes a computing engine execution table and data cache validity time information. The computing engine execution table contains the computing engine used by the SQL task within the cache validity time, and the data cache validity time information contains the cache validity time of the SQL task.

Accordingly, after it is determined from the data cache validity time information that the target SQL task is cached, the pre-stored computing engine is obtained by parsing the computing engine execution table and is used to execute the target SQL task.

In addition, after the data in the execution cache table is determined to be invalid, the data volume of the target SQL task and its resource consumption cost under each preset computing engine are determined, and the target SQL task is executed with the target computing engine selected from them.

It should be noted that, after the target computing engine has been determined according to the data volume and the resource consumption cost and the target data query task has been executed with it, the target computing engine needs to be recorded in the execution cache table, which is stored in the metadata repository.

In this embodiment, because the execution cache table is stored in the metadata repository, the system can access the metadata repository directly before determining the target computing engine according to the data volume and the resource consumption cost, and can thus quickly select a suitable computing engine from the execution cache table.
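The execution cache table behaves like a lookup keyed by task with an expiry time. The sketch below is an in-memory stand-in only: the class and method names, the use of a string task key, and the time-to-live representation of the "cache validity time" are assumptions, and the source keeps this table in the metadata repository rather than in process memory.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CacheEntry:
    engine: str        # engine recorded in the computing engine execution table
    expires_at: float  # end of the cache validity time, as a Unix timestamp


class ExecutionCacheTable:
    """In-memory stand-in for the execution cache table."""

    def __init__(self) -> None:
        self._entries: Dict[str, CacheEntry] = {}

    def put(self, task_key: str, engine: str, ttl_seconds: float,
            now: Optional[float] = None) -> None:
        """Record the engine chosen for a task, valid for ttl_seconds."""
        now = time.time() if now is None else now
        self._entries[task_key] = CacheEntry(engine, now + ttl_seconds)

    def get(self, task_key: str, now: Optional[float] = None) -> Optional[str]:
        """Return the cached engine, or None if missing or expired."""
        now = time.time() if now is None else now
        entry = self._entries.get(task_key)
        if entry is None or now >= entry.expires_at:
            return None
        return entry.engine


cache = ExecutionCacheTable()
cache.put("query-1", "spark", ttl_seconds=60, now=1000.0)
fresh = cache.get("query-1", now=1030.0)
stale = cache.get("query-1", now=1061.0)
```

Passing `now` explicitly keeps the example deterministic; in normal use the wall clock is read instead. A `None` result corresponds to the "invalid" case, where the data volume and costs are computed from scratch.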

Combining the above embodiments, FIG. 2 shows the overall flow of the data task execution method provided by an embodiment of the present invention. Referring to FIG. 2, the details are as follows:

If the SQL task is cached in the execution cache table: when the Hive computing engine is cached in the execution cache table, the SQL task is executed with the Hive computing engine; when the Spark computing engine is cached in the execution cache table, the SQL task is executed with the Spark computing engine.

If the SQL task is not cached in the execution cache table and its data volume exceeds the first threshold, the SQL task is executed with the Hive computing engine.

If the data volume does not exceed the first threshold, the resource consumption cost of the SQL task under the Hive and Spark computing engines is calculated.

If the resource consumption cost under the Hive computing engine is the lower of the two, the SQL task is executed with the Hive computing engine.

If the resource consumption cost under the Spark computing engine is the lower of the two, the SQL task is executed with the Spark computing engine.
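The order of checks in this flow (cache first, then data volume, then cost comparison) can be sketched as one function. Everything named here is an illustrative assumption: `choose_engine`, its parameters, and passing cost estimation as a callable so that it runs only when the first two checks do not decide. Tie-breaking between equal costs is not specified in the source; this sketch returns the first engine in the mapping.

```python
from typing import Callable, Dict, Optional


def choose_engine(
    cached_engine: Optional[str],
    data_volume: float,
    first_threshold: float,
    estimate_costs: Callable[[], Dict[str, float]],
) -> str:
    """Follow the flow: cache hit, else volume check, else lowest cost."""
    if cached_engine is not None:
        return cached_engine       # valid entry in the execution cache table
    if data_volume > first_threshold:
        return "hive"              # large-volume task goes to Hive
    costs = estimate_costs()       # computed only when actually needed
    return min(costs, key=costs.get)


calls = []


def fake_estimate() -> Dict[str, float]:
    """Hypothetical estimator that records each invocation."""
    calls.append("estimated")
    return {"hive": 0.9, "spark": 0.4}


from_cache = choose_engine("spark", 5.0, 10.0, fake_estimate)
by_volume = choose_engine(None, 20.0, 10.0, fake_estimate)
calls_before_cost_branch = len(calls)
by_cost = choose_engine(None, 5.0, 10.0, fake_estimate)
```

Deferring the estimate reflects the flow in FIG. 2, where costs are calculated only on the branch in which neither the cache nor the threshold has already settled the choice.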

图3示出了本发明一实施例提供的一种数据任务执行装置，包括确定模块21和执行模块22，其中：Fig. 3 shows a data task execution device provided by an embodiment of the present invention, including a determination module 21 and an execution module 22, wherein:

确定模块21，用于确定目标数据查询任务的数据量，以及目标数据查询任务在预设的各计算引擎下的资源消耗成本；A determining module 21, configured to determine the data volume of the target data query task, and the resource consumption cost of the target data query task under each preset computing engine;

执行模块22，用于根据数据量和资源消耗成本确定执行目标数据查询任务的目标计算引擎，并根据目标计算引擎执行目标数据查询任务。The execution module 22 is configured to determine a target computing engine for executing the target data query task according to the data volume and resource consumption cost, and execute the target data query task according to the target computing engine.

In a further embodiment of the above device embodiment, in the process of determining the resource consumption cost of the target data query task under each preset computing engine, the determining module is specifically configured to:

determine the IO read/write data volume, CPU resources and memory resources required by each subtask of the target data query task under each computing engine, as well as the IO bandwidth, CPU resources and memory resources currently available on the data platform;

determine the resource consumption cost of the target data query task under each computing engine according to the IO read/write data volume, CPU resources and memory resources required by each subtask under each computing engine, and the IO bandwidth, CPU resources and memory resources currently available on the data platform.

In a further embodiment of the above device embodiment, determining the resource consumption cost of the target data query task under each computing engine according to the IO read/write data volume, CPU resources and memory resources required by each subtask under each computing engine, and the IO bandwidth, CPU resources and memory resources currently available on the data platform, includes:

In a further embodiment of the above device embodiment, a resource consumption cost acquisition formula is used to determine the resource consumption cost of the target data query task under each computing engine, wherein the resource consumption cost acquisition formula includes the following terms:

IO_i is the IO read/write data volume required by the i-th subtask of the target data query task under the computing engine, and IO_useable is the IO bandwidth currently available on the data platform;

CPU_i is the CPU resource required by the i-th subtask of the target data query task under the computing engine, and CPU_useable is the CPU resource currently available on the data platform;

Mem_i is the memory resource required by the i-th subtask of the target data query task under the computing engine, and Mem_useable is the memory resource currently available on the data platform.
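Claim 2 describes the cost as being determined from three sums over the subtasks: the ratios of required IO volume to available IO bandwidth, of required CPU to available CPU, and of required memory to available memory. The following is a minimal sketch of that calculation. How the three sums are combined is not stated in this text, so the plain unweighted total used here is an assumption, and the class and function names (`SubtaskDemand`, `PlatformCapacity`, `resource_cost`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SubtaskDemand:
    """Resources one subtask needs under a given engine (IO_i, CPU_i, Mem_i)."""
    io: float
    cpu: float
    mem: float


@dataclass(frozen=True)
class PlatformCapacity:
    """Resources currently available on the data platform."""
    io_useable: float
    cpu_useable: float
    mem_useable: float


def resource_cost(subtasks: Sequence[SubtaskDemand],
                  platform: PlatformCapacity) -> float:
    """Sum the per-subtask demand/availability ratios for IO, CPU and memory.

    Assumption: the three sums are added with equal weight.
    """
    if min(platform.io_useable, platform.cpu_useable, platform.mem_useable) <= 0:
        raise ValueError("available resources must be positive")
    io_sum = sum(s.io / platform.io_useable for s in subtasks)
    cpu_sum = sum(s.cpu / platform.cpu_useable for s in subtasks)
    mem_sum = sum(s.mem / platform.mem_useable for s in subtasks)
    return io_sum + cpu_sum + mem_sum
```

The function would be called once per computing engine, each time with the demands the subtasks have under that engine, which yields one cost per engine for the later comparison.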

In a further embodiment of the above device embodiment, in the process of determining the target computing engine for executing the target data query task according to the data volume and the resource consumption cost, the execution module is specifically configured to:

when it is determined that the data volume exceeds a first threshold, determine the Hive computing engine as the target computing engine;

when it is determined that the data volume does not exceed the first threshold, determine the computing engine corresponding to the minimum of the resource computing costs under the computing engines as the target computing engine.
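The selection rule above can be sketched as follows. This is an illustrative sketch only: the function name `select_engine`, the engine labels and the dictionary of per-engine costs are assumptions introduced for the example.

```python
from typing import Mapping


def select_engine(data_volume: float,
                  first_threshold: float,
                  cost_by_engine: Mapping[str, float]) -> str:
    """Pick the engine for a query task.

    Large tasks (volume above the first threshold) always go to Hive;
    otherwise the engine with the lowest resource cost is chosen.
    """
    if data_volume > first_threshold:
        return "hive"
    if not cost_by_engine:
        raise ValueError("no candidate engines supplied")
    # min() over the keys, ordered by each engine's cost.
    return min(cost_by_engine, key=cost_by_engine.__getitem__)
```

For example, with costs `{"hive": 1.0, "spark": 0.5}` a task below the threshold would be routed to Spark, while the same task above the threshold would be routed to Hive regardless of cost.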

In a further embodiment of the above device embodiment, the device further includes an acquisition module which, before the target computing engine for executing the target data query task is determined according to the data volume, the resource computing cost and a preset computing engine selection strategy, is configured to:

obtain an execution cache table corresponding to the target data query task;

execute the target data query task according to the computing engine pre-stored in the execution cache table.

In a further embodiment of the above device embodiment, the execution cache table includes a computing engine execution table and data cache valid time information; correspondingly, in the process of executing the target data query task according to the computing engine pre-stored in the execution cache table, the acquisition module is specifically configured to:

after determining, according to the data cache valid time information, that the target data query task is cached, parse the computing engine execution table to obtain the pre-stored computing engine, and execute the target data query task with that computing engine.
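A minimal sketch of such an execution cache table is given below. The layout is an assumption made for illustration: the computing engine execution table is modelled as a mapping from a task key to an engine name, and the data cache valid time information as a per-task expiry timestamp. The class and method names are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ExecutionCacheTable:
    # Computing engine execution table: task key -> engine name.
    engine_by_task: Dict[str, str] = field(default_factory=dict)
    # Data cache valid time information: task key -> expiry (epoch seconds).
    expires_at: Dict[str, float] = field(default_factory=dict)

    def put(self, task_key: str, engine: str, ttl_seconds: float,
            now: Optional[float] = None) -> None:
        """Record which engine ran the task and how long the entry is valid."""
        now = time.time() if now is None else now
        self.engine_by_task[task_key] = engine
        self.expires_at[task_key] = now + ttl_seconds

    def lookup(self, task_key: str, now: Optional[float] = None) -> Optional[str]:
        """Return the pre-stored engine if the entry is still valid, else None."""
        now = time.time() if now is None else now
        expiry = self.expires_at.get(task_key)
        if expiry is None or now >= expiry:
            return None
        return self.engine_by_task.get(task_key)
```

When `lookup` returns an engine name, the task can be executed with that engine directly; when it returns `None`, the caller would fall back to the data volume and cost based selection.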

Since the device described in this embodiment of the present invention works on the same principle as the method described in the foregoing embodiments, a more detailed explanation is not repeated here.

It should be noted that, in the embodiments of the present invention, the relevant functional modules may be implemented by a hardware processor.

The data task execution device provided by the embodiment of the present invention selects a matching computing engine from the computing engines to execute the data query task, based on the data volume of the target data query task and the resource consumption cost of the target data query task under each preset computing engine. This prevents a mismatched computing engine from affecting the execution of the data query task, and improves the execution efficiency and stability of the data query task.

Fig. 4 shows a schematic structural diagram of a data task execution system provided by an embodiment of the present invention. Referring to Fig. 4, the system includes a task parser 31, a task compiler 32, a computing engine selector 33 and a metadata repository 34, wherein:

The task parser 31 is configured to determine a target data query task;

The task compiler 32 is configured to determine the data volume of the target data query task, and the resource consumption cost of the target data query task under each preset computing engine;

The computing engine selector 33 is configured to determine a target computing engine for executing the target data query task according to the data volume and the resource consumption cost, and to execute the target data query task using the target computing engine;

The metadata repository 34 is configured to store the data required for executing the target data query task.
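The way these components hand work to one another can be sketched as a small pipeline. This is an assumption-laden illustration rather than the patented implementation: each component is represented by a plain callable supplied by the caller, all names are hypothetical, and the metadata repository is not modelled separately (a real task compiler would presumably consult it).

```python
from typing import Any, Callable, Mapping, Tuple


def run_query(statement: str,
              parse: Callable[[str], Any],
              compile_task: Callable[[Any], Tuple[float, Mapping[str, float]]],
              select: Callable[[float, Mapping[str, float]], str],
              engines: Mapping[str, Callable[[Any], Any]]) -> Tuple[str, Any]:
    """Parse, estimate, select an engine, then execute on that engine."""
    task = parse(statement)                          # task parser
    data_volume, cost_by_engine = compile_task(task) # task compiler
    engine_name = select(data_volume, cost_by_engine)  # engine selector
    result = engines[engine_name](task)              # run on the chosen engine
    return engine_name, result
```

Passing the components in as callables keeps the sketch independent of any particular engine client; in practice each callable would wrap the corresponding parser, estimator or engine submission code.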

Since the system described in this embodiment of the present invention works on the same principle as the method described in the foregoing embodiments, a more detailed explanation is not repeated here.

It should be noted that, in the embodiments of the present invention, the relevant functional units may be implemented by a hardware processor.

The data task execution system provided by the embodiment of the present invention selects a matching computing engine from the computing engines to execute the data query task, based on the data volume of the target data query task and the resource consumption cost of the target data query task under each preset computing engine. This prevents a mismatched computing engine from affecting the execution of the data query task, and improves the execution efficiency and stability of the data query task.

Fig. 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 5, the electronic device may include a processor 41, a communications interface 42, a memory 43 and a communication bus 44, wherein the processor 41, the communications interface 42 and the memory 43 communicate with one another through the communication bus 44. The processor 41 can call logic instructions in the memory 43 to perform the following method: determining the data volume of a target data query task, and the resource consumption cost of the target data query task under each preset computing engine; determining a target computing engine for executing the target data query task according to the data volume and the resource computing cost, and executing the target data query task using the target computing engine.

In addition, when the above logic instructions in the memory 43 are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

An embodiment of the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the method provided by the above embodiments, for example including: determining the data volume of a target data query task, and the resource consumption cost of the target data query task under each preset computing engine; determining a target computing engine for executing the target data query task according to the data volume and the resource computing cost, and executing the target data query task using the target computing engine.

From the above description of the implementations, those skilled in the art can clearly understand that each implementation can be realized by means of software plus a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solution in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data task execution method, characterized by comprising:

determining the data volume of a target data query task, and the resource consumption cost of the target data query task under each preset computing engine;

determining a target computing engine for executing the target data query task according to the data volume and the resource consumption cost, and executing the target data query task according to the target computing engine;

wherein the target data query task is obtained by parsing a target query statement and consists of multiple subtasks, and correspondingly, determining the resource consumption cost of the target SQL task under each preset computing engine includes:

determining the IO read/write data volume, CPU resources and memory resources required by each subtask of the target data query task under each computing engine, as well as the IO bandwidth, CPU resources and memory resources currently available on the data platform;

determining the resource consumption cost of the target data query task under each computing engine according to the IO read/write data volume, CPU resources and memory resources required by each subtask under each computing engine, and the IO bandwidth, CPU resources and memory resources currently available on the data platform;

wherein determining the target computing engine for executing the target data query task according to the data volume and the resource computing cost includes:

when it is determined that the data volume exceeds a first threshold, determining the Hive computing engine as the target computing engine;

when it is determined that the data volume does not exceed the first threshold, determining the computing engine corresponding to the minimum of the resource consumption costs under the computing engines as the target computing engine;

wherein the first threshold is a boundary value for judging whether the SQL task is a task with a large data volume.

2. The data task execution method according to claim 1, characterized in that determining the resource consumption cost of the target data query task under each computing engine according to the IO read/write data volume, CPU resources and memory resources required by each subtask under each computing engine, and the IO bandwidth, CPU resources and memory resources currently available on the data platform, includes:

determining, for the subtasks under each computing engine, the sum of the ratios of the IO read/write data volume to the IO bandwidth currently available on the data platform, the sum of the ratios of the CPU resources to the CPU resources currently available on the data platform, and the sum of the ratios of the memory resources to the memory resources currently available on the data platform;

determining the resource consumption cost of the target data query task under each computing engine according to the sums.

3. The data task execution method according to claim 1, further comprising:

obtaining an execution cache table corresponding to the target data query task;

executing the target data query task according to the computing engine pre-stored in the execution cache table.

4. The data task execution method according to claim 3, wherein the execution cache table includes a computing engine execution table and data cache valid time information, and correspondingly, executing the target data query task according to the computing engine pre-stored in the execution cache table includes:

after determining, according to the data cache valid time information, that the target data query task is cached, parsing the computing engine execution table to obtain the pre-stored computing engine, and executing the target data query task according to the pre-stored computing engine.

5. A data task execution device, characterized by comprising:

A determining module, configured to determine the data volume of the target data query task, and the resource consumption cost of the target data query task under each preset computing engine;

An execution module, configured to determine a target computing engine for executing the target data query task according to the amount of data and the resource consumption cost, and execute the target data query task according to the target computing engine;

In the process of determining the resource consumption cost of the target data query task under each preset calculation engine, the determining module is specifically used for:

In the process of determining the target computing engine that executes the target data query task according to the data volume and resource consumption cost, the execution module is specifically used for:

When it is determined that the amount of data does not exceed the first threshold, determine the computing engine corresponding to the minimum resource computing cost under each computing engine as the target computing engine;

6. A data task execution system, comprising:

A task parser for determining a target data query task;

A task compiler, configured to determine the data volume of the target data query task, and the resource consumption cost of the target data query task under each preset computing engine;

A computing engine selector, configured to determine a target computing engine for executing the target data query task according to the amount of data and the resource consumption cost, and execute the target data query task according to the target computing engine;

a metadata repository, configured to store data required for executing the target data query task;

In the process of determining the resource consumption cost of the target data query task under each preset computing engine, the task compiler is specifically used for:

In the process of determining the target computing engine that executes the target data query task according to the data volume and resource consumption cost, the computing engine selector is specifically used for:

7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the data task execution method according to any one of claims 1 to 4.

8. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the data task execution method according to any one of claims 1 to 4 are implemented.