CN115115052A

CN115115052A - An Adaptive Distributed Parallel Training Method of Neural Network Based on Genetic Algorithm

Info

Publication number: CN115115052A
Application number: CN202210963656.8A
Authority: CN
Inventors: 倪益杰; 曾艳; 张纪林; 任永坚; 袁俊峰; 韩猛; 徐梦楚
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2022-08-11
Filing date: 2022-08-11
Publication date: 2022-09-27
Anticipated expiration: 2042-08-11
Also published as: CN115115052B

Abstract

The invention discloses a neural network self-adaptive distributed parallel training method based on a genetic algorithm. The method comprises the steps of firstly converting an operator placement problem in model parallel into an integer linear programming problem, constructing a solution space, and simplifying the scale of the solution space. And secondly, constructing a cost evaluation model, evaluating the quality of a linear programming problem solution, and guiding two groups of population to perform iterative updating between the two populations. And then converting the two sets of placing strategies into operator placing strategies of the data flow graph after the termination condition of the iterative updating is reached, and training the model by the operator placing strategies to select the optimal distributed scheduling strategy. And finally, performing distributed scheduling on the model to be trained by using the obtained distributed scheduling strategy, and inputting the data set into the neural network for distributed training. The invention accelerates the algorithm searching speed on the premise of ensuring the solution space quality, evaluates the execution performance of the operator placing strategy with fine granularity, and effectively avoids falling into the local optimal condition.

Description

An Adaptive Distributed Parallel Training Method of Neural Network Based on Genetic Algorithm

技术领域technical field

本发明属于分布式机器学习领域，主要涉及一种基于遗传算法的神经网络自适应分布式并行训练方法，为图像识别、自然语言处理等人工智能场景下的大规模复杂神经网络提供最优的模型并行训练方法。The invention belongs to the field of distributed machine learning, and mainly relates to a neural network adaptive distributed parallel training method based on genetic algorithm, which provides an optimal model for large-scale complex neural networks in artificial intelligence scenarios such as image recognition and natural language processing. Parallel training method.

技术背景technical background

近年来，随着数据规模的增大，深度学习模型越来越复杂，深度学习模型参数规模由万级增长到亿级、万亿级。深度学习大模型由于具备强大的海量数据挖掘和分析、预训练、快推理等能力，逐渐发展为下一代AI技术。然而，数据集规模和模型参数量的快速增长使得AI模型训练受到了硬件设备的限制，单设备计算能力和存储能力有限，无法应对在大数据集上训练此类大模型。例如，Bert-Large拥有4亿个参数，其占用内存超过32GB；GPT-3拥有1750亿个参数，占用内存350GB以上。1000层残差网络的普通实现便需要48GB内存，即使经过进一步的优化以降低内存成本，DNN模型仍需至少7GB内存。这使得在内存受限的单设备上运行整个模型，并进行数据并行训练变得极为艰难。同时，由于深度学习模型规模增大，所需要的数据和计算量也越来越大，单设备计算能力往往无法满足需求。因此，采用跨设备的分布式并行技术训练深度学习大模型，已成为深度学习在工业界发展和应用的关键需求。In recent years, with the increase of the data scale, the deep learning model has become more and more complex, and the parameter scale of the deep learning model has increased from 10,000 to 100 million and 1 trillion. Deep learning large models have gradually developed into the next-generation AI technology due to their powerful capabilities in massive data mining and analysis, pre-training, and fast reasoning. However, the rapid growth of dataset size and model parameters makes AI model training limited by hardware devices, and the computing power and storage capacity of a single device are limited, which cannot cope with training such large models on large datasets. For example, Bert-Large has 400 million parameters and occupies more than 32GB of memory; GPT-3 has 175 billion parameters and occupies more than 350GB of memory. A common implementation of a 1000-layer residual network requires 48GB of memory, and even after further optimization to reduce memory costs, the DNN model still requires at least 7GB of memory. This makes it extremely difficult to run the entire model on a single device with limited memory and data-parallel training. At the same time, due to the increase in the scale of deep learning models, the required data and computing volume are also increasing, and the computing power of a single device is often unable to meet the needs. Therefore, the use of cross-device distributed parallel technology to train large deep learning models has become a key requirement for the development and application of deep learning in the industry.

当前主流的分布式机器学习系统如TensorFlow、PyTorch和MindSpore，主要使用数据并行(Data Parallel)的方式加快模型训练速度。然而，数据并行通常只能解决海量数据训练的问题，无法解决大模型受到硬件设备资源限制而无法训练的问题。因此，为了满足大规模网络模型的训练需求，必须对神经网络进行切分并跨设备调度，实现模型并行(Model Parallel)。在模型并行策略的制定方面，通常将神经网络计算图横向按层划分、纵向跨层划分或随机划分并调度至不同设备执行，但这些方法严重依赖专家经验，切分方式不合理，集群利用率低而且通常具有较大通信开销。The current mainstream distributed machine learning systems, such as TensorFlow, PyTorch, and MindSpore, mainly use data parallelism to speed up model training. However, data parallelism can usually only solve the problem of massive data training, and cannot solve the problem that large models cannot be trained due to the limitation of hardware resources. Therefore, in order to meet the training requirements of large-scale network models, the neural network must be segmented and scheduled across devices to achieve Model Parallel. In terms of model parallel strategy formulation, the neural network computing graph is usually divided horizontally by layers, vertically across layers or randomly divided and dispatched to different devices for execution. However, these methods rely heavily on expert experience, the segmentation method is unreasonable, and the cluster utilization rate low and usually has a large communication overhead.

基于机器学习的并行策略自动搜索方法是通过抽取神经网络模型和设备拓扑结构特征，利用机器学习建模指导并行策略的搜索，以得到最优的分布式并行策略。Google团队提出了策略自动搜索框架ColorRL和Hierarchical，通过提取神经网络模型和设备特征并利用强化学习指导并行策略的搜索。但上述方法需要频繁采样，使得策略搜索过程代价高昂。Gao等人提出Spotlight，第一次将神经网络算子调度问题建模为马尔可夫决策过程(Markov decision process,MDP)，随后提出的POST则对其进一步改进，将交叉熵引入采样过程，提高搜索效率。MIT的Ravichandra提出了Placeto，引入了图嵌入方法(GraphEmbedding)使得神经网络可以学习模型的结构特征，使其具备策略快速迁移的能力。Hao等人提出AutoSync自适应框架，基于数据并行场景，利用机器学习方法预测数据并行策略执行时间，从而指导同步并行策略的搜索。Ding和Zhou等人设计了基于强化学习的全量搜索方法可以一次性生成整图放置策略，并支持策略在结构相近的模型间快速迁移。Lan等人通过建立自监督的图神经网络捕获神经网络的结构信息，并利用Seq2Seq(Sequence-to-Sequence)网络对深度学习模型进行分割与调度。上述基于机器学习的方法虽然可以搜索到执行性能优异的并行策略，但通常局限于中小型神经网络模型，并且依赖于硬件资源，搜索过程往往需要耗费数十个小时，不适合快速部署或训练模型的场景。The automatic search method of parallel strategy based on machine learning is to extract the characteristics of neural network model and equipment topology, and use machine learning modeling to guide the search of parallel strategy to obtain the optimal distributed parallel strategy. The Google team proposed the automatic policy search framework ColorRL and Hierarchical, by extracting neural network model and device features and using reinforcement learning to guide the search of parallel policies. However, the above methods require frequent sampling, which makes the policy search process expensive. Gao et al. proposed Spotlight, which for the first time modeled the neural network operator scheduling problem as a Markov decision process (MDP), and then proposed POST to further improve it, introducing cross-entropy into the sampling process to improve search efficiency. Ravichandra of MIT proposed Placeto, which introduced a graph embedding method (GraphEmbedding) so that the neural network can learn the structural features of the model and enable it to have the ability to quickly transfer strategies. Hao et al. proposed the AutoSync adaptive framework, which uses machine learning methods to predict the execution time of data-parallel strategies based on data-parallel scenarios, thereby guiding the search for synchronous-parallel strategies. Ding and Zhou et al. designed a full-scale search method based on reinforcement learning, which can generate an entire graph placement strategy at one time, and support the rapid transfer of strategies between models with similar structures. Lan et al. captured the structural information of the neural network by building a self-supervised graph neural network, and used the Seq2Seq (Sequence-to-Sequence) network to segment and schedule the deep learning model. Although the above-mentioned methods based on machine learning can search for parallel strategies with excellent execution performance, they are usually limited to small and medium-sized neural network models and rely on hardware resources. The search process often takes dozens of hours, which is not suitable for rapid deployment or training of models. scene.

基于图算法的策略自动搜索方法是当前另一种主流方法，相比基于基于机器学习的并行策略自动搜索方法，具有更快的搜索速度。Jia等人提出了OptCNN通过将并行策略抽象为张量的切分，利用动态规划思想在整个搜索空间中搜索最优的策略，但其粗粒度的模型划分以及粗糙的代价评估模型使得搜索到的策略提升性能有限。随后，Jia等人在OptCNN的基础上改进并提出了FlexFlow框架，引入了SOAP(Sample，Operator，Attribute，Parameter)的思想建立高维搜索空间，并利用马尔可夫决策过程在搜索空间内快速搜索最优解，但其实现复杂仅适用于少量的神经网络模型，且代价评估未考虑算子并行执行的情况。为了能够更快的搜索策略以及适用更多的模型，Yi等人提出了基于DAG调度的FastT算法,通过设置细粒度的算子优先级和利用关键路径对算子进行放置和调度，但其未考虑设备训练时的动态内存占比，且不适用动态RNN。Beomyeol等人提出了Beachi框架，提出了基于拓扑排序、最早开始时间和最小通信量三种图算法进行策略的搜索，最快可以在数十秒的时间内搜索出模型并行策略，并且支持大部分的模型，但是其依托于特定条件使得在不同的神经网络模型上搜索到的策略效果不一。上述方法虽然搜索速度较快但通常依托于指定的搜索思想，对策略的搜索方向考虑的维度不全面，并且缺乏全面性的代价评估模型，使得方法容易搜索到次优解。The automatic strategy search method based on graph algorithm is another mainstream method at present. Compared with the parallel automatic strategy search method based on machine learning, it has a faster search speed. Jia et al. proposed that OptCNN uses dynamic programming to search for the optimal strategy in the entire search space by abstracting the parallel strategy into tensor segmentation, but its coarse-grained model division and coarse cost evaluation model make the searched The policy improves performance is limited. Subsequently, Jia et al. improved and proposed the FlexFlow framework on the basis of OptCNN, introduced the idea of SOAP (Sample, Operator, Attribute, Parameter) to establish a high-dimensional search space, and used the Markov decision process to quickly search in the search space. The optimal solution, but its complex implementation is only suitable for a small number of neural network models, and the cost evaluation does not consider the parallel execution of operators. In order to search for strategies faster and apply more models, Yi et al. proposed the FastT algorithm based on DAG scheduling, which places and schedules operators by setting fine-grained operator priorities and using critical paths. Consider the dynamic memory ratio during device training, and dynamic RNN is not applicable. Beomyeol et al. proposed the Beachi framework, and proposed three graph algorithms based on topological sorting, earliest start time and minimum traffic to search for strategies. The model parallel strategy can be searched in tens of seconds at the fastest, and supports most of the However, depending on specific conditions, the strategies searched on different neural network models have different effects. Although the search speed of the above method is fast, it usually relies on the specified search idea, and the dimensions considered in the search direction of the strategy are not comprehensive, and there is a lack of a comprehensive cost evaluation model, which makes the method easy to search for sub-optimal solutions.

发明内容SUMMARY OF THE INVENTION

本发明为了解决上述不足，加快AI模型的训练速度，设计并实现一种基于遗传算法的神经网络自适应分布式并行训练方法。In order to solve the above deficiencies and speed up the training speed of the AI model, the present invention designs and implements a neural network adaptive distributed parallel training method based on genetic algorithm.

一种基于遗传算法的神经网络自适应分布式并行训练方法，步骤可概括如下：A neural network adaptive distributed parallel training method based on genetic algorithm, the steps can be summarized as follows:

步骤1:将神经网络模型中算子放置问题转化整数线性规划问题，基于神经网络模型的数据流图和设备拓扑图构建解空间，并通过设置假设约束和算子分组约束条件简化解空间的规模。Step 1: Transform the operator placement problem in the neural network model into an integer linear programming problem, construct a solution space based on the data flow graph and device topology graph of the neural network model, and simplify the scale of the solution space by setting assumption constraints and operator grouping constraints .

1-1，提取神经网络模型结构并组建设备资源组，抽象得到计算图G(O,E)和设备拓扑图D。在计算图G(O,E)中，顶点O表示神经网络模型的算子集合，E表示顶点间的有向边集合，设备拓扑图D是由M个计算设备组成。1-1, extract the neural network model structure and set up a device resource group, and abstractly obtain a computational graph G(O, E) and a device topology graph D. In the computation graph G(O, E), the vertex O represents the operator set of the neural network model, E represents the directed edge set between the vertices, and the device topology graph D is composed of M computing devices.

1-2，给定计算图G(O,E)和设备拓扑图D，在满足设备内存约束的前提下，寻找一组放置策略使得执行代价最小，从而将神经网络模型的算子在设备上的放置问题进一步转化为整数线性规划问题，并基于数据流图和设备拓扑图构建解空间。1-2, given the calculation graph G(O, E) and the device topology graph D, on the premise of satisfying the device memory constraints, find a set of placement strategies to minimize the execution cost, so that the operators of the neural network model are placed on the device. The placement problem is further transformed into an integer linear programming problem, and a solution space is constructed based on the data flow graph and device topology graph.

1-3，通过分析同构集群内的设备特征以及神经网络模型执行特征提出了两点假设：同构设备的计算和通信性能相同；算子的执行结束时间只与存在直接依赖关系的算子有关。1-3, by analyzing the device characteristics in the homogeneous cluster and the execution characteristics of the neural network model, two assumptions are put forward: the computing and communication performance of the homogeneous devices are the same; the execution end time of the operator is only related to the operator that has a direct dependency relationship. related.

通过上述假设将存在数据依赖关系的两个算子的放置情况约束为两种，其中一种是两个算子放在同一个设备上，另外一种是两个算子放置在不同设备上。其中，当两个算子放在同一个设备上时表明该两个算子的通信代价大于计算代价；当两个算子放在不同的设备上时表明该两个算子的计算代价大于通信代价。Based on the above assumptions, the placement of two operators with data dependencies is constrained to two, one of which is that the two operators are placed on the same device, and the other is that the two operators are placed on different devices. Among them, when two operators are placed on the same device, it indicates that the communication cost of the two operators is greater than the computational cost; when the two operators are placed on different devices, it indicates that the computational cost of the two operators is greater than the communication cost cost.

1-4，随后进一步提出了算子分组约束，存在数据依赖关系的两个算子中，若其中一个算子满足出度或入度为1的条件，便将这两个算子约束为同一组，并将组内的算子视为一个整体进行调度，以此简化计算图的结构，从而缩小解空间的规模。利用上述假设约束和算子分组约束简化搜索空间，以此加快算法的搜索速度。1-4, the operator grouping constraints are further proposed. Among the two operators with data dependencies, if one of the operators satisfies the condition that the out-degree or in-degree is 1, the two operators are constrained to be the same. group, and the operators in the group are regarded as a whole for scheduling, which simplifies the structure of the computational graph and reduces the size of the solution space. The above assumption constraints and operator grouping constraints are used to simplify the search space to speed up the search speed of the algorithm.

步骤2:构建计算代价、通信代价和负载代价融合的代价评估模型以综合评估整数线性规划问题解的优劣，即算子放置策略的执行性能，指导双种群遗传算法在两个种群(放置策略)之间进行基因突变及基因互换迭代更新两组放置策略。Step 2: Construct a cost evaluation model integrating computing cost, communication cost and load cost to comprehensively evaluate the pros and cons of the solution of the integer linear programming problem, that is, the execution performance of the operator placement strategy, and guide the dual population genetic algorithm in the two populations (placement strategy). ) between the two groups of placement strategies were iteratively updated by gene mutation and gene exchange.

2-1，构建计算代价、通信代价和负载代价融合的代价评估模型。2-1, build a cost evaluation model that integrates computing cost, communication cost and load cost.

首先，分析影响神经网络模型执行性能的关键因素，采用神经网络模型中算子在设备上的真实执行时间，建立计算代价L_i；采用神经网络模型中算子间跨设备通信的总张量大小，建立通信代价W；采用各个设备内存负载偏移和平均负载的比例之和，建立负载代价f。Firstly, the key factors affecting the execution performance of the neural network model are analyzed, and the real execution time of the operators in the neural network model on the device is used to establish the calculation cost _Li ; the total tensor size of the cross-device communication between operators in the neural network model is used. , establish the communication cost W; use the sum of the ratio of the memory load offset and the average load of each device to establish the load cost f.

其次，通过综合考量计算代价、通信代价、负载代价以及模型实际训练过程的特征，建立计算代价、通信代价和负载代价融合的代价评估模型Secondly, by comprehensively considering the characteristics of the calculation cost, communication cost, load cost and the actual training process of the model, a cost evaluation model integrating the calculation cost, communication cost and load cost is established.

π_θ＝(∑L_i+λ×Ω(W))×(1+f)π _θ =(∑L _i +λ×Ω(W))×(1+f)

其中，π_θ表示放置策略的执行代价；λ表示权重占比，模拟计算和通信并行的情况；Ω为张量大小和通信时间的线性拟合模型。Among them, π _θ represents the execution cost of the placement strategy; λ represents the weight ratio, simulating the parallel situation of computing and communication; Ω is the linear fitting model of tensor size and communication time.

2-2，基于双种群遗传算法迭代更新放置策略。2-2, iteratively update the placement strategy based on the dual population genetic algorithm.

首先，利用放置策略的初始化方法，在解空间内生成两组带有初始算子放置信息的放置策略。放置策略的初始化方法包括随机初始化和指定策略初始化，随机初始化方法是指将计算图G(O,E)中的算子以随机的方式分配到设备拓扑图D上，指定策略初始化方法是指利用现有的并行策略搜索方法搜索到的放置策略作为初始状态，任选一种初始化方法进行放置策略的初始化。First, using the initialization method of placement strategy, two sets of placement strategies with initial operator placement information are generated in the solution space. The initialization methods of the placement strategy include random initialization and specified strategy initialization. The random initialization method refers to assigning the operators in the calculation graph G(O, E) to the device topology graph D in a random manner. The specified strategy initialization method refers to using The placement strategy searched by the existing parallel strategy search method is used as the initial state, and an initialization method is selected to initialize the placement strategy.

其次，基于代价评估模型，指导双种群遗传算法的基因突变及基因互换方法，迭代更新两组放置策略中算子的放置信息，实现快速在解空间内搜索执行代价最小的放置策略。Secondly, based on the cost evaluation model, the gene mutation and gene exchange methods of the dual population genetic algorithm are guided, and the placement information of the operators in the two placement strategies is updated iteratively, so as to quickly search for the placement strategy with the least cost in the solution space.

步骤3:当到达放置策略迭代更新的终止条件后，分别将两组放置策略转化为数据流图的算子放置策略，并在真实环境下根据迭代得到的算子放置策略对神经网络模型进行训练，选取执行时间最短的策略作为神经网络模型的最优分布式调度策略。Step 3: When the termination condition of the iterative update of the placement strategy is reached, the two sets of placement strategies are respectively transformed into the operator placement strategies of the data flow graph, and the neural network model is trained according to the iteratively obtained operator placement strategies in the real environment. , select the strategy with the shortest execution time as the optimal distributed scheduling strategy of the neural network model.

其中，迭代终止的条件包括总迭代次数达到指定阈值和放置策略在一定迭代次数内不再发生变化，当双种群遗传算法的迭代过程满足上述两个条件中的其中一个时即终止迭代。The conditions for the termination of the iteration include that the total number of iterations reaches a specified threshold and the placement strategy does not change within a certain number of iterations. The iteration is terminated when the iterative process of the dual-population genetic algorithm satisfies one of the above two conditions.

步骤4：利用步骤3得到的分布式调度策略，对待训练的神经网络模型进行分布式调度，并将图像识别、自然语言处理等模型所需的数据集输入到的神经网络模型中进行分布式训练。Step 4: Using the distributed scheduling strategy obtained in Step 3, perform distributed scheduling of the neural network model to be trained, and input the data sets required by the image recognition, natural language processing and other models into the neural network model for distributed training .

本发明具有的有益效果是：将神经网络模型中算子放置问题转化为整数线性规划问题，直观的描述待求解的问题，并基于数据流图和设备拓扑图构建该整数线性规划问题的解空间，然后引入假设约束和算子分组约束，实现解空间的简化，在保证解空间质量的前提下加快算法搜索速度；分析影响神经网络模型执行性能的关键因素，构建代价评估模型，以细粒度评估算子放置策略的执行性能；双种群遗传算法的提出，利用的基因突变以及基因互换方法加快遗传算法迭代速度，有效避免陷入局部最优情况，帮助算法向全局最优解靠近。The invention has the beneficial effects of: transforming the operator placement problem in the neural network model into an integer linear programming problem, intuitively describing the problem to be solved, and constructing the solution space of the integer linear programming problem based on the data flow diagram and the equipment topology diagram , and then introduce assumption constraints and operator grouping constraints to simplify the solution space and speed up the algorithm search speed on the premise of ensuring the quality of the solution space; analyze the key factors that affect the performance of the neural network model, build a cost evaluation model, and evaluate with fine-grained The execution performance of the operator placement strategy; the proposal of the dual population genetic algorithm, the use of gene mutation and gene exchange methods to speed up the iteration speed of the genetic algorithm, effectively avoid falling into the local optimal situation, and help the algorithm to approach the global optimal solution.

附图说明Description of drawings

图1是基于遗传算法搜索最优并行策略示意图；Figure 1 is a schematic diagram of searching for an optimal parallel strategy based on a genetic algorithm;

图2是算子对示意图；Fig. 2 is a schematic diagram of an operator pair;

图3是基因突变方法中基因变异示意图；Figure 3 is a schematic diagram of gene mutation in the gene mutation method;

图4是基因突变和基因互换方法中迭代更新示意图；Fig. 4 is a schematic diagram of iterative update in gene mutation and gene exchange method;

图5是基因互换方法中优质算子对交叉互换互换示意图。Figure 5 is a schematic diagram of cross-exchange of high-quality operator pairs in the gene exchange method.

具体实施方式Detailed ways

下面将结合附图和具体实施步骤对本发明做出进一步说明：The present invention will be further described below in conjunction with the accompanying drawings and specific implementation steps:

一种基于遗传算法的神经网络自适应分布式并行训练方法，整体流程图如图1所示，包括以下具体步骤：A neural network adaptive distributed parallel training method based on genetic algorithm, the overall flow chart is shown in Figure 1, including the following specific steps:

步骤1:将神经网络模型中算子放置问题转化整数线性规划问题，基于神经网络模型的数据流图和设备拓扑图构建解空间，并通过设置假设约束、算子分组约束条件简化解空间的规模：Step 1: Convert the operator placement problem in the neural network model into an integer linear programming problem, build a solution space based on the data flow graph and device topology graph of the neural network model, and simplify the scale of the solution space by setting assumption constraints and operator grouping constraints :

1-1，根据神经网络数据流图，定义计算图G(O,E)，其中O表示算子集合，节点o_i∈O表示单个算子(例如矩阵乘、卷积等)，E表示节点之间的有向边集合，指示算子之间的数据依赖关系；根据集群设备拓扑信息，定义设备拓扑图D，设备拓扑图D是由M个计算设备组成，其中节点d∈D表示某计算设备(例如CPU或GPU)。1-1, according to the neural network data flow graph, define the computation graph G(O, E), where O represents the set of operators, node o _i ∈ O represents a single operator (such as matrix multiplication, convolution, etc.), and E represents the node The set of directed edges between them indicates the data dependency between operators; according to the cluster device topology information, the device topology graph D is defined, and the device topology graph D is composed of M computing devices, in which the node d∈D represents a certain computing device. device (e.g. CPU or GPU).

1-2，给定计算图G(O,E)和设备拓扑图D，在满足设备内存约束的前提下，寻找一组放置策略使得执行代价最小，将神经网络模型的算子和设备之间的放置问题进一步转化为整数线性规划问题，并利用基于神经网络模型数据流图抽象得到的计算图G(O,E)和设备拓扑图D构建该整数线性规划问题的解空间，如式(1)所示：1-2, given the calculation graph G(O, E) and the device topology graph D, on the premise of satisfying the device memory constraints, find a set of placement strategies to minimize the execution cost, and connect the operators of the neural network model to the device. The placement problem is further transformed into an integer linear programming problem, and the solution space of the integer linear programming problem is constructed by using the computational graph G(O, E) and the device topology graph D abstracted from the data flow graph of the neural network model, as shown in Eq. (1 ) as shown:

其中，P表示算子的放置策略，F表示计算图F(O,E)，D表示设备拓扑图，π_θ表示策略P在当前计算图G和设备拓扑图D下的代价开销。式中o_i表示算子集合O中的算子，d表示设备拓扑图D中的计算设备，

表示算子o_i的放置策略产生的内存开销，M(d)表示计算设备d的最大内存。Among them, P represents the placement strategy of the operator, F represents the computation graph F(O, E), D represents the device topology map, and π _θ represents the cost of the strategy P under the current computation graph G and device topology map D. where o _i represents the operator in the operator set O, d represents the computing device in the device topology diagram D,

Represents the memory overhead generated by the placement strategy of the operator _oi , and M(d) represents the maximum memory of the computing device d.

1-3，设置假设约束和算子分组约束简化解空间规模。由于同构设备间性能的差异性较小且算子的前继节点直接影响了算子执行的开始时间。所以，针对基于上述特点，提出了以下两点假设：同构设备的计算和通信性能相同；算子的执行结束时间只与存在直接依赖关系的算子有关。并基于上述假设，对解空间的范围进行约束，假设约束条件如式(2)所示:1-3, set assumption constraints and operator grouping constraints to simplify the solution space scale. Because the performance difference between homogeneous devices is small and the operator's predecessor node directly affects the start time of the operator's execution. Therefore, based on the above characteristics, the following two assumptions are put forward: the computing and communication performance of isomorphic devices are the same; the execution end time of operators is only related to operators with direct dependencies. And based on the above assumptions, the scope of the solution space is constrained, and the constraints are assumed as shown in formula (2):

其中，d_i,d_j表示设备拓扑图D中的计算设备，

和

表示同构设备的性能，≡表示恒等于，o_i,o_j表示算子集合O中的算子，p_oi表示算子o的放置策略，Succ(o_i)方法表示获取o_i算子的所有后续节点，σ表示算子运算结束时间，

表示算子o_i的放置不直接影响o_j的执行结束时间。Among them, d _i , d _j represent the computing devices in the device topology diagram D,

and

Represents the performance of the homogeneous device, ≡ represents equality, o _i , o _j represent the operators in the operator set O, p _oi represents the placement strategy of the operator o, and the Succ(o _i ) method represents the acquisition of the o _i operator. For all subsequent nodes, σ represents the end time of the operator operation,

Indicates that the placement of operator o _i does not directly affect the execution end time of o _j .

1-4，针对神经网络中算子与连边数量众多的问题提出了算子分组约束，基于算子融合方法对计算图中算子进行分组，将组内的算子视为一个整体进行调度，以此简化计算图规模，进一步缩小解空间的范围。为了维持计算图的有向无环性的特征并降低检测成环的成本，算子分组约束采取了保守的分组方法，即存在数据依赖关系的两个算子中其中一个满足出度或入度为1，便将这两个算子约束为同一组，并将组内的算子视为一个整体进行调度，以此简化计算图的结构，从而缩小解空间的规模，并可保证该这两个算子间只存在一条数据链路，从而保证该两个算子进行消融不会使得计算图成环。即只当算子o_i和算子o_j满足以下式3条件时，将其约束为同一组：1-4, put forward operator grouping constraints for the large number of operators and connected edges in neural networks. Based on the operator fusion method, the operators in the calculation graph are grouped, and the operators in the group are regarded as a whole for scheduling. , which simplifies the size of the computational graph and further reduces the scope of the solution space. In order to maintain the directed acyclic feature of the computational graph and reduce the cost of detecting loops, the operator grouping constraint adopts a conservative grouping method, that is, one of the two operators with data dependencies satisfies the out-degree or in-degree is 1, the two operators are constrained to the same group, and the operators in the group are regarded as a whole for scheduling, which simplifies the structure of the calculation graph, reduces the scale of the solution space, and ensures that the two operators There is only one data link between the two operators, so as to ensure that the ablation of the two operators will not make the calculation graph loop. That is, only when the operator o _i and the operator o _j satisfy the following formula 3, they are constrained to be the same group:

其中，succ(o_i)表示算子o_i的后继节点集合，outdegree(o_i)表示算子o_i的出度，indegree(o_j)表示算子o_j的入度。Among them, succ(o _i ) represents the successor node set of operator o _i , outdegree(o _i ) represents the out-degree of operator o _i , and indegree(o _j ) represents the in-degree of operator o _j .

步骤2:基于融合计算代价、通信代价和负载代价的代价评估模型评估算子放置策略的执行性能，指导双种群遗传算法在两个种群(放置策略)之间进行基因突变及基因互换迭代更新两组放置策略。Step 2: Evaluate the execution performance of the operator placement strategy based on the cost evaluation model of the fusion calculation cost, communication cost and load cost, and guide the dual population genetic algorithm to iteratively update the gene mutation and gene exchange between the two populations (placement strategies). Two sets of placement strategies.

2-1，构建计算代价、通信代价和负载代价融合的代价评估模型2-1, build a cost evaluation model that integrates computing cost, communication cost and load cost

首先，分析影响神经网络模型执行性能的关键因素，即设备的计算性能、通信性能、内存负载的特征。采用神经网络模型中算子在设备上的真实执行时间，建立计算代价L_i；采用神经网络模型中算子间跨设备通信的总张量大小，建立通信代价W；采用各个设备内存负载偏移平均负载的比例之和，建立负载代价f。，其中计算代价L_i、通信代价W和建立负载代价f定义如下：First, analyze the key factors that affect the performance of neural network model execution, that is, the characteristics of the device's computing performance, communication performance, and memory load. Use the real execution time of the operator on the device in the neural network model to establish the calculation cost Li; use the total tensor size of the cross-device communication between operators in the neural network model to establish the communication cost _W ; use the memory load offset of each device The sum of the proportions of the average load to establish the load cost f. , where the computation cost L _i , the communication cost W and the establishment load cost f are defined as follows:

计算代价，采用算子在设备上真实执行的结束时间减去开始时间。则算子的计算代价如式(4)所示:Calculate the cost by subtracting the start time from the end time of the actual execution of the operator on the device. Then the calculation cost of the operator is shown in formula (4):

其中，

和

分别为算子o_i在设备d上的开始执行时间和结束时间。in,

and

are the start time and end time of operator o _i on device d, respectively.

通信代价，采用神经网络中算子跨GPU通信的总张量大小。则模型的通信代价如式(5)所示:Communication cost, using the total tensor size of the operator in the neural network for cross-GPU communication. Then the communication cost of the model is shown in formula (5):

其中，记S_i,j为算子o_i和算子o_j间传输的张量大小，μ_i,j为算子o_i和算子o_j是否需要跨设备传输张量，当算子o_oi和算子o_j放置在同一个设备上时μ_i,j为0表示无需跨设备传输，否则μ_i,j为1表示需要跨设备传输。where S _i,j is the size of the tensor transmitted between operator o _i and operator o _j , μ _i,j is whether operator o _i and operator o _j need to transmit tensors across devices, when operator o When _oi and operator o _j are placed on the same device, if μ _i,j is 0, it means that there is no need for cross-device transmission; otherwise, if μ _i,j is 1, it means that cross-device transmission is required.

负载代价，采用各个设备内存负载偏移和平均负载的比例之和。则模型的负载均衡代价如式(6)所示:The load cost is the sum of the ratio of the memory load offset and the average load of each device. Then the load balancing cost of the model is shown in formula (6):

其中，A为当前模型各个设备平均需要承担的内存大小，W_d表示在第d个设备上算子需要跨设备通信的张量总大小，Δ_d表示第d个设备上算子的参数大小。Among them, A is the average memory size of each device in the current model, W _d represents the total size of tensors that the operator needs to communicate across devices on the d-th device, and Δ _d represents the parameter size of the operator on the d-th device.

然后，根据上述定义的计算代价、通信代价和负载代价融合构建代价评估模型，其具体定义如下：Then, a cost evaluation model is constructed based on the calculation cost, communication cost and load cost defined above, and its specific definition is as follows:

计算代价、通信代价和负载代价融合的代价评估模型，从计算代价、通信代价和负载代价三个维度表征模型并行策略性能，构建综合代价评估模型，以指导算子放置策略的自动搜索和调优，如式(7)所示：A cost evaluation model that integrates computing cost, communication cost and load cost, characterizes the performance of the model parallel strategy from the three dimensions of computing cost, communication cost and load cost, and builds a comprehensive cost evaluation model to guide the automatic search and tuning of operator placement strategies , as shown in formula (7):

π_θ＝∑L_i+λ×Ω(W)×(1+f) (7)π _θ =∑L _i +λ×Ω(W)×(1+f) (7)

其中，π_θ表示放置策略的执行代价；权重占比λ基于经验调参得到；Ω为张量大小和通信时间的线性拟合模型；由于在实际环境中部分计算和通信时间往往是重叠的，所以利用权重占比λ，模拟计算和通信并行的情况。Among them, π _θ represents the execution cost of the placement strategy; the weight ratio λ is obtained based on empirical parameter adjustment; Ω is the linear fitting model of tensor size and communication time; since part of the calculation and communication time often overlap in the actual environment, Therefore, the weight ratio λ is used to simulate the parallel situation of computing and communication.

2-2，基于双种群遗传算法迭代更新放置策略2-2. Iteratively update placement strategy based on dual population genetic algorithm

首先利用随机或指定策略在解空间内生成两组初始解，即为计算图中的算子生成两组算子初始放置策略，本方法将每组算子放置策略视为一个种群；然后利用计算代价、通信代价和负载代价融合的代价评估模型，指导双种群遗传算法的基因突变方法和基因互换方法，迭代更新两组放置策略算子的放置信息，实现快速在解空间内搜索执行代价最小的放置策略。First, use random or specified strategies to generate two sets of initial solutions in the solution space, that is, to generate two sets of initial placement strategies for operators in the calculation graph. In this method, each set of operator placement strategies is regarded as a population; The cost evaluation model integrating cost, communication cost and load cost guides the gene mutation method and gene exchange method of the dual-population genetic algorithm, iteratively updates the placement information of the two sets of placement strategy operators, and realizes the fast search in the solution space with the smallest execution cost placement strategy.

采用随机初始化或指定策略初始化为算子生成初始放置策略。Use random initialization or specified strategy initialization to generate initial placement strategies for operators.

1)随机初始化方法是将计算图中的算子随机放置到设备集群中，以此得到初始的放置策略。其实现方法如算法1所示，其中random_device(D)表示从可用设备中随机选择一个设备d。placement(P,o,d)表示将算子o放置在设备d上，并将该算子的放置信息记录在策略P中。1) The random initialization method is to randomly place the operators in the calculation graph into the device cluster to obtain the initial placement strategy. The implementation method is shown in Algorithm 1, where random_device(D) means to randomly select a device d from the available devices. placement(P,o,d) means to place operator o on device d, and record the placement information of the operator in policy P.

2)基于指定策略的初始化方法利用现有的分布式并行策略搜索方法如Beachi框架中基于拓扑排序放置的m_topo方法、基于最早开始时间放置的m_etf方法、基于最小通信放置的m_sct方法进行初始化。其实现方法如算法2所示，其中，initial_method(G^*,D,T)表示利用指定的分布式并行策略搜索方法T，搜索计算图G^*在设备拓扑图D所描述的设备集群中上的放置策略。G′表示经过现有方法搜索后得到的带有算子放置信息的计算图，get_device(o)表示获取计算图中算子o的目标放置设备。placement(P,o,d)表示将算子o放置在设备d上，并将该算子的放置信息记录在策略P中。2) The initialization method based on the specified strategy uses the existing distributed parallel strategy search methods such as the m_topo method based on topological sorting placement in the Beachi framework, the m_etf method based on the earliest start time placement, and the m_sct method based on minimum communication placement for initialization. The implementation method is shown in Algorithm 2, where initial_method(G ^* ,D,T) indicates that the specified distributed parallel strategy search method T is used to search for the calculation graph G ^* in the device cluster described by the device topology diagram D. placement strategy. G' represents the computation graph with operator placement information obtained after searching by existing methods, and get_device(o) represents the target placement device for obtaining operator o in the computation graph. placement(P,o,d) means to place operator o on device d, and record the placement information of the operator in policy P.

基因突变方法包含三个步骤：算子对的选择、基因变异和迭代更新。The gene mutation method consists of three steps: operator pair selection, gene mutation and iterative update.

基因突变方法依据算子对的整体代价选择待变异的算子并通过改变算子的放置信息迭代更新放置策略从而进行策略的寻优。The gene mutation method selects the operator to be mutated according to the overall cost of the operator pair and iteratively updates the placement strategy by changing the placement information of the operator to optimize the strategy.

1)基于最大代价优先原则，在每组放置策略中选择整体代价最大的算子对，作为基因突变算子。1) Based on the principle of maximum cost priority, the operator pair with the largest overall cost is selected in each group of placement strategies as the gene mutation operator.

基因突变方法将计算图中存在数据依赖关系的两个算子称为一个算子对，如图2所示。利用基因上算子的计算代价和通信代价得到算子对的整体代价，并根据这个整体代价选择需要变异的算子对，算子对的整体代价计算方法如式(8)所示：The gene mutation method refers to two operators with data dependencies in the calculation graph as an operator pair, as shown in Figure 2. The overall cost of the operator pair is obtained by using the calculation cost and communication cost of the operator on the gene, and the operator pair that needs to be mutated is selected according to the overall cost. The calculation method of the overall cost of the operator pair is shown in formula (8):

其中，

表示算子对R_ij被选择过的次数，decay表示动态衰减率，L_i和L_j分别表示算子o_i和算子o_j的计算代价，S_i,j表示算子o_i和算子o_j之间的传输张量大小，Ω表示张量大小和通信时间的线性拟合模型，

表示算子对R_ij的整体代价。in,

Represents the number of times the operator pair R _ij has been selected, decay represents the dynamic decay rate, Li and L _j represent the computational cost of the operator o _i and the operator o _j respectively, S _i,j _represent the operator o _i and the operator o is the size of the transmission tensor between _j , Ω represents the linear fitting model of the size of the tensor and the communication time,

Represents the overall cost of the operator to R _ij .

由于在算法执行过程中算子对R可以被反复选中，为了增加整体代价小的算子对被选中的概率，利用动态衰减率decay来对被选中的算子对的整体代价进行动态衰减，算子对选择的具体实现如算法3所示：Since the operator pair R can be repeatedly selected during the execution of the algorithm, in order to increase the probability that the operator pair with the small overall cost is selected, the dynamic decay rate decay is used to dynamically attenuate the overall cost of the selected operator pair, and calculate The specific implementation of sub-pair selection is shown in Algorithm 3:

其中get_all_op_pair(G,P)表示依据计算图G和放置策略P获取所有算子对，caculate_cost(R)表示计算算子对R的整体代价；sort_by_cost(op_pair_list)表示按照算子对的整体代价对所有算子对进行降序排序；p为选中阈值。where get_all_op_pair(G,P) means to obtain all operator pairs according to the calculation graph G and placement strategy P, caculate_cost(R) means the overall cost of calculating operator pair R; sort_by_cost(op_pair_list) means to calculate all operator pairs according to the overall cost of the operator pair The operator pairs are sorted in descending order; p is the selection threshold.

2)利用基因变异方法通过改变基因突变算子的放置信息，生成新的放置策略。2) Using the gene mutation method to generate a new placement strategy by changing the placement information of the gene mutation operator.

本方法采用以下规则对选中算子的进行基因变异：This method uses the following rules to mutate the selected operator:

(1)若当前选中的算子对中两个算子放置在不同设备上，则随机选择一个可用设备将两个算子放置在该设备上。(1) If the two operators in the currently selected operator pair are placed on different devices, randomly select an available device and place the two operators on the device.

(2)若当前选中的算子对中算子放置在同一个设备上，则随机选择两个可用设备将两个算子放置这两个不同的设备上。(2) If the currently selected operator is placed on the same device, then randomly select two available devices and place the two operators on these two different devices.

利用上述两个规则可以在原有的放置策略的基础上生成新的放置策略。规则(1)如图3所示，若算子对选择方法选择了待变异算子对R₂₄，则首先随机选择一个可用的新设备0，然后将R₂₄＝{0,1}变异为R₂₄＝{2,2}，代表o₂算子和o₄算子从原先分别在设备0和设备1上，变异为将算子o₂和算子o₄同时放在设备2上，以此得到新的放置策略。同理，规则(2)如图3所示，若算子对选择方法在选择了算子对R₃₅，则首先随机选择两个可用的新设备设备2和设备1，然后将R₃₅＝{1,1}变异为R₃₅＝{2,0}，代表o₃算子和o₅算子从原先同时放置在设备1上，变异为将算子o₃和算子o₅分别放在设备2和设备0上，以此得到新的放置策略。Using the above two rules, a new placement strategy can be generated on the basis of the original placement strategy. Rule (1) As shown in Figure 3, if the operator pair selection method selects the operator pair to be mutated R ₂₄ , first randomly select an available new device 0, and then mutate R ₂₄ ={0,1} to R ₂₄ = {2, 2}, which means that the o ₂ operator and the o ₄ operator are changed from the original device 0 and device 1 respectively to the operator o ₂ and the operator o ₄ are placed on the device 2 at the same time, so that Get new placement strategies. Similarly, the rule (2) is shown in Fig. 3. If the operator pair selection method selects the operator pair R ₃₅ , it first randomly selects two available new devices, Device 2 and Device 1, and then sets R ₃₅ ={ 1,1} is mutated to R ₃₅ ={2,0}, which means that the o ₃ operator and the o ₅ operator are placed on the device 1 at the same time, and the operator o ₃ and the operator o ₅ are placed on the device respectively. 2 and device 0 to get a new placement strategy.

3)以计算代价、通信代价、负载代价融合的代价评估模型评估为指导，迭代更新放置策略3) Guided by the cost evaluation model evaluation of the fusion of computing cost, communication cost, and load cost, iteratively update the placement strategy

如图4所示，首先利用计算代价、通信代价和负载代价融合的代价评估模型分别计算原放置策略和新放置策略的代价；然后比较两组放置策略的代价大小，并选择代价较小的放置策略作为下一轮迭代的起始放置策略；最后将本轮基因突变中选中的算子对R作为当前放置策略的一个优质算子对，表示该算子对上算子的当前放置信息可以使得策略代价的变小。As shown in Figure 4, the cost of the original placement strategy and the new placement strategy are calculated separately by using the cost evaluation model of the fusion of computing cost, communication cost and load cost; then the cost of the two groups of placement strategies is compared, and the placement with the lower cost is selected. The strategy is used as the initial placement strategy of the next round of iteration; finally, the operator pair R selected in the current round of gene mutation is used as a high-quality operator pair of the current placement strategy, indicating that the current placement information of the operator on the operator pair can make The cost of the strategy becomes smaller.

基因突变方法的整体流程如算法4所示：The overall flow of the gene mutation method is shown in Algorithm 4:

其中choose_gene表示从种群中选择一个基因；genes_info表示指定基因上的算子放置信息；add_good_gene表示将基因作为优质基因；target_device表示选择目标放置设备；update_population表示更新指定种群的基因信息；cost_model为计算、通信、负载融合的代价评估模型计算得到指定种群的舒适度π_θ。Among them, choose_gene means to select a gene from the population; genes_info means the operator placement information on the specified gene; add_good_gene means to use the gene as a high-quality gene; target_device means to select the target placement device; update_population means to update the gene information of the specified population; cost_model means calculation and communication , the cost evaluation model of load fusion calculates the comfort degree π _θ of the specified population.

基因互换方法包含两个步骤：优质算子对交叉互换方法和迭代更新。The gene swap method consists of two steps: high-quality operator pair cross swap method and iterative update.

1)基于优质算子对交叉互换方法，交换两组放置策略的基因信息。1) Based on the high-quality operator pair cross-exchange method, the genetic information of the two groups of placement strategies is exchanged.

优质算子对交叉互换方法，将两组放置策略中各自的优质算子对相互赋予对方，即将各自的优质算子放置信息更新到对方的相同算子上，以此生成两组带有对方算子放置信息的新放置策略。The high-quality operator pair cross-exchange method is to assign the respective high-quality operator pairs in the two sets of placement strategies to each other, that is, update the respective high-quality operator placement information to the same operator of the other party, so as to generate two sets with each other. New placement strategy for operator placement information.

如图5所示，第一组放置策略的优质算子对组中保留着在基因突变过程中得到的优质算子对R₁₂＝{1,0}和R₃₅＝{2,0}。第二组放置策略的优质算子对组中保留着优质算子对R₂₄＝{2,1}和R₁₃＝{1,0}。优质算子对交叉互换方法将第一组放置策略的优质算子对R₁₂和R₃₅更新到第二组放置策略中，表示将第二组放置策略中算子o₁和算子o₂分别放置在设备1和设备0上，算子o₃和算子o₅分别放置在设备2和设备0上。同理，第二组放置策略也会将自己的优质算子对R₂₄和R₁₃更新到第一组放置策略中，表示将第一组放置策略的算子o₂和算子o₄分别放置在设备2和设备1上，算子o₁和算子o₃分别放置在设备1和设备0上。通过上述优质算子对交叉互换方法，以生成两个带有对方算子放置信息的新放置策略。As shown in FIG. 5 , the high-quality operator pairs of the first placement strategy retain the high-quality operator pairs R ₁₂ ={1,0} and R ₃₅ ={2,0} obtained during the gene mutation process. The high-quality operator pairs R ₂₄ ={2,1} and R ₁₃ ={1,0} remain in the high-quality operator pair group of the second set of placement strategies. The high-quality operator pair cross-exchange method updates the high-quality operator pairs R ₁₂ and R ₃₅ of the first set of placement strategies to the second set of placement strategies, which means that operator o ₁ and operator o ₂ in the second set of placement strategies are updated They are placed on device 1 and device 0, respectively, and operator o ₃ and operator o ₅ are placed on device 2 and device 0, respectively. Similarly, the second set of placement strategies will also update its own high-quality operator pairs R ₂₄ and R ₁₃ to the first set of placement strategies, which means that operators o ₂ and o ₄ of the first set of placement strategies are placed respectively. On device 2 and device 1, operator o ₁ and operator o ₃ are placed on device 1 and device 0, respectively. Through the above-mentioned high-quality operator pair cross-exchange method, two new placement strategies with placement information of each other's operators are generated.

2)以计算代价、通信代价、负载代价融合的代价评估模型评估为指导，迭代更新放置策略。2) Iteratively update the placement strategy under the guidance of the cost evaluation model evaluation of the fusion of computing cost, communication cost, and load cost.

与基因突变方法中的迭代更新一致，基因互换方法中的优胜劣汰，同样利用计算代价、通信代价和负载代价融合的代价评估模型，评判由基因组交叉互换产生的新旧种群的优劣并保留相对优质种群。此外为了保持兄弟种群中基因的多样性，防止在迭代的过程中兄弟种群的基因信息趋于一致，在每次发生基因互换后将清空双方种群的优质基因组。Consistent with the iterative update in the gene mutation method, the survival of the fittest in the gene swap method also uses a cost evaluation model that integrates computing cost, communication cost and load cost to judge the pros and cons of the old and new populations generated by genome crossover and retain the relative quality species. In addition, in order to maintain the diversity of genes in the sibling populations and prevent the genetic information of the sibling populations from converging during the iteration process, the high-quality genomes of both populations will be cleared after each gene exchange occurs.

基因互换方法的整体流程如算法5所示：The overall flow of the gene swap method is shown in Algorithm 5:

步骤3:当到达迭代终止条件后，分别将两组放置策略转化为数据流图的算子放置策略，并在真实环境下根据迭代得到的算子放置策略对模型进行训练，以选取执行时间最短的策略作为模型的最优分布式调度策略。Step 3: When the iteration termination condition is reached, the two sets of placement strategies are respectively transformed into the operator placement strategies of the data flow graph, and the model is trained according to the operator placement strategies obtained by iteration in the real environment, so as to select the shortest execution time. The strategy is used as the optimal distributed scheduling strategy of the model.

步骤4：利用得到的分布式调度策略，对待训练的AI模型进行分布式调度，并将图像识别、自然语言处理等模型所需的数据集输入到的神经网络中进行分布式训练。Step 4: Using the obtained distributed scheduling strategy, perform distributed scheduling of the AI model to be trained, and input the data sets required by the image recognition, natural language processing and other models into the neural network for distributed training.

根据上述步骤构建解空间以及基于双种群遗传算法求解的方法，本文提出的基于遗传算法的深度学习自适应训练方法具体流程如下：According to the above steps to construct the solution space and the method based on the dual population genetic algorithm, the specific process of the deep learning adaptive training method based on the genetic algorithm proposed in this paper is as follows:

(1)首先，基于计算图和设备拓扑图构建整数线性规划问题的解空间，并利用假设约束和算子分组约束简化解空间的规模。(1) First, the solution space of integer linear programming problem is constructed based on the computation graph and device topology graph, and the scale of the solution space is simplified by using assumption constraints and operator grouping constraints.

(2)其次，基于随机和指定策略的方法初始化双种群遗传算法所需的两组初始放置策略。并基于计算代价、通信代价和负载代价融合的代价评估模型指导基因突变方法和基因交换方法迭代更新两组放置策略。(2) Secondly, the method based on random and specified strategy initializes two sets of initial placement strategies required by the dual population genetic algorithm. The two groups of placement strategies are iteratively updated based on the cost evaluation model of the fusion of computational cost, communication cost and load cost to guide the gene mutation method and gene exchange method.

(3)然后，当达到迭代终止条件时，分别将两组迭代得到的放置策略转化为数据流图的算子放置策略，并在真实环境对模型进行训练，并选择两组放置策略中执行时间较短的策略作为模型的最优分布式调度策略。(3) Then, when the iteration termination condition is reached, the placement strategies obtained by the two sets of iterations are converted into the operator placement strategies of the data flow graph, and the model is trained in the real environment, and the execution time of the two sets of placement strategies is selected. The shorter policy is used as the optimal distributed scheduling policy for the model.

(4)最后，利用得到的分布式调度策略，对待训练的AI模型进行分布式调度，并将图像识别、自然语言处理等模型所需的数据集输入到的神经网络中进行分布式训练。(4) Finally, use the obtained distributed scheduling strategy to perform distributed scheduling of the AI model to be trained, and input the data sets required by the image recognition, natural language processing and other models into the neural network for distributed training.

本方法具体描述如算法6所示：The specific description of this method is shown in Algorithm 6:

其中，第1-2行为提取模型的计算图，并基于算子分组约束的消融方法对计算图进行简化。Among them, rows 1-2 extract the computational graph of the model, and simplify the computational graph based on the ablation method constrained by operator grouping.

第3行为利用算法1和算法2所述的放置策略初始化方法在解空间内生成两组初始放置策略P1，P2。The third behavior uses the placement strategy initialization method described in Algorithm 1 and Algorithm 2 to generate two sets of initial placement strategies P1 and P2 in the solution space.

第4-11行利用双种群遗传算法对两组放置策略进行迭代更新，其中op_pair_swap表示算法5描述的基于基因交换的迭代方法，op_pair_variation表示算法4描述的基于基因突变的迭代方法，finish_condition表示迭代结束的条件，如当两组初始放置策略在多轮迭代中不再发生变化。因此，当到达指定迭代次数或者满足迭代结束条件时，则终止双种群遗传算法的搜索过程。Lines 4-11 use the dual population genetic algorithm to iteratively update the two groups of placement strategies, where op_pair_swap represents the iterative method based on gene swap described in Algorithm 5, op_pair_variation represents the iterative method based on gene mutation described in Algorithm 4, and finish_condition represents the end of the iteration conditions, such as when the two sets of initial placement strategies no longer change over multiple iterations. Therefore, when the specified number of iterations is reached or the iteration end condition is met, the search process of the dual-population genetic algorithm is terminated.

第12行对迭代后的两组放置策略进行转换，将其转化为数据流图的算子放置策略。Line 12 transforms the two sets of placement strategies after iteration into the operator placement strategies of the data flow graph.

第13-20行将两组算子放置策略在真实的执行环境下执行，并将执行时间短的那组放置策略作为模型并行算子放置的最优策略。Lines 13-20 execute the two sets of operator placement strategies in the real execution environment, and use the set of placement strategies with short execution time as the optimal strategy for model-parallel operator placement.

第21-22行利用得到的分布式调度策略对模型进行分布式调度，并将数据集输入模型进行分布式训练，并返回训练好的模型。Lines 21-22 use the obtained distributed scheduling strategy to perform distributed scheduling of the model, input the dataset into the model for distributed training, and return the trained model.

基于本文方法可以加速图像识别、自然语言处理等人工智能领域下的神经网络模型的训练速度，其和单GPU、Beachi框架的对比实验结果表1所示：Based on the method in this paper, the training speed of neural network models in artificial intelligence fields such as image recognition and natural language processing can be accelerated.

表1 单GPU、Beachi和双种群遗传算法策略执行性能对比Table 1 Performance comparison of single-GPU, Beachi and dual-population genetic algorithm strategies

本发明和单GPU及Beachi框架的m_topo、m_etf、m_sct方法的实验结果如表1所示，表中GA(random/m_topo/m_etf/m_sct)的random表示使用随机初始化方法初始化两组放置策略以及m_topo、m_etf、m_sct表示使用指定策略初始化，既本发明所需的两组初始放置策略中的一组选择m_topo或m_etf或m_sct方法进行初始化，另外一组实验随机初始化方法进行初始化。GA是本发明的实验结果，由表1可以看出，本发明搜索到的策略普遍优于Baechi的三种策略搜索算法，与m_topo算法相比，最高可提升8.9％；与m_etf相比，最高可提升42.2％；与m_sct相比，最高可提升40.4％。遗传算法采用指定策略初始化方法进行放置策略的初始化，并在其基础上迭代更新以此得到的策略对比原策略的单步执行时间最高可提升9％。此外，Baechi在批大小较大的情况下会遇到OOM(内存溢出)使得策略无法执行，而双种群遗传算法可以控制集群中设备的负载均衡防止出现OOM的情况。The experimental results of the present invention and the m_topo, m_etf, m_sct methods of the single GPU and Beachi framework are shown in Table 1. The random of GA (random/m_topo/m_etf/m_sct) in the table indicates that the random initialization method is used to initialize two groups of placement strategies and m_topo , m_etf and m_sct represent initialization using a specified strategy, that is, one of the two groups of initial placement strategies required by the present invention selects m_topo or m_etf or m_sct for initialization, and another group of experimental random initialization methods are initialized. GA is the experimental result of the present invention. It can be seen from Table 1 that the strategy searched by the present invention is generally better than the three strategy search algorithms of Baechi. Compared with the m_topo algorithm, it can be improved by up to 8.9%; It can be improved by 42.2%; compared with m_sct, it can be improved by up to 40.4%. The genetic algorithm uses the specified strategy initialization method to initialize the placement strategy, and iteratively updates the strategy based on it. Compared with the original strategy, the single-step execution time of the obtained strategy can be improved by up to 9%. In addition, Baechi will encounter OOM (out-of-memory) when the batch size is large, which makes the strategy impossible to execute, while the dual-population genetic algorithm can control the load balancing of devices in the cluster to prevent OOM.

Claims

1. a neural network adaptive distributed parallel training method based on genetic algorithm, is characterized in that comprising the following steps:

Step 1: Transform the operator placement problem in the neural network model into an integer linear programming problem; build a solution space based on the data flow graph and device topology graph of the neural network model, and simplify the scale of the solution space by setting assumption constraints and operator grouping constraints :

1-1. Extract the structure of the neural network model and set up a device resource group, and abstractly obtain the computation graph G(O, E) and the device topology graph D. In the computation graph G(O, E), the vertex O represents the computation of the neural network model. Subset, E represents the set of directed edges between vertices, and the device topology graph D is composed of M computing devices;

1-2, given the calculation graph G(O, E) and the device topology graph D, on the premise of satisfying the device memory constraints, find a set of placement strategies to minimize the execution cost, so that the operators of the neural network model are placed on the device. The placement problem is transformed into an integer linear programming problem, and the solution space is constructed based on the data flow graph and device topology graph;

1-3, by analyzing the device characteristics in the homogeneous cluster and the execution characteristics of the neural network model, two assumptions are made: the computing and communication performance of homogeneous devices are the same; the execution end time of operators is only related to operators with direct dependencies. ;

Through the above assumptions, the placement of two operators with data dependencies is constrained to two: one is that the two operators are placed on the same device, and the other is that the two operators are placed on different devices;

1-4, the operator grouping constraint is then proposed. If there are two operators with data dependencies, if one of the operators satisfies the condition that the out-degree or in-degree is 1, the two operators are constrained to the same group, and Scheduling the operators in the group as a whole;

Step 2: Build a cost evaluation model that integrates computing cost, communication cost and load cost, comprehensively evaluates the pros and cons of the solution of the integer linear programming problem, that is, the execution performance of the operator placement strategy, and guides the dual population genetic algorithm in the two placement strategies. During the period, gene mutation and gene exchange are performed iteratively to update two sets of placement strategies:

2-1, build a cost evaluation model that integrates computing cost, communication cost and load cost

Firstly, the key factors affecting the execution performance of the neural network model are analyzed, and the real execution time of the operators in the neural network model on the device is used to establish the calculation cost _Li ; the total tensor size of the cross-device communication between operators in the neural network model is used. , establish the communication cost W; use the sum of the ratio of the memory load offset and the average load of each device to establish the load cost f;

Secondly, by comprehensively considering the characteristics of the calculation cost, communication cost, load cost and the actual training process of the model, a cost evaluation model integrating the calculation cost, communication cost and load cost is established:

π _θ =(∑L _i +λ×Ω(W))×(1+f)

Among them, π _θ represents the execution cost of the placement strategy, λ represents the weight ratio, and Ω represents the linear fitting model of tensor size and communication time;

2-2. Iteratively update placement strategy based on dual population genetic algorithm

First, using the initialization method of the placement strategy, two sets of placement strategies with initial operator placement information are generated in the solution space;

Secondly, based on the cost evaluation model, the gene mutation and gene exchange methods of the dual population genetic algorithm are guided, and the placement information of the operators in the two placement strategies is updated iteratively;

Step 3: When the termination condition of the iterative update of the placement strategy is reached, the two sets of placement strategies are respectively transformed into the operator placement strategies of the data flow graph, and the neural network model is trained according to the iteratively obtained operator placement strategies in the real environment. , select the strategy with the shortest execution time as the optimal distributed scheduling strategy of the neural network model;

Step 4: Use the optimal distributed scheduling strategy obtained in Step 3 to perform distributed scheduling of the neural network model to be trained, and input the data set required by the neural network model into the neural network model for distributed training.

2. a kind of neural network adaptive distributed parallel training method based on genetic algorithm according to claim 1, is characterized in that: the computation graph G (O, E) described in step 1-1 is by the neural network model Data flow graph abstraction is obtained.

3. a kind of neural network adaptive distributed parallel training method based on genetic algorithm according to claim 1, is characterized in that: the key factor described in step 2-1 is computing performance, communication performance and memory load of equipment Characteristics.

4. a kind of neural network adaptive distributed parallel training method based on genetic algorithm according to claim 1, is characterized in that: initialization method described in step 2-2 comprises random initialization method and specified strategy initialization method;

The random initialization method refers to assigning the operators in the calculation graph G(O, E) to the device topology graph D in a random manner;

The specified strategy initialization method refers to the placement strategy searched by the existing parallel strategy search method as the initial state;

Choose an initialization method to initialize the placement strategy.

5. a kind of neural network self-adaptive distributed parallel training method based on genetic algorithm according to claim 1, is characterized in that: the termination condition of iterative update described in step 3 comprises that total number of iterations reaches specified threshold and placement strategy No change will occur within a certain number of iterations;

When the iterative process of the dual population genetic algorithm satisfies one of the above two conditions, the iteration is terminated.

6. a kind of neural network adaptive distributed parallel training method based on genetic algorithm according to claim 1 is characterized in that: the gene mutation method described in step 2-2 comprises three steps:

(1) Based on the principle of maximum cost priority, select the operator pair with the largest overall cost in each group of placement strategies as the gene mutation operator;

(2) Using the gene mutation method to generate a new placement strategy by changing the placement information of the gene mutation operator;

(3) Guided by the evaluation of the cost evaluation model, iteratively update the placement strategy;

The gene swap method consists of two steps:

(1) Based on the cross-exchange method of high-quality operator pairs, the genetic information of the two groups of placement strategies is exchanged;

(2) Guided by the evaluation of the cost evaluation model, the placement strategy is updated iteratively.

7. a kind of neural network adaptive distributed parallel training method based on genetic algorithm according to claim 6, is characterized in that: described overall cost is calculated as follows:

in,

Represents the number of times the operator pair R _ij has been selected, decay represents the dynamic decay rate, Li and L _j represent the computational cost of the operator o _i and the operator o _j respectively, S _i,j _represent the operator o _i and the operator o transfer tensor size between _j ,

Represents the overall cost of the operator to R _ij .

8. a kind of neural network adaptive distributed parallel training method based on genetic algorithm according to claim 6, is characterized in that: the rule of described gene variation is as follows:

(1) If the two operators in the currently selected operator pair are placed on different devices, then randomly select an available device and place the two operators on the device;

(2) If the currently selected operator is placed on the same device, then randomly select two available devices and place the two operators on these two different devices.