
CN119728454B - Network traffic simulation method, device, equipment and medium for large model training


Info

Publication number
CN119728454B
Authority
CN
China
Prior art keywords
training
network
model
communication
configuration information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510223771.5A
Other languages
Chinese (zh)
Other versions
CN119728454A (en)
Inventor
辛奇
王晓湘
刘贺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202510223771.5A
Publication of CN119728454A
Application granted
Publication of CN119728454B
Legal status: Active
Anticipated expiration

Links

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the field of computer technology and provides a network traffic simulation method, device, equipment and medium for large model training. The method comprises: obtaining configuration information of a user; defining a network topology and training parameters of a large model to be trained according to the configuration information; generating a communication load matrix based on the training parameters and the network topology, the communication load matrix being used to characterize the computing time and data transmission requirements of each network node in the network topology; and performing traffic simulation according to the communication load matrix to simulate the network traffic during the model training process of the large model to be trained. Because the network topology and the model training parameters are defined from the user's configuration information, the user can flexibly adjust the network structure to simulate the network traffic of training clusters of different scales and structures, which improves the flexibility and applicability of simulating the network traffic of model training.

Description

Network traffic simulation method, device, equipment and medium for large model training

Technical Field

The present invention relates to the field of computer technology, and in particular to a network traffic simulation method, device, equipment and medium for large model training.

Background Art

With the rapid development of deep learning and artificial intelligence, training large-scale machine learning models has become a core requirement for driving technological progress across industries. Such models typically contain hundreds of millions or even hundreds of billions of parameters, and their training relies on distributed computing clusters. In distributed training, network communication plays a key role: it is responsible for data synchronization between nodes and the transfer of model parameters. Efficient network communication requires a network topology with high bandwidth, low latency and good load balancing, so that data transmission during training proceeds in a timely and stable manner.

In the prior art, general-purpose network simulation tools are used to simulate distributed training. They have achieved some success in evaluating network performance in distributed training environments, but still fall significantly short of the specific requirements of large-scale model training. This is mainly reflected in the definition and configuration of the network topology, which users find difficult to adjust flexibly, limiting applicability to training clusters of different scales.

Summary of the Invention

The present invention provides a network traffic simulation method, device, equipment and medium for large model training, which are intended to remedy the defect in the prior art that, when general-purpose network simulation tools are used to simulate distributed training, users cannot flexibly adjust the definition and configuration of the network topology to meet the specific needs of large-scale model training, which limits the applicability of network simulation to training clusters of different scales.

The present invention provides a network traffic simulation method for large model training, comprising the following steps:

obtaining configuration information of a user;

defining a network topology and training parameters of a large model to be trained according to the configuration information;

generating a communication load matrix based on the training parameters and the network topology, the communication load matrix being used to characterize the computing time and data transmission requirements of each network node in the network topology; and

performing traffic simulation according to the communication load matrix to simulate the network traffic during the model training process of the large model to be trained.

According to the network traffic simulation method for large model training provided by the present invention, defining the network topology and the training parameters of the large model to be trained according to the configuration information comprises:

determining the number of nodes according to network configuration information in the configuration information, and setting network nodes according to the number of nodes, the network configuration information including node configuration information;

configuring key parameters of the network nodes based on the node configuration information to define the network topology, the key parameters including the connection mode, link bandwidth and link delay between the network nodes; and

defining the training parameters of the large model to be trained according to model configuration information in the configuration information, the training parameters including model parameter scale, training parallelism strategy and training batch size.

According to the network traffic simulation method for large model training provided by the present invention, the training parameters further include hidden layer size, number of model layers, number of attention heads of the attention mechanism, hidden layer size of the feed-forward network, vocabulary size and sequence length;

the training parallelism strategies include data parallelism, tensor parallelism and pipeline parallelism; and

the training batch size includes a global batch size and a micro-batch size.

According to the network traffic simulation method for large model training provided by the present invention, generating the communication load matrix based on the training parameters and the network topology comprises:

determining, based on the training parameters and the network topology, the traffic pattern of model parameter synchronization and gradient transmission during the training process of the large model to be trained; and

generating the communication load matrix according to the traffic pattern, the communication load matrix including forward computation duration, forward communication duration and backward computation duration.

According to the network traffic simulation method for large model training provided by the present invention, performing traffic simulation according to the communication load matrix to simulate the network traffic during the model training process of the large model to be trained comprises:

determining a set of computation processes and a set of communication processes according to the communication load matrix, the set of computation processes including a set of forward computation processes and a set of backward computation processes;

inputting the set of computation processes and the set of communication processes into a global pipeline scheduler for allocation, the global pipeline scheduler determining the pipeline stage according to the number of forward executions and the number of backward executions of the network nodes; and

executing the target communication process corresponding to the pipeline stage to simulate the network traffic during the model training process of the large model to be trained.

According to the network traffic simulation method for large model training provided by the present invention, executing the target communication process corresponding to the pipeline stage comprises:

obtaining a preset algorithm set and selecting a target algorithm combination from the algorithm set, the algorithm set including multiple algorithm combinations, each algorithm combination being obtained by combining a congestion control algorithm and a load balancing algorithm, and the target algorithm combination being any one of the multiple algorithm combinations;

executing the target communication process corresponding to the pipeline stage based on the target algorithm combination; and

returning to and executing the step of selecting a target algorithm combination from the algorithm set, until the target algorithm combination is the last one in the algorithm set.

According to the network traffic simulation method for large model training provided by the present invention, after executing the target communication process corresponding to the pipeline stage, the method further comprises:

collecting performance indicators during the target communication process, the performance indicators including network performance indicators and training performance indicators, the network performance indicators including link utilization, end-to-end delay and network throughput, and the training performance indicators including single training iteration time, training efficiency and communication overhead ratio; and

evaluating the network topology and the training process of the large model to be trained according to the performance indicators to obtain performance evaluation information.

The present invention further provides a network traffic simulation device for large model training, comprising the following modules:

an information acquisition module, configured to obtain configuration information of a user;

a simulation configuration module, configured to define a network topology and training parameters of a large model to be trained according to the configuration information;

a load generation module, configured to generate a communication load matrix based on the training parameters and the network topology, the communication load matrix being used to characterize the computing time and data transmission requirements of each network node in the network topology; and

a simulation module, configured to perform traffic simulation according to the communication load matrix to simulate the network traffic during the model training process of the large model to be trained.

The present invention further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the network traffic simulation method for large model training described in any one of the above.

The present invention further provides a non-transitory computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the network traffic simulation method for large model training described in any one of the above.

The present invention further provides a computer program product comprising a computer program which, when executed by a processor, implements the network traffic simulation method for large model training described in any one of the above.

With the network traffic simulation method, device, equipment and medium for large model training provided by the present invention, the network topology and the training parameters of the large model to be trained are defined from the user's configuration information, a communication load matrix is generated based on the network topology and the training parameters, and traffic simulation is performed according to the communication load matrix to simulate the network traffic during the model training process of the large model to be trained. This realizes simulation of the network traffic of the model training process, so that the network configuration of the large model training process can be optimized and adjusted according to the simulation results. Because the network topology and the model training parameters are defined from the user's configuration information, the user can flexibly adjust the network structure to simulate the network traffic of training clusters of different scales and structures, which improves the flexibility and applicability of simulating the network traffic of model training.

Brief Description of the Drawings

In order to illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flow chart of the network traffic simulation method for large model training provided by the present invention.

FIG. 2 is a schematic diagram of the network traffic simulation flow provided by the present invention.

FIG. 3 is a schematic structural diagram of the network traffic simulation device for large model training provided by the present invention.

FIG. 4 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Description

In order to make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

The embodiments of the present invention provide a network traffic simulation method, device, equipment and medium for large model training, which can remedy the shortcomings of existing general-purpose network simulation tools in simulating the network traffic of large-scale model training, namely inflexible network topology configuration, imprecise simulation of specific traffic patterns, insufficient dimensions of performance indicator analysis, and limited support for algorithm testing.

Specifically, an embodiment of the present invention provides a network traffic simulation method for large model training. FIG. 1 is a flow chart of the network traffic simulation method for large model training provided by the present invention. As shown in FIG. 1, the method includes the following steps:

Step 100: obtain configuration information of a user;

Step 200: define a network topology and training parameters of a large model to be trained according to the configuration information;

Step 300: generate a communication load matrix based on the training parameters and the network topology, the communication load matrix being used to characterize the computing time and data transmission requirements of each network node in the network topology;

Step 400: perform traffic simulation according to the communication load matrix to simulate the network traffic during the model training process of the large model to be trained.

The configuration information of the user is obtained, and the network topology and the training parameters of the large model to be trained are defined according to it; the configuration information includes network configuration information for the network structure and training configuration information for the large model to be trained.

Further, when defining the network topology and the training parameters of the large model to be trained, the network topology is defined according to the network configuration information, and the training parameters are defined according to the training configuration information.

The network topology is the network structure that supports network communication during the distributed training of the large model to be trained; it is responsible for data synchronization between computing nodes and the transfer of model parameters. The network topology includes multiple computing nodes, also called network nodes or communication nodes. The data transmission requirements of each network node are determined by the training process of the large model to be trained, and this training process is in turn determined by the training parameters.

A communication load matrix is generated based on the training parameters and the network topology; it is used to characterize the computing time and data transmission requirements of each network node in the network topology. The computing time of a network node includes forward computation time, communication duration and backward computation time: the forward computation time corresponds to the forward pass of the model training process, and the backward computation time corresponds to the backward pass. Optionally, the generated communication load matrix covers the communication load between every pair of network nodes in the network topology.

The forward and backward computation times are obtained by abstractly modeling the amount of model computation, and the communication duration is obtained by modeling the bandwidth and delay of the network links.
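
By way of illustration only, a common abstraction of this kind (the exact cost model is not prescribed by this embodiment) expresses the computation time of a node and the communication duration of a link as:

```latex
t_{\mathrm{comp}} = \frac{\mathrm{FLOPs}}{\mathrm{node\ throughput}},
\qquad
t_{\mathrm{comm}} = t_{\mathrm{link\ delay}} + \frac{\mathrm{message\ size}}{\mathrm{link\ bandwidth}}
```

so that the forward and backward computation durations follow from the per-node operation count, and the communication duration from the delay and bandwidth of the traversed link.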

Further, the forward pass mainly includes data preparation, model application, layer-by-layer computation and result output, and the backward pass mainly includes loss computation, gradient computation, parameter update and the iteration loop.

Data preparation is the preprocessing of the sample data required for model training, such as normalization and format conversion. Model application is the passing of the training sample data obtained in the data preparation stage to the model. Layer-by-layer computation means that each layer of the model processes its input data and generates intermediate outputs. Result output refers to the process of obtaining predicted values when the model performs tasks such as classification or regression; the output of the model is usually one or more vectors representing the model's predictions.

Further, loss computation takes place after the forward pass: the network computes a loss value, i.e. the error, from the model output and the ground-truth labels, and this loss value measures the gap between the model's predictions and the true values. Gradient computation uses the loss obtained from the forward pass to compute the gradients of all model parameters, i.e. the derivatives of the loss with respect to the parameters; this process is backpropagation. Parameter update, after the gradient of each model parameter has been computed, updates the model weights and biases via gradient descent; the basic principle of gradient descent is that if the gradient of the loss (i.e. the error) points in some direction, the model parameters are adjusted in the opposite direction to reduce the loss. In the iteration loop, the model repeatedly performs forward propagation, loss computation, error backpropagation and updates of the weights and biases; one complete forward-and-backward pass constitutes an iteration, and training typically runs through many iterations (epochs over the data) until the model converges to a low loss value or a preset maximum number of iterations is reached.
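
The gradient-descent update described above can be written, in standard notation (the symbols are generic and are not introduced elsewhere in this description), as:

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} \mathcal{L}(\theta_t)
```

where \(\theta_t\) denotes the model weights and biases at iteration \(t\), \(\eta\) is the learning rate, and \(\mathcal{L}\) is the loss.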

Traffic simulation is performed according to the generated communication load matrix to simulate the network traffic during the model training process of the large model to be trained. The traffic simulation process includes traffic generation and injection, data collection and analysis, and result verification and optimization.

In the traffic generation and injection stage, data packets and their sending intervals are generated according to the data synchronization and parameter transfer processes of the training of the large model to be trained, and the packets are sent at those intervals, thereby simulating the network traffic of data synchronization and parameter transfer during training.
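
As a minimal sketch of this stage (the field names are illustrative assumptions, not taken from this embodiment), one flow of the communication load matrix could be turned into packets and sending intervals as follows:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    src: int          # sending node id
    dst: int          # receiving node id
    size_bytes: int   # payload size
    send_time: float  # injection time in seconds


def generate_packets(flow_bytes: int, src: int, dst: int,
                     start: float, mtu: int = 4096,
                     interval: float = 1e-6) -> list[Packet]:
    """Split one flow (e.g. a gradient or activation transfer) into
    fixed-size packets injected at a constant sending interval."""
    packets = []
    t = start
    remaining = flow_bytes
    while remaining > 0:
        size = min(mtu, remaining)
        packets.append(Packet(src, dst, size, t))
        remaining -= size
        t += interval
    return packets
```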

In the data collection and analysis stage, the performance indicators of the training process of the large model to be trained are collected to determine whether the network performance and model performance meet the requirements, so that the network configuration can be adjusted according to the simulation results during the actual training of the large model.

In this embodiment, the network topology and the training parameters of the large model to be trained are defined from the user's configuration information, a communication load matrix is generated based on them, and traffic simulation is performed according to the communication load matrix to simulate the network traffic of the model training process, so that the network configuration of the large model training process can be optimized and adjusted according to the simulation results. Because the network topology and the model training parameters are defined from the user's configuration information, the user can flexibly adjust the network structure to simulate the network traffic of training clusters of different scales and structures, which improves the flexibility and applicability of the simulation.

Optionally, the user's configuration information is determined according to the user's requirements for large model training and includes network configuration information and model configuration information; the network configuration information is used to configure the network topology, and the model configuration information is used to configure the training parameters of the large model to be trained. On this basis, step 200 may further include:

Step 201: determine the number of nodes according to the network configuration information in the configuration information, and set the network nodes according to the number of nodes; the network configuration information includes node configuration information;

Step 202: configure the key parameters of the network nodes based on the node configuration information to define the network topology; the key parameters include the connection mode, link bandwidth and link delay between the network nodes;

Step 203: define the training parameters of the large model to be trained according to the model configuration information in the configuration information; the training parameters include model parameter scale, training parallelism strategy and training batch size.

The network configuration information contains the number of nodes and the node configuration information of each node. When defining the network topology, the number of nodes is determined according to the network configuration information, and the network nodes are then set according to that number. The network configuration information also includes the node configuration information of each configured node, and the key parameters of each network node are configured based on this node configuration information, thereby defining the network topology.

Optionally, the key parameters of a network node include at least the connection mode, link bandwidth and link delay between network nodes. The network topology includes multiple network nodes, and the key parameters of different network nodes may be the same or different.

Further, the training parameters of the large model to be trained are defined according to the model configuration information in the configuration information; the training parameters include model parameter scale, training parallelism strategy and training batch size.

Optionally, the training parameters further include the hidden layer size, the number of model layers, the number of attention heads of the attention mechanism, the hidden layer size of the feed-forward network, the vocabulary size and the sequence length. The training parallelism strategies include data parallelism, tensor parallelism and pipeline parallelism, and the training batch size includes the global batch size and the micro-batch size.
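
Purely as an illustration of the configuration information described above (the field names and default values are assumptions, not definitions from this embodiment), a possible representation is:

```python
from dataclasses import dataclass

@dataclass
class NetworkConfig:
    num_nodes: int                  # number of network (compute) nodes
    topology: str = "fat-tree"      # connection mode between nodes
    link_bandwidth_gbps: float = 100.0
    link_delay_ms: float = 0.005    # configurable, e.g. 0.001 ms .. 10 ms

@dataclass
class ModelConfig:
    num_parameters: int             # model parameter scale
    hidden_size: int
    num_layers: int
    num_attention_heads: int
    ffn_hidden_size: int
    vocab_size: int
    seq_length: int
    data_parallel: int = 1          # training parallelism strategy (DP/TP/PP)
    tensor_parallel: int = 1
    pipeline_parallel: int = 1
    global_batch_size: int = 1024
    micro_batch_size: int = 1

@dataclass
class UserConfig:
    network: NetworkConfig
    model: ModelConfig
```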

General-purpose network simulation tools lack sufficient accuracy and flexibility when simulating the complex, specific communication patterns of large model training, such as parameter synchronization and gradient transmission, so their simulation results do not fully reflect the network communication requirements of actual training. In this embodiment, the communication load matrix is generated according to the traffic pattern of parameter synchronization and gradient transmission during training, so that the network traffic simulation reflects the network communication requirements of the actual model training process.

On this basis, step 300 may further include:

Step 301: based on the training parameters and the network topology, determine the traffic pattern of model parameter synchronization and gradient transmission during the training process of the large model to be trained;

Step 302: generate the communication load matrix according to the traffic pattern; the communication load matrix includes the forward computation duration, the forward communication duration and the backward computation duration.

Based on the training parameters of the large model to be trained and the defined network topology, the traffic pattern of model parameter synchronization and gradient transmission during training is determined, and the communication load matrix is then generated according to this traffic pattern.

The generated communication load matrix includes the forward computation duration, the forward communication duration and the backward computation duration of each flow. Generated from the network topology and the training parameters of the large model to be trained, it represents the computation duration of each network node and the data transmission requirements between network nodes, reflecting the specific traffic pattern of model parameter synchronization and gradient transmission during model training.
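
A minimal sketch of this generation step, assuming a simple FLOPs-based compute model and a bandwidth-delay communication model and reusing the illustrative UserConfig sketch above (the constants and formulas are illustrative, not those of this embodiment):

```python
from dataclasses import dataclass

@dataclass
class FlowLoad:
    src: int
    dst: int
    fwd_compute_s: float   # forward computation duration
    fwd_comm_s: float      # forward communication duration
    bwd_compute_s: float   # backward computation duration


def build_load_matrix(cfg: "UserConfig", node_tflops: float = 300.0) -> list[FlowLoad]:
    """Derive per-flow compute/communication durations for the
    point-to-point pipeline-parallel flows of one micro-batch."""
    m, n = cfg.model, cfg.network

    # rough forward FLOPs per pipeline stage: 2 * params_per_stage * tokens
    params_per_stage = m.num_parameters / m.pipeline_parallel
    fwd_flops = 2 * params_per_stage * m.micro_batch_size * m.seq_length
    fwd_t = fwd_flops / (node_tflops * 1e12)
    bwd_t = 2 * fwd_t  # backward pass assumed roughly twice the forward cost

    # activation tensor handed to the next pipeline stage (fp16 = 2 bytes)
    act_bytes = 2 * m.micro_batch_size * m.seq_length * m.hidden_size
    comm_t = n.link_delay_ms / 1e3 + act_bytes * 8 / (n.link_bandwidth_gbps * 1e9)

    return [FlowLoad(src=stage, dst=stage + 1,
                     fwd_compute_s=fwd_t, fwd_comm_s=comm_t, bwd_compute_s=bwd_t)
            for stage in range(m.pipeline_parallel - 1)]
```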

Optionally, when simulating the network traffic of the model training process of the large model to be trained, a global pipeline scheduler is used to allocate the forward computation processes and the communication processes of the training. Step 400 may further include:

Step 401: determine a set of computation processes and a set of communication processes according to the communication load matrix; the set of computation processes includes a set of forward computation processes and a set of backward computation processes;

Step 402: input the set of computation processes and the set of communication processes into a global pipeline scheduler for allocation; the global pipeline scheduler determines the pipeline stage according to the number of forward executions and the number of backward executions of the network nodes;

Step 403: execute the target communication process corresponding to the pipeline stage to simulate the network traffic during the model training process of the large model to be trained.

When simulating network traffic, the set of computation processes and the set of communication processes are determined according to the generated communication load matrix and input into the global pipeline scheduler for allocation; the scheduler determines the pipeline stage according to the number of forward executions and the number of backward executions of the network nodes.

The set of computation processes includes a set of forward computation processes and a set of backward computation processes, as well as the execution state of each network node, which indicates whether the node's next step is a forward or a backward communication. The global pipeline scheduler determines the number of forward executions and backward executions of each network node from the set of computation processes and thereby determines the pipeline stage. The number of forward executions of a network node can be determined from the number of forward computation processes it has executed, and the number of backward executions from the number of backward computation processes it has executed.

Further, the target communication process corresponding to the pipeline stage is executed to simulate the network traffic of the model training process of the large model to be trained. The target communication process is the communication process in the set of communication processes that corresponds to the pipeline stage.

Optionally, the forward computation set contains the forward computation process of each flow, and the communication process set contains the communication process of each flow; the global pipeline scheduler allocates the forward computation processes and the communication processes in order, following the model training procedure of the large model to be trained. The pipeline stages include data loading, warm-up, one-forward-one-backward, cool-down and gradient synchronization.
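
A sketch of how such a scheduler might map a node's forward and backward execution counts onto these stages under a standard one-forward-one-backward (1F1B) schedule (the thresholds below follow a common convention and are an assumption, not taken from this embodiment):

```python
from enum import Enum, auto

class Stage(Enum):
    DATA_LOADING = auto()
    WARMUP = auto()
    ONE_F_ONE_B = auto()
    COOLDOWN = auto()
    GRAD_SYNC = auto()


def pipeline_stage(fwd_done: int, bwd_done: int,
                   stage_rank: int, num_stages: int,
                   num_microbatches: int) -> Stage:
    """Classify a node's position in the 1F1B pipeline from its
    forward/backward execution counts."""
    warmup = num_stages - stage_rank - 1      # forwards issued before steady state
    if fwd_done == 0 and bwd_done == 0:
        return Stage.DATA_LOADING
    if fwd_done < warmup:
        return Stage.WARMUP
    if bwd_done == num_microbatches:
        return Stage.GRAD_SYNC
    if fwd_done == num_microbatches:          # no forwards left, drain backwards
        return Stage.COOLDOWN
    return Stage.ONE_F_ONE_B
```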

General-purpose network simulation tools are built around generic communication protocols: they do not support customizing the training parameters or batching of large models, they struggle to capture the high-frequency, small-payload communication characteristic of large model training, and they lack built-in support for training-related performance indicators. Simulating the specific communication requirements of large-scale training therefore usually requires secondary development and customization, which increases the difficulty and development cost of network traffic simulation. The network traffic of large model training is shaped by the training parallelism strategy and exhibits particular collective communication and traffic patterns that general-purpose simulation tools cannot reproduce. Moreover, after the traffic has been simulated, such tools cannot test network tuning algorithms such as congestion control or load balancing algorithms; their performance analysis is mainly focused on the network level, lacks a comprehensive evaluation of the training process, and offers only limited testing and evaluation of different tuning algorithms, which limits the optimization of the network transport layer through traffic simulation.

On this basis, the network traffic simulation method provided by the embodiments of the present invention can test and evaluate different algorithms during the simulation: when executing the communication processes, a congestion control algorithm and a load balancing algorithm are applied. Specifically, in step 403, executing the target communication process corresponding to the pipeline stage may further include:

Step 413: obtain a preset algorithm set and select a target algorithm combination from the algorithm set; the algorithm set includes multiple algorithm combinations, each of which is obtained by combining a congestion control algorithm and a load balancing algorithm; the target algorithm combination is any one of the multiple algorithm combinations;

Step 423: execute the target communication process corresponding to the pipeline stage based on the target algorithm combination;

Step 433: return to and execute the step of selecting a target algorithm combination from the algorithm set, until the target algorithm combination is the last one in the algorithm set.

When executing the communication processes to simulate the network traffic of the model training process, a preset algorithm set is first obtained; the set includes multiple algorithm combinations, and each combination is obtained from at least one congestion control algorithm and one load balancing algorithm. A target algorithm combination, which is any one of the combinations in the set, is then selected from the algorithm set.

In one embodiment, based on several different congestion control algorithms and several different load balancing algorithms, the congestion control algorithms and the load balancing algorithms are combined pairwise to obtain multiple algorithm combinations, forming the algorithm set.

Further, the target communication process corresponding to the pipeline stage is executed based on the selected target algorithm combination, and the step of selecting a target algorithm combination from the algorithm set is executed again, until the selected target algorithm combination is the last one in the algorithm set.

Optionally, the same communication process may be executed repeatedly under different algorithm combinations, so that every communication process is executed once for each algorithm combination; alternatively, all communication processes may be executed under one algorithm combination, after which the next combination is selected and all communication processes are executed again, so that every algorithm combination is applied to all communication processes.
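
Sketched below with illustrative algorithm names (this embodiment does not enumerate specific congestion control or load balancing algorithms), the pairwise combination and the second iteration order, in which all communication processes are executed under each combination, could look like:

```python
from itertools import product
from statistics import mean

# Illustrative algorithm names; the embodiment does not prescribe specific ones.
CONGESTION_CONTROL = ["dcqcn", "timely", "hpcc"]
LOAD_BALANCING = ["ecmp", "flowlet", "packet_spray"]

ALGORITHM_SET = list(product(CONGESTION_CONTROL, LOAD_BALANCING))


def sweep(comm_processes, execute):
    """Run every communication process under every (congestion control,
    load balancing) combination; `execute` is a caller-supplied hook that
    returns the completion time of one communication process."""
    results = {}
    for cc, lb in ALGORITHM_SET:                   # one combination at a time
        times = [execute(proc, cc, lb) for proc in comm_processes]
        results[(cc, lb)] = mean(times)            # average completion time
    return results
```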

In large-scale distributed training, memory constraints make it necessary to introduce three-dimensional hybrid parallelism. Under this parallelism, the traffic pattern of a training iteration is fixed and divided into stages such as data loading, warm-up, one-forward-one-backward, cool-down and gradient synchronization. In its network traffic simulation, this embodiment reproduces the pipelined processing in which batches of data are fed in over the training iterations and each micro-batch is computed in turn; tensor-parallel communication is carried out during the computation, and point-to-point communication passes the results to the next computing node. Throughout this process, the end-to-end network performance of every flow is collected and visualized in order to evaluate the network performance. Furthermore, whereas existing analysis of performance indicators is mainly focused on the network level and lacks a comprehensive evaluation of the training process, this embodiment provides a comprehensive evaluation of the model training process based on both network performance and model performance.

On this basis, in step 423, after executing the target communication process corresponding to the pipeline stage based on the target algorithm combination, the method may further include:

Step 404: collect performance indicators during the target communication process; the performance indicators include network performance indicators and training performance indicators; the network performance indicators include link utilization, end-to-end delay and network throughput, and the training performance indicators include single training iteration time, training efficiency and communication overhead ratio;

Step 405: evaluate the network topology and the training process of the large model to be trained according to the performance indicators to obtain performance evaluation information.

The performance indicators of the target communication process are collected; they include network performance indicators and training performance indicators. The network performance indicators include link utilization, end-to-end delay and network throughput, and the training performance indicators include single training iteration time, training efficiency and communication overhead ratio.

The network topology and the training process of the large model to be trained are evaluated according to the collected performance indicators, yielding the corresponding performance evaluation information. Optionally, the user's configuration information can be optimized on the basis of this evaluation information, thereby optimizing the network topology used for model training.
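
As an illustrative sketch of how these indicators could be computed from simulated flow records (the formulas are standard definitions chosen here and are not prescribed by this embodiment):

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    bytes_sent: int
    start_s: float
    finish_s: float            # end of this flow's transmission
    link_capacity_bps: float


def network_metrics(flows: list[FlowRecord]) -> dict:
    duration = max(f.finish_s for f in flows) - min(f.start_s for f in flows)
    total_bits = sum(f.bytes_sent * 8 for f in flows)
    return {
        "end_to_end_delay_s": [f.finish_s - f.start_s for f in flows],
        "throughput_bps": total_bits / duration,
        # assumes all flows share one bottleneck link of known capacity
        "link_utilization": total_bits / (duration * flows[0].link_capacity_bps),
    }


def training_metrics(iter_time_s: float, compute_time_s: float) -> dict:
    comm_time = iter_time_s - compute_time_s
    return {
        "iteration_time_s": iter_time_s,
        # training efficiency is taken here as the compute-time fraction (an assumption)
        "training_efficiency": compute_time_s / iter_time_s,
        "comm_overhead_ratio": comm_time / iter_time_s,
    }
```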

In one embodiment, referring to the network traffic simulation flow shown in FIG. 2, the user's configuration information, which includes network configuration information and training configuration information, is first obtained. The network topology is defined according to the network configuration information, including setting the number of nodes and configuring the link bandwidth and link delay between nodes, and the training parameters of the large model to be trained are configured according to the training configuration information, including setting the model parameter scale, the training parallelism strategy and the training batch size.

Based on the configured network topology and training parameters, the communication load matrix is generated and the simulation environment is initialized; the network traffic simulation is then executed according to the combination of congestion control algorithm and load balancing algorithm, simulating the network traffic of the model training process of the large model to be trained. The performance indicators of the network traffic during the simulation are monitored and collected; they include network performance indicators, used to evaluate the network performance of model training, and training performance indicators, used to evaluate the training performance of model training.

In one embodiment, the network configuration information includes the number of nodes and, further, the configuration information of the network topology, specifically the number of GPUs per server, the number of servers, the number of network switches, the number of network links, the server type and the switch numbering. Typically, for the node layout, eight network nodes are connected to an intra-server switch, and these eight nodes are in turn connected to the leaf-layer switches. The link bandwidth is configurable within a certain range to cover network environments ranging from small clusters to hyperscale data centers, and the link delay is also configurable within a certain range, for example between 0.001 ms and 10 ms, to model links of different lengths and thus reproduce cluster deployments across different geographical locations.
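
A minimal sketch of such a two-tier layout, with eight nodes per intra-server switch as described above (the parameter names and default values are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Link:
    a: str
    b: str
    bandwidth_gbps: float
    delay_ms: float


def build_topology(num_servers: int, gpus_per_server: int = 8,
                   intra_bw: float = 400.0, intra_delay: float = 0.001,
                   leaf_bw: float = 100.0, leaf_delay: float = 0.005,
                   num_leaf_switches: int = 2) -> list[Link]:
    """Two-tier layout: each server's GPUs hang off an intra-server switch,
    and every intra-server switch uplinks to each leaf switch."""
    links = []
    for s in range(num_servers):
        tor = f"intra_switch_{s}"
        for g in range(gpus_per_server):
            links.append(Link(f"gpu_{s}_{g}", tor, intra_bw, intra_delay))
        for lf in range(num_leaf_switches):
            links.append(Link(tor, f"leaf_{lf}", leaf_bw, leaf_delay))
    return links
```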

In some embodiments, a discrete event simulation (DES) method is used to precisely simulate the transmission of data packets through the network, and the end-to-end completion time of the flows, the link occupancy and the bandwidth utilization are recorded. When the pipeline executes a communication process, the current pipeline stage is determined from the number of forward executions and backward executions of the current network node, and the communication process corresponding to that stage is executed. In any communication process, the communication mode and the next state of the current network node, as well as the next network node, are obtained; the communication modes include no communication, unidirectional communication, bidirectional communication and sleep-wait.
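
A bare-bones discrete event loop of the kind referred to here, using only Python's standard heapq (no particular DES framework is specified in this embodiment), might look like:

```python
import heapq

def run_des(initial_events):
    """Minimal discrete event loop: events are (time, seq, callback) tuples;
    a callback may schedule further events by returning (delay, callback) pairs."""
    queue = list(initial_events)
    heapq.heapify(queue)
    seq = len(queue)
    now = 0.0
    while queue:
        now, _, callback = heapq.heappop(queue)
        for delay, new_cb in callback(now):
            seq += 1
            heapq.heappush(queue, (now + delay, seq, new_cb))
    return now                                    # time of the last event


# Example: one packet whose arrival is logged after a 1.5 ms link delay.
def packet_arrived(now):
    print(f"{now * 1e3:.3f} ms: packet arrived")
    return []                                     # no further events

def send_packet(now):
    print(f"{now * 1e3:.3f} ms: packet injected")
    return [(0.0015, packet_arrived)]             # arrives after the link delay

run_des([(0.0, 0, send_packet)])
```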

Further, when executing a communication process: if the communication mode of the current network node is no communication, its next state is backward computation; if the communication mode is unidirectional communication, a data packet is sent to the next network node and the current node's next state is backward computation; if the communication mode is bidirectional communication, the current node and the previous node send data packets to each other and the current node's next state is forward computation; if the communication mode is sleep-wait, the node waits for a packet to arrive and its next state is backward computation. At any moment of the model training process, the communication modes of different nodes in the network topology may be the same or different.
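
The rule set above can be transcribed directly into a small state machine; the mode and state names below are paraphrases of the description, and the send hook is an assumed caller-supplied function:

```python
from enum import Enum, auto

class CommMode(Enum):
    NO_COMM = auto()
    UNIDIRECTIONAL = auto()
    BIDIRECTIONAL = auto()
    SLEEP_WAIT = auto()

class NodeState(Enum):
    FORWARD_COMPUTE = auto()
    BACKWARD_COMPUTE = auto()


def step_communication(mode: CommMode, node, prev_node, next_node, send) -> NodeState:
    """Apply the per-node communication rule and return the node's next state.
    `send(src, dst)` is a caller-supplied hook that injects a packet."""
    if mode is CommMode.NO_COMM:
        return NodeState.BACKWARD_COMPUTE
    if mode is CommMode.UNIDIRECTIONAL:
        send(node, next_node)                # pass data to the next node
        return NodeState.BACKWARD_COMPUTE
    if mode is CommMode.BIDIRECTIONAL:
        send(node, prev_node)                # exchange with the previous node
        send(prev_node, node)
        return NodeState.FORWARD_COMPUTE
    # SLEEP_WAIT: wait for an incoming packet, then move to backward computation
    return NodeState.BACKWARD_COMPUTE
```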

Optionally, after the simulation of the network traffic is finished, a simulation report can also be generated; the report contains the collected performance indicators together with an evaluation of the user's configuration information based on those indicators, and this evaluation can be used to guide the user in adjusting the configuration information.

In this embodiment, customizable configuration of the network topology and the model training parameters supports diversified simulation of the network traffic of the model training process. Moreover, the training parallelism strategy makes it possible to reproduce the three-dimensional hybrid parallel training flow of large model training, which matches the traffic pattern of large model training and enables accurate and flexible simulation of the complex, specific communication patterns of the large model training process, ensuring that the simulation results fully reflect the network communication requirements of actual model training.

Further, during the simulation of the network traffic, the end-to-end performance indicators of every flow can be visualized, so that accurate network performance indicators and model training performance indicators can be obtained without a real cluster. The method also supports testing load balancing and congestion control algorithms at the network transport layer, as well as switching between different algorithms, so that the performance indicators of different algorithms in large-scale distributed training can be obtained, giving the method strong adaptability.

The network traffic simulation device for large model training provided by the present invention is described below; the device described below and the network traffic simulation method for large model training described above may be referred to correspondingly.

Referring to FIG. 3, the network traffic simulation device for large model training provided by an embodiment of the present invention includes:

an information acquisition module 10, configured to obtain configuration information of a user;

a simulation configuration module 20, configured to define a network topology and training parameters of a large model to be trained according to the configuration information;

a load generation module 30, configured to generate a communication load matrix based on the training parameters and the network topology, the communication load matrix being used to characterize the computing time and data transmission requirements of each network node in the network topology; and

a simulation module 40, configured to perform traffic simulation according to the communication load matrix to simulate the network traffic during the model training process of the large model to be trained.

在一个实施例中,所述模拟配置模块20,还用于:In one embodiment, the simulation configuration module 20 is further used for:

根据所述配置信息中的网络配置信息确定节点数量,并根据所述节点数量设置网络节点;所述网络配置信息包括节点配置信息;Determine the number of nodes according to the network configuration information in the configuration information, and set the network nodes according to the number of nodes; the network configuration information includes node configuration information;

基于所述节点配置信息配置所述网络节点的关键参数,以定义网络拓扑结构;所述关键参数包括所述网络节点之间的连接方式、链路带宽和链路延迟;Configuring key parameters of the network nodes based on the node configuration information to define a network topology; the key parameters include a connection mode, link bandwidth, and link delay between the network nodes;

根据所述配置信息中的模型配置信息定义待训练大模型的训练参数;所述训练参数包括模型参数规模、训练并行策略和训练批量大小。The training parameters of the large model to be trained are defined according to the model configuration information in the configuration information; the training parameters include model parameter scale, training parallel strategy and training batch size.

在一个实施例中,所述训练参数还包括隐藏层大小、模型层数、注意力机制的注意力头数量、前向反馈网络隐藏层大小、词汇表大小和序列长度;In one embodiment, the training parameters further include hidden layer size, number of model layers, number of attention heads of the attention mechanism, hidden layer size of the feedforward network, vocabulary size, and sequence length;

所述训练并行策略包括数据并行、张量并行和流水线并行;The training parallel strategies include data parallelism, tensor parallelism and pipeline parallelism;

所述训练批量大小包括全局批量大小和微批量大小。The training batch size includes a global batch size and a micro batch size.
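To make the configurable fields listed above concrete, a user configuration could look like the following minimal sketch. The field names and numeric values are illustrative assumptions for this sketch, not a schema fixed by this embodiment.

```python
# Illustrative user configuration covering the fields listed above.
# Field names and values are assumptions for the sketch, not a fixed schema.
user_config = {
    "network": {
        "num_nodes": 16,              # number of simulated network nodes
        "topology": "fat-tree",       # connection mode between nodes
        "link_bandwidth_gbps": 100,   # link bandwidth
        "link_latency_us": 2,         # link delay
    },
    "model": {
        "param_scale": 7e9,           # model parameter scale
        "parallelism": {              # training parallel strategy
            "data": 4, "tensor": 2, "pipeline": 2,
        },
        "global_batch_size": 512,     # training batch sizes
        "micro_batch_size": 4,
        "hidden_size": 4096,
        "num_layers": 32,
        "num_attention_heads": 32,
        "ffn_hidden_size": 11008,
        "vocab_size": 32000,
        "seq_length": 4096,
    },
}
```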

在一个实施例中,所述负载生成模块30,还用于:In one embodiment, the load generation module 30 is further configured to:

基于所述训练参数和所述网络拓扑结构,确定待训练大模型的训练过程中模型参数同步和梯度传输的流量模式;Based on the training parameters and the network topology, determine the traffic pattern of model parameter synchronization and gradient transmission during the training process of the large model to be trained;

根据所述流量模式生成通信负载矩阵;所述通信负载矩阵包括前向计算时长、前向通信时长和后向计算时长。A communication load matrix is generated according to the traffic pattern; the communication load matrix includes forward calculation duration, forward communication duration and backward calculation duration.
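A minimal sketch of deriving such a load matrix is shown below. The cost formulas (a dense-transformer FLOP rule of thumb for computation, activation size divided by link bandwidth for communication) are simplified analytical assumptions used only for illustration; they do not reproduce the exact load model of this embodiment. With a configuration like the one sketched earlier, the function yields one row per pipeline stage.

```python
# Rough sketch of deriving a per-pipeline-stage load matrix from the training
# parameters. The cost formulas are simplified analytical assumptions
# (dense-transformer FLOP rule of thumb; activation size over link bandwidth),
# not the exact load model of this embodiment.
def build_load_matrix(model: dict, network: dict,
                      flops_per_node: float = 300e12) -> list[dict]:
    pp = model["parallelism"]["pipeline"]
    tokens = model["micro_batch_size"] * model["seq_length"]

    # Forward pass: ~2 FLOPs per parameter per token; backward ~2x forward.
    params_per_stage = model["param_scale"] / pp
    fwd_compute_s = 2 * params_per_stage * tokens / flops_per_node
    bwd_compute_s = 2 * fwd_compute_s

    # Forward communication: activation tensor handed to the next pipeline
    # stage (assuming fp16 activations, i.e. 2 bytes per element).
    act_bytes = 2 * tokens * model["hidden_size"]
    fwd_comm_s = act_bytes * 8 / (network["link_bandwidth_gbps"] * 1e9)

    return [{"stage": s,
             "fwd_compute_s": fwd_compute_s,
             "fwd_comm_s": fwd_comm_s,
             "bwd_compute_s": bwd_compute_s}
            for s in range(pp)]
```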

在一个实施例中,所述模拟仿真模块40,还用于:In one embodiment, the simulation module 40 is further used for:

根据所述通信负载矩阵确定计算过程集合和通信过程集合;所述计算过程集合包括前向计算过程集合和后向计算过程集合;Determine a computing process set and a communication process set according to the communication load matrix; the computing process set includes a forward computing process set and a backward computing process set;

将所述计算过程集合和所述通信过程集合输入至全局流水线调度器进行分配;所述全局流水线调度器根据网络节点的前向执行次数和后向执行次数确定流水线阶段;Inputting the computing process set and the communication process set into a global pipeline scheduler for allocation; the global pipeline scheduler determines the pipeline stage according to the forward execution times and the backward execution times of the network nodes;

执行所述流水线阶段对应的目标通信过程,以模拟所述待训练大模型的模型训练过程中的网络流量。Execute the target communication process corresponding to the pipeline stage to simulate the network traffic during the model training process of the large model to be trained.
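The rule below is a sketch of a global pipeline scheduler that decides, per stage, whether to run the next forward or backward step purely from each stage's forward and backward execution counts (a 1F1B-style ordering). It is an assumed illustration of the idea; the embodiment does not prescribe this particular scheduling rule.

```python
# Sketch of a global pipeline scheduler that orders forward/backward steps per
# stage by tracking each stage's forward and backward execution counts
# (1F1B-style). Illustrative only; not necessarily the rule used here.
def schedule_pipeline(num_stages: int, num_microbatches: int) -> list[tuple[int, str, int]]:
    fwd_done = [0] * num_stages   # forward execution count per stage
    bwd_done = [0] * num_stages   # backward execution count per stage
    events: list[tuple[int, str, int]] = []   # (stage, "fwd"/"bwd", microbatch)

    while any(b < num_microbatches for b in bwd_done):
        for s in range(num_stages):
            warmup = num_stages - s          # forwards allowed before first backward
            can_fwd = (fwd_done[s] < num_microbatches
                       and (s == 0 or fwd_done[s] < fwd_done[s - 1]))
            can_bwd = (bwd_done[s] < fwd_done[s]
                       and (s == num_stages - 1 or bwd_done[s] < bwd_done[s + 1]))
            if can_fwd and (fwd_done[s] - bwd_done[s] < warmup or not can_bwd):
                events.append((s, "fwd", fwd_done[s]))
                fwd_done[s] += 1
            elif can_bwd:
                events.append((s, "bwd", bwd_done[s]))
                bwd_done[s] += 1
    return events
```

For example, `schedule_pipeline(2, 2)` interleaves the forward and backward steps of the two micro-batches across the two stages while respecting the forward (downstream) and backward (upstream) dependencies.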

在一个实施例中,所述模拟仿真模块40,还用于:In one embodiment, the simulation module 40 is further used for:

获取预设的算法集合,并从所述算法集合中选取目标算法组合;所述算法集合中包括多个算法组合,各所述算法组合由拥塞控制算法和负载均衡算法组合得到;所述目标算法组合是所述多个算法组合中的任意一个;Obtain a preset algorithm set, and select a target algorithm combination from the algorithm set; the algorithm set includes multiple algorithm combinations, each of which is obtained by combining a congestion control algorithm and a load balancing algorithm; the target algorithm combination is any one of the multiple algorithm combinations;

基于所述目标算法组合,执行所述流水线阶段对应的目标通信过程;Based on the target algorithm combination, executing the target communication process corresponding to the pipeline stage;

返回并执行所述从所述算法集合中选取目标算法组合的步骤,直到所述目标算法组合是所述算法集合中的最后一个为止。Return and execute the step of selecting a target algorithm combination from the algorithm set until the target algorithm combination is the last one in the algorithm set.
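A sketch of this switching test is shown below: every combination of a congestion control algorithm and a load balancing algorithm is simulated once, and the loop ends after the last combination in the set. The algorithm names are placeholders and `run_pipeline_stage` is an assumed callback, not an interface defined by this embodiment.

```python
# Sketch of sweeping every (congestion control, load balancing) combination and
# re-running the simulated communication for each one. Algorithm names are
# placeholders; the runner callback is an assumption for this sketch.
from itertools import product

CONGESTION_CONTROL = ["dcqcn", "timely", "hpcc"]      # illustrative choices
LOAD_BALANCING = ["ecmp", "flowlet", "packet_spray"]  # illustrative choices


def sweep_algorithms(run_pipeline_stage) -> dict[tuple[str, str], dict]:
    """run_pipeline_stage(cc, lb) -> dict of collected metrics (assumed signature)."""
    results = {}
    for cc, lb in product(CONGESTION_CONTROL, LOAD_BALANCING):
        # Each target combination is simulated once; the loop naturally stops
        # after the last combination in the algorithm set.
        results[(cc, lb)] = run_pipeline_stage(cc, lb)
    return results
```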

在一个实施例中,所述模拟仿真模块40,还用于:In one embodiment, the simulation module 40 is further used for:

采集所述目标通信过程中的性能指标;所述性能指标包括网络性能指标和训练性能指标;所述网络性能指标包括链路利用率、端到端延迟和网络吞吐量,所述训练性能指标包括单次训练迭代时间、训练效率和通信开销比例;Collecting performance indicators during the target communication process; the performance indicators include network performance indicators and training performance indicators; the network performance indicators include link utilization, end-to-end delay and network throughput, and the training performance indicators include single training iteration time, training efficiency and communication overhead ratio;

根据所述性能指标对所述网络拓扑结构和所述待训练大模型的训练过程进行评价,得到性能评价信息。The network topology structure and the training process of the large model to be trained are evaluated according to the performance indicators to obtain performance evaluation information.
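The aggregation below sketches how these indicators could be computed from per-flow simulation records; the formulas are plain definitions (for example, communication overhead ratio = communication time / iteration time) assumed here only for illustration, and the record fields are assumptions of this sketch.

```python
# Sketch of aggregating the indicators listed above from per-flow simulation
# records. Each record is assumed to carry "bytes", "start_s" and "end_s";
# the formulas are plain definitions used here only for illustration.
def summarize_metrics(flows: list[dict], compute_s: float, comm_s: float,
                      link_capacity_bps: float, sim_duration_s: float) -> dict:
    total_bytes = sum(f["bytes"] for f in flows)
    delays = [f["end_s"] - f["start_s"] for f in flows]
    iteration_s = compute_s + comm_s

    return {
        # Network performance indicators
        "link_utilization": total_bytes * 8 / (link_capacity_bps * sim_duration_s),
        "avg_end_to_end_delay_s": sum(delays) / len(delays) if delays else 0.0,
        "throughput_bps": total_bytes * 8 / sim_duration_s,
        # Training performance indicators
        "iteration_time_s": iteration_s,
        "training_efficiency": compute_s / iteration_s,
        "comm_overhead_ratio": comm_s / iteration_s,
    }
```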

图4示例了一种电子设备的实体结构示意图，如图4所示，该电子设备可以包括：处理器（processor）410、通信接口（Communications Interface）420、存储器（memory）430和通信总线440，其中，处理器410，通信接口420，存储器430通过通信总线440完成相互间的通信。处理器410可以调用存储器430中的逻辑指令，以执行大模型训练的网络流量模拟方法的步骤，例如包括：FIG. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG. 4, the electronic device may include: a processor 410, a communications interface 420, a memory 430, and a communication bus 440, wherein the processor 410, the communications interface 420, and the memory 430 communicate with each other through the communication bus 440. The processor 410 may call the logic instructions in the memory 430 to execute the steps of the network traffic simulation method for large model training, for example, including:

获取用户的配置信息;Get the user's configuration information;

根据所述配置信息定义网络拓扑结构和待训练大模型的训练参数;Define the network topology and the training parameters of the large model to be trained according to the configuration information;

基于所述训练参数和所述网络拓扑结构生成通信负载矩阵;所述通信负载矩阵用于表征所述网络拓扑结构中各网络节点的计算时间和数据传输需求;Generate a communication load matrix based on the training parameters and the network topology; the communication load matrix is used to characterize the computing time and data transmission requirements of each network node in the network topology;

根据所述通信负载矩阵执行流量仿真,模拟所述待训练大模型的模型训练过程中的网络流量。Traffic simulation is performed according to the communication load matrix to simulate the network traffic during the model training process of the large model to be trained.

此外，上述的存储器430中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器（ROM，Read-Only Memory）、随机存取存储器（RAM，Random Access Memory）、磁碟或者光盘等各种可以存储程序代码的介质。In addition, the logic instructions in the above-mentioned memory 430 can be implemented in the form of a software functional unit and, when sold or used as an independent product, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

另一方面,本发明还提供一种计算机程序产品,所述计算机程序产品包括计算机程序,计算机程序可存储在非暂态计算机可读存储介质上,所述计算机程序被处理器执行时,计算机能够执行上述各方法所提供的大模型训练的网络流量模拟方法的步骤,例如包括:On the other hand, the present invention further provides a computer program product, the computer program product includes a computer program, the computer program can be stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, the computer can execute the steps of the network traffic simulation method for large model training provided by the above methods, for example, including:

获取用户的配置信息;Get the user's configuration information;

根据所述配置信息定义网络拓扑结构和待训练大模型的训练参数;Define the network topology and the training parameters of the large model to be trained according to the configuration information;

基于所述训练参数和所述网络拓扑结构生成通信负载矩阵;所述通信负载矩阵用于表征所述网络拓扑结构中各网络节点的计算时间和数据传输需求;Generate a communication load matrix based on the training parameters and the network topology; the communication load matrix is used to characterize the computing time and data transmission requirements of each network node in the network topology;

根据所述通信负载矩阵执行流量仿真,模拟所述待训练大模型的模型训练过程中的网络流量。Traffic simulation is performed according to the communication load matrix to simulate the network traffic during the model training process of the large model to be trained.

又一方面,本发明还提供一种非暂态计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现以执行上述各方法提供的大模型训练的网络流量模拟方法的步骤,例如包括:In another aspect, the present invention further provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the network traffic simulation method for large model training provided by the above methods, for example, including:

获取用户的配置信息;Get the user's configuration information;

根据所述配置信息定义网络拓扑结构和待训练大模型的训练参数;Define the network topology and the training parameters of the large model to be trained according to the configuration information;

基于所述训练参数和所述网络拓扑结构生成通信负载矩阵;所述通信负载矩阵用于表征所述网络拓扑结构中各网络节点的计算时间和数据传输需求;Generate a communication load matrix based on the training parameters and the network topology; the communication load matrix is used to characterize the computing time and data transmission requirements of each network node in the network topology;

根据所述通信负载矩阵执行流量仿真,模拟所述待训练大模型的模型训练过程中的网络流量。Traffic simulation is performed according to the communication load matrix to simulate the network traffic during the model training process of the large model to be trained.

以上所描述的装置实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下，即可以理解并实施。The device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。Through the description of the above implementation methods, those skilled in the art can clearly understand that each implementation method can be implemented by means of software plus a necessary general hardware platform, and of course, can also be implemented by hardware. Based on this understanding, the above technical solution is essentially or the part that contributes to the prior art can be embodied in the form of a software product, and the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a disk, an optical disk, etc., including a number of instructions for a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in each embodiment or some parts of the embodiments.

最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit it. Although the present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the aforementioned embodiments, or make equivalent replacements for some of the technical features therein. However, these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A network traffic simulation method for large model training, characterized by comprising: obtaining configuration information of a user; defining a network topology and training parameters of a large model to be trained according to the configuration information, the training parameters including a model parameter scale, a training parallel strategy and a training batch size; generating a communication load matrix based on the training parameters and the network topology, the communication load matrix being used to characterize the computing time and data transmission requirements of each network node in the network topology; and performing traffic simulation according to the communication load matrix to simulate network traffic during a model training process of the large model to be trained; wherein generating the communication load matrix based on the training parameters and the network topology comprises: determining, based on the training parameters and the network topology, a traffic pattern of model parameter synchronization and gradient transmission during the training process of the large model to be trained; and generating the communication load matrix according to the traffic pattern, the communication load matrix including a forward calculation duration, a forward communication duration and a backward calculation duration.

2. The network traffic simulation method for large model training according to claim 1, characterized in that defining the network topology and the training parameters of the large model to be trained according to the configuration information comprises: determining a number of nodes according to network configuration information in the configuration information, and setting network nodes according to the number of nodes, the network configuration information including node configuration information; configuring key parameters of the network nodes based on the node configuration information to define the network topology, the key parameters including a connection mode between the network nodes, a link bandwidth and a link delay; and defining the training parameters of the large model to be trained according to model configuration information in the configuration information.

3. The network traffic simulation method for large model training according to claim 1, characterized in that the training parameters further include a hidden layer size, a number of model layers, a number of attention heads of an attention mechanism, a hidden layer size of a feedforward network, a vocabulary size and a sequence length; the training parallel strategy includes data parallelism, tensor parallelism and pipeline parallelism; and the training batch size includes a global batch size and a micro batch size.

4. The network traffic simulation method for large model training according to claim 1, characterized in that performing traffic simulation according to the communication load matrix to simulate the network traffic during the model training process of the large model to be trained comprises: determining a computing process set and a communication process set according to the communication load matrix, the computing process set including a forward computing process set and a backward computing process set; inputting the computing process set and the communication process set into a global pipeline scheduler for allocation, the global pipeline scheduler determining pipeline stages according to forward execution times and backward execution times of the network nodes; and executing a target communication process corresponding to the pipeline stage to simulate the network traffic during the model training process of the large model to be trained.

5. The network traffic simulation method for large model training according to claim 4, characterized in that executing the target communication process corresponding to the pipeline stage comprises: obtaining a preset algorithm set, and selecting a target algorithm combination from the algorithm set, the algorithm set including a plurality of algorithm combinations, each algorithm combination being obtained by combining a congestion control algorithm and a load balancing algorithm, and the target algorithm combination being any one of the plurality of algorithm combinations; executing the target communication process corresponding to the pipeline stage based on the target algorithm combination; and returning to and executing the step of selecting a target algorithm combination from the algorithm set until the target algorithm combination is the last one in the algorithm set.

6. The network traffic simulation method for large model training according to claim 4, characterized in that after executing the target communication process corresponding to the pipeline stage, the method further comprises: collecting performance indicators during the target communication process, the performance indicators including network performance indicators and training performance indicators, the network performance indicators including link utilization, end-to-end delay and network throughput, and the training performance indicators including single training iteration time, training efficiency and communication overhead ratio; and evaluating the network topology and the training process of the large model to be trained according to the performance indicators to obtain performance evaluation information.

7. A network traffic simulation device for large model training, characterized by comprising: an information acquisition module, configured to obtain configuration information of a user; a simulation configuration module, configured to define a network topology and training parameters of a large model to be trained according to the configuration information, the training parameters including a model parameter scale, a training parallel strategy and a training batch size; a load generation module, configured to generate a communication load matrix based on the training parameters and the network topology, the communication load matrix being used to characterize the computing time and data transmission requirements of each network node in the network topology; and a simulation module, configured to perform traffic simulation according to the communication load matrix to simulate network traffic during a model training process of the large model to be trained; wherein the load generation module is further configured to generate the communication load matrix by: determining, based on the training parameters and the network topology, a traffic pattern of model parameter synchronization and gradient transmission during the training process of the large model to be trained; and generating the communication load matrix according to the traffic pattern, the communication load matrix including a forward calculation duration, a forward communication duration and a backward calculation duration.

8. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the network traffic simulation method for large model training according to any one of claims 1 to 6.

9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the network traffic simulation method for large model training according to any one of claims 1 to 6.
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载