CN119151067A

CN119151067A - System performance prediction model training method, system performance prediction method and device

Info

Publication number: CN119151067A
Application number: CN202411312676.4A
Authority: CN
Inventors: 张君友; 聂延凯
Original assignee: Beijing Aientropy Technology Co ltd
Current assignee: Beijing Aientropy Technology Co ltd
Priority date: 2024-09-20
Filing date: 2024-09-20
Publication date: 2024-12-17
Anticipated expiration: 2044-09-20
Also published as: CN119151067B

Abstract

The application discloses a system performance prediction model training method, a system performance prediction method and a system performance prediction device, and relates to the technical fields of artificial intelligence and computing power cluster system performance prediction. The training method comprises: obtaining sample characteristic data for training a system performance prediction model, wherein the sample characteristic data comprises cluster characteristic data of a known computing power cluster and system performance data obtained by performing a benchmark test on the known computing power cluster; inputting the cluster characteristic data into the system performance prediction model to obtain output system performance prediction data, wherein the model structure of the system performance prediction model has a residual block stacking layer; determining, based on the system performance prediction data and the corresponding system performance data, whether the current round of model training meets a convergence condition; and, if the convergence condition is not met, adjusting model parameters of the system performance prediction model and executing the next round of model training. The method allows the system performance of a computing power cluster to be predicted accurately.

Description

System performance prediction model training method, system performance prediction method and device

Technical Field

The application relates to the technical field of artificial intelligence and the technical field of performance prediction of computing power cluster systems, in particular to a system performance prediction model training method, a system performance prediction method and a system performance prediction device.

Background

With the development of artificial intelligence, and especially the rapid development of large models, the demand for computing power keeps rising. Building a computing center usually requires thousands of processors and accelerator cards together with various high-speed buses, and the overall performance of the system is affected by many factors, such as the processors, accelerator cards, communication modes, model algorithms and scheduling algorithms. If system performance and facility investment cost cannot be adequately estimated in the early stage of project construction, computing power is easily wasted or falls short, and neither outcome is desirable.

How to analyze historical test data in light of actual application demands, recommend a reasonable combination of software and hardware facilities, and ensure that the cost and performance of the final system are optimal has become a topic of interest in the industry.

Disclosure of Invention

The embodiment of the application provides a system performance prediction model training method, a system performance prediction method and a system performance prediction device, which are used for solving the problem of how to accurately predict the system performance of a computing power cluster in the prior art.

The embodiment of the application provides a system performance prediction model training method, which comprises the following steps:

acquiring sample feature data for training a system performance prediction model, wherein the sample feature data comprises cluster feature data of a known computing power cluster and system performance data obtained by performing a benchmark test on the known computing power cluster;

Inputting the cluster characteristic data into the system performance prediction model to obtain output system performance prediction data, wherein a model structure of the system performance prediction model is provided with a residual block stacking layer, and the residual block stacking layer comprises residual blocks;

Determining whether the model training meets a convergence condition or not based on the system performance prediction data and the corresponding system performance data;

And if the convergence condition is met, determining that the training of the system performance prediction model is completed, and if the convergence condition is not met, adjusting model parameters of the system performance prediction model and executing the next model training.

Further, the cluster characteristic data comprises cluster quantitative characteristic data and cluster qualitative characteristic data;

the residual block stacking layer comprises a plurality of residual blocks connected in series;

Each of the residual blocks has two inputs and one output;

the inputs of the plurality of residual blocks each comprise a qualitative feature vector representing the cluster qualitative feature data;

The other input of the first residual block is a quantitative feature vector representing the cluster quantitative feature data, the other input of the residual blocks except the first one is the output of the connected previous residual block, and the output of the last residual block is taken as the output of the residual block stacking layer.

Further, the operations performed in the residual block include the following operations:

Multiplying the qualitative feature vector by an association matrix to obtain an association vector;

adding the association vector to the quantitative feature vector or the output of the previous residual block to obtain a combined feature vector;

multiplying the combined feature vector by a residual matrix to obtain a residual vector;

Adding the residual vector and the quantitative feature vector or the output of the previous residual block to obtain a jump connection feature vector;

normalizing the jump connection feature vector to obtain the output of the residual block;

The association matrix and the residual matrix are used as model parameters of the system performance prediction model.

Further, the model structure of the system performance prediction model is provided with an input layer and a characteristic preprocessing layer;

The input layer is used for receiving the cluster quantitative characteristic data and the cluster qualitative characteristic data;

The feature preprocessing layer is used for preprocessing the cluster quantitative feature data through a multi-layer perceptron (MLP) network to obtain the quantitative feature vector, and for generating the qualitative feature vector corresponding to the cluster qualitative feature data by label lookup encoding.

Further, the model structure of the system performance prediction model is provided with a summarizing layer and an output layer;

The summarizing layer is used for processing the output of the residual block stacking layer through an MLP network to obtain the system performance prediction data;

the output layer is used for outputting the system performance prediction data.

The embodiment of the application also provides a method for predicting the performance of the computing power cluster system, which comprises the following steps:

acquiring cluster characteristic data of a computing power cluster to be predicted;

predicting, based on the cluster characteristic data, the system performance of the computing power cluster to be predicted by using a system performance prediction model trained with any one of the above system performance prediction model training methods, so as to obtain system performance prediction data.

The embodiment of the application also provides a system performance prediction model training device, which comprises:

a sample data acquisition module, used for acquiring sample characteristic data for training a system performance prediction model, wherein the sample characteristic data comprises cluster characteristic data of a known computing power cluster and system performance data obtained by performing a benchmark test on the known computing power cluster;

The system performance prediction module is used for inputting the cluster characteristic data into the system performance prediction model to obtain output system performance prediction data, wherein a model structure of the system performance prediction model is provided with a residual block stacking layer, and the residual block stacking layer comprises a residual block;

the convergence judging module is used for determining whether the model training meets a convergence condition or not based on the system performance prediction data and the corresponding system performance data;

And the model training module is used for determining that the training of the system performance prediction model is completed if the convergence condition is met, adjusting the model parameters of the system performance prediction model if the convergence condition is not met, and executing the next model training.

As in the training method, each of the residual blocks has two inputs and one output, and the output layer is used for outputting the system performance prediction data.

The embodiment of the application also provides a device for predicting the performance of the computing power cluster system, which comprises:

the cluster data acquisition module is used for acquiring cluster characteristic data of the computing power cluster to be predicted;

and the system performance prediction module is used for predicting, based on the cluster characteristic data, the system performance of the computing power cluster to be predicted by using the system performance prediction model trained by the system performance prediction model training device, so as to obtain system performance prediction data.

The embodiment of the application also provides an electronic device, which comprises a processor and a machine-readable storage medium, wherein the machine-readable storage medium stores machine-executable instructions executable by the processor, and the machine-executable instructions cause the processor to implement any one of the above system performance prediction model training methods or the above computing power cluster system performance prediction method.

The embodiment of the application also provides a computer readable storage medium in which a computer program is stored; when executed by a processor, the computer program implements any one of the above system performance prediction model training methods or the above computing power cluster system performance prediction method.

The embodiment of the application also provides a computer program product containing instructions, which when run on a computer, cause the computer to execute any one of the system performance prediction model training methods or the computing power cluster system performance prediction method.

The beneficial effects of the application include:

In the method provided by the embodiment of the application, cluster characteristic data of a known computing power cluster is used as the input of a system performance prediction model, and the system performance prediction data output by the model is compared with the system performance data obtained by performing a benchmark test on the known computing power cluster to determine whether the current round of model training meets the convergence condition. Training is iterated until the convergence condition is met, which yields the trained system performance prediction model. Because the model structure of the system performance prediction model has a residual block stacking layer comprising residual blocks, the trained model can accurately predict the system performance of a computing power cluster to be predicted from that cluster's characteristic data.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate the application and together with the embodiments of the application, serve to explain the application. In the drawings:

FIG. 1 is a flowchart of a system performance prediction model training method provided by an embodiment of the present application;

FIG. 2 is a flowchart of a method for predicting performance of a computing power cluster system according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a system performance prediction model according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a residual block stack layer of a system performance prediction model according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a residual block stacking layer according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a training device for a system performance prediction model according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a performance prediction apparatus for a computing power cluster system according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to provide an implementation scheme for accurately predicting the system performance of a computing power cluster, the embodiments of the application provide a system performance prediction model training method, a system performance prediction method and a system performance prediction device, which are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein only illustrate and explain the application and do not limit it. The embodiments of the application and the features of the embodiments may be combined with each other provided there is no conflict.

The embodiment of the application provides a system performance prediction model training method, as shown in fig. 1, comprising the following steps:

Step 11, acquiring sample feature data for training a system performance prediction model, wherein the sample feature data comprises cluster feature data of a known computing power cluster and system performance data obtained by performing a benchmark test on the known computing power cluster;

Step 12, inputting the cluster characteristic data into a system performance prediction model to obtain output system performance prediction data, wherein a model structure of the system performance prediction model is provided with a residual block stacking layer, and the residual block stacking layer comprises a residual block;

Step 13, determining whether the current round of model training meets a convergence condition based on the system performance prediction data and the corresponding system performance data;

Step 14, if the convergence condition is met, determining that the training of the system performance prediction model is completed, and if the convergence condition is not met, adjusting model parameters of the system performance prediction model and executing the next round of model training.
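Steps 11 to 14 amount to an iterate-until-converged loop. The Python sketch below illustrates only that control flow. The one-parameter stand-in model, the learning rate, the loss threshold used as the convergence condition, and the names `train`, `tol` and `max_rounds` are all assumptions made for illustration; the application does not specify the convergence condition, and its actual model is the residual block network described below.

```python
def train(samples, lr=0.01, tol=1e-6, max_rounds=10_000):
    """samples: list of (feature, measured_performance) pairs (step 11)."""
    w = 0.0  # single parameter of a stand-in model y = w * x
    loss = float("inf")
    for round_no in range(1, max_rounds + 1):
        # Step 12: run the model on the cluster features.
        preds = [w * x for x, _ in samples]
        # Step 13: compare predictions with the benchmark results (MSE here).
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, samples)) / len(samples)
        if loss < tol:
            # Step 14, converged: training is complete.
            return w, round_no, loss
        # Step 14, not converged: adjust the parameter, then run the next round.
        grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, samples)) / len(samples)
        w -= lr * grad
    return w, max_rounds, loss


# Toy data following y = 2 * x, so training should drive w towards 2.
w, rounds, loss = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

The `max_rounds` cap is a safety net added for the sketch so that the loop always terminates even if the threshold is never reached.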

Correspondingly, the embodiment of the application also provides a method for predicting the performance of the computing power cluster system, which is shown in fig. 2 and comprises the following steps:

Step 21, acquiring cluster characteristic data of a computing power cluster to be predicted;

Step 22, predicting, based on the cluster characteristic data, the system performance of the computing power cluster to be predicted by using the system performance prediction model trained with the above system performance prediction model training method, so as to obtain system performance prediction data.

With the method provided by the embodiment of the application, the cluster characteristic data of a known computing power cluster is used as the input of the system performance prediction model, and the system performance prediction data output by the model is compared with the system performance data obtained by performing a benchmark test on the known computing power cluster to determine whether the current round of model training meets the convergence condition. Training is iterated until the convergence condition is met, which yields the trained system performance prediction model. Because the model structure of the system performance prediction model has a residual block stacking layer comprising residual blocks, the trained model can accurately predict the system performance of a computing power cluster to be predicted from that cluster's characteristic data.

The method and apparatus provided by the present application will now be described in detail with particular embodiments thereof, with reference to the accompanying drawings.

In the embodiment of the application, the model structure of the system performance prediction model is provided with a residual block stacking layer, as shown in fig. 3, and also can be provided with an input layer, a characteristic preprocessing layer, a summarizing layer and an output layer, wherein the connection relation among the layers is shown in fig. 3, and the input layer, the characteristic preprocessing layer, the residual block stacking layer, the summarizing layer and the output layer are sequentially connected.

In the embodiment of the application, the cluster characteristic data of the computing power cluster that is input to the model can comprise cluster quantitative characteristic data and cluster qualitative characteristic data;

the cluster quantitative characteristic data specifically comprises the number of processors, the number of accelerator cards, the size of a memory, a main frequency and the like, and is generally data of a numerical value type;

the cluster qualitative feature data is feature data which cannot be quantified through numerical values, and specifically can comprise a processor model, an accelerator card model, a chip architecture, a manufacturing process, an instruction set type and the like, and is generally an enumeration type.
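As one way to picture this split, the sketch below holds the two kinds of feature data in a single Python record. The class name `ClusterFeatures`, the field names and every value are hypothetical placeholders chosen for illustration; the application does not prescribe a data layout.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterFeatures:
    # Numeric features such as counts, sizes and frequencies.
    quantitative: dict[str, float] = field(default_factory=dict)
    # Enumeration-type features such as model names and architectures.
    qualitative: dict[str, str] = field(default_factory=dict)


# All names and values below are made up for the example.
cluster = ClusterFeatures(
    quantitative={
        "processor_count": 1024,
        "accelerator_card_count": 8192,
        "memory_gb": 512,
        "main_frequency_ghz": 2.6,
    },
    qualitative={
        "processor_model": "cpu-model-a",
        "accelerator_card_model": "acc-model-b",
        "chip_architecture": "arch-c",
        "instruction_set": "isa-d",
    },
)
```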

In one embodiment of the application, the input layer is configured to receive cluster quantitative feature data and cluster qualitative feature data;

The feature preprocessing layer is used for preprocessing the cluster quantitative feature data output by the input layer through an MLP network to obtain the quantitative feature vector. Specifically, all of the cluster quantitative feature data can be combined into an initial numerical vector, whose length is determined by the number of quantitative indexes actually selected, and the numerical vector is then preprocessed by the MLP network to obtain a 1×n quantitative feature vector;

The feature preprocessing layer is further used for generating the qualitative feature vector corresponding to the cluster qualitative feature data by label lookup encoding. Specifically, each type represented by each kind of cluster qualitative feature data is directly assigned a corresponding numerical value, giving a 1×n label vector. The values may initially be assigned at random, as long as the same type is always assigned the same value afterwards. The label vector is the qualitative feature vector.

The input layer and the feature preprocessing layer thus produce the inputs of the residual block stacking layer, namely the quantitative feature vector and the qualitative feature vector.
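The two branches of the feature preprocessing layer can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the application's implementation: the two-layer MLP shape, the ReLU activation, the random weight initialization, the feature values and the helper names (`mlp`, `LabelLookup`, `random_matrix`) are all hypothetical. The sketch also assumes that exactly n qualitative features are selected so that the label vector has length n, and that codes start at 1.

```python
import random


def linear(x, weights, bias):
    """Row vector x (length r) times an r-by-k weight matrix, plus a bias."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with a ReLU between the layers (assumed shape)."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


class LabelLookup:
    """Assigns each distinct label of a feature a fixed integer code."""

    def __init__(self):
        self._tables = {}  # one lookup table per qualitative feature

    def encode(self, feature, label):
        table = self._tables.setdefault(feature, {})
        # A label seen for the first time gets the next unused code;
        # later lookups of the same label return the same code.
        return table.setdefault(label, len(table) + 1)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, hidden_width = 3, 8

# Quantitative branch: initial numeric vector -> MLP -> 1 x n vector.
raw = [1024.0, 8192.0, 512.0, 2.6]  # made-up quantitative values
w1, b1 = random_matrix(len(raw), hidden_width, rng), [0.0] * hidden_width
w2, b2 = random_matrix(hidden_width, n, rng), [0.0] * n
quantitative_vec = mlp(raw, w1, b1, w2, b2)

# Qualitative branch: one looked-up code per feature -> 1 x n label vector.
lookup = LabelLookup()
labels = {"processor_model": "cpu-a", "accelerator_model": "acc-b", "architecture": "arch-c"}
qualitative_vec = [float(lookup.encode(name, value)) for name, value in labels.items()]
```

In training, the MLP weights would be learned together with the rest of the model, whereas the lookup tables only need to stay fixed once a code has been assigned.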

In one embodiment of the present application, as shown in fig. 4, the residual block stacking layer may include one residual block or a plurality of residual blocks connected in series. Each residual block has two inputs and one output. One input of every residual block is the qualitative feature vector representing the cluster qualitative feature data. The other input of the first residual block is the quantitative feature vector representing the cluster quantitative feature data, and the other input of each subsequent residual block is the output of the previous residual block connected to it. The output of the last residual block is the output of the residual block stacking layer.
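The serial wiring described above can be sketched as a simple loop. The function name `run_stack` and the placeholder `toy_block` are hypothetical; `toy_block` stands in for a real residual block only to make the data flow visible.

```python
def run_stack(blocks, quantitative_vec, qualitative_vec):
    """Feed the quantitative vector through serially connected blocks.

    Every block receives the same qualitative vector as its second input.
    """
    hidden = quantitative_vec  # first input of the first block
    for block in blocks:
        hidden = block(hidden, qualitative_vec)
    return hidden  # output of the last block is the output of the layer


def toy_block(hidden, qualitative_vec):
    # Placeholder for a real residual block: element-wise sum of both inputs.
    return [h + q for h, q in zip(hidden, qualitative_vec)]


output = run_stack([toy_block] * 3, [0.0, 0.0], [1.0, 2.0])
print(output)  # [3.0, 6.0]
```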

Further, as shown in fig. 5, the operations performed in the residual block include the following operations:

Multiplying the qualitative feature vector by an association matrix to obtain an association vector, wherein the association matrix is an n×n matrix and the resulting association vector is a 1×n vector;

Adding the association vector to the quantitative feature vector or to the output of the previous residual block to obtain a combined feature vector. In this operation, if the residual block is the first residual block, the association vector is added to the quantitative feature vector; otherwise, it is added to the output of the previous residual block. The resulting combined feature vector is a 1×n vector, which combines the quantitative and qualitative features;

Multiplying the combined feature vector by a residual matrix to obtain a residual vector, wherein the residual matrix is an n×n matrix and the resulting residual vector is a 1×n vector;

Adding the residual vector to the quantitative feature vector or to the output of the previous residual block to obtain a jump connection feature vector. In this operation, if the residual block is the first residual block, the residual vector is added to the quantitative feature vector; otherwise, it is added to the output of the previous residual block. The resulting jump connection feature vector is a 1×n vector, which realizes the jump connection between the residual vector and the quantitative feature vector or the output of the previous residual block;

Normalizing the jump connection feature vector to obtain the output of the residual block;

The association matrix and the residual matrix are model parameters of the system performance prediction model and are adjusted iteratively during model training.
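The operations inside one residual block can be sketched in plain Python as follows, including the normalization step named in the summary of the method. The application does not say which normalization is used, so the zero-mean, unit-variance normalization here is an assumption, as are the helper names and the identity matrices in the usage lines.

```python
import math


def vec_mat(v, m):
    """Row vector (1 x n) times matrix (n x k)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def normalize(v, eps=1e-5):
    # Assumed normalization: shift to zero mean and scale to unit variance.
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def residual_block(hidden, qualitative_vec, assoc, resid):
    """hidden is the quantitative vector or the previous block's output."""
    association_vec = vec_mat(qualitative_vec, assoc)  # 1 x n
    combined = add(association_vec, hidden)            # combine both features
    residual_vec = vec_mat(combined, resid)            # 1 x n
    skip = add(residual_vec, hidden)                   # jump connection
    return normalize(skip)


# Usage with 2 x 2 identity matrices, chosen only to keep the numbers simple.
identity = [[1.0, 0.0], [0.0, 1.0]]
out = residual_block([1.0, 2.0], [0.0, 1.0], identity, identity)
```

With these inputs the jump connection vector is [2.0, 5.0] before normalization, so the normalized output has one negative and one positive component of equal magnitude.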

In one embodiment of the present application, as shown in FIG. 3, the model structure of the system performance prediction model has a summary layer and an output layer;

the summarizing layer is used for processing the output of the residual block stacking layer through an MLP network to obtain the system performance prediction data. Specifically, the n-dimensional feature vector can be processed into an m-dimensional performance index vector, where m is the number of system performance indexes; the performance indexes can be temperature, power consumption, throughput, inference latency, inference accuracy and the like;

the output layer is used for outputting the system performance prediction data, and the output layer can be one or more neurons and corresponds to each system performance index to be output.

In the embodiment of the application, during model training, after the output system performance prediction data is obtained, whether the current round of model training meets the convergence condition is determined based on the system performance prediction data and the corresponding system performance data obtained by the benchmark test;

In one embodiment of the application, the model may be trained using Mean Square Error (MSE) as a loss function, i.e., determining whether the model training converges, as follows:

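$L = \frac{1}{N}\sum \left(y_{\mathrm{pred}} - y_{\mathrm{true}}\right)^{2}$, where $y_{\mathrm{pred}}$ is the system performance prediction data output by the model, $y_{\mathrm{true}}$ is the actual system performance data obtained by the benchmark test, the sum runs over the samples, and N is the number of samples;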
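As a worked instance of the MSE loss, the short sketch below computes it for three samples with one scalar performance value each. The function name `mse` and the sample values are made up for illustration; with m performance indexes per sample, the same squared difference would additionally be accumulated over the indexes, which is an extension the sketch does not show.

```python
def mse(y_pred, y_true):
    """Mean of the squared differences between predictions and measurements."""
    if len(y_pred) != len(y_true) or not y_pred:
        raise ValueError("y_pred and y_true must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_pred)


# Squared errors are 0, 0 and 4, so the loss is 4 / 3.
loss = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```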

Correspondingly, if the convergence condition is met, determining that the training of the system performance prediction model is completed, and if the convergence condition is not met, adjusting model parameters of the system performance prediction model and executing the next model training;

Specifically, gradient descent or a variant thereof (e.g., the Adam optimizer) may be used to minimize the loss function and update the model parameters, which mainly include the parameters of the MLP network of the feature preprocessing layer, the residual matrices and association matrices of the residual block stacking layer, and the parameters of the MLP network of the summarizing layer.
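For reference, one parameter update of the Adam optimizer mentioned above can be sketched as follows. The function name `adam_step`, the state layout, the hyperparameter values and the toy loss (p - 3)^2 are assumptions made for illustration; in the application's setting the gradients would instead come from the MSE loss with respect to the MLP parameters, the association matrices and the residual matrices.

```python
import math


def adam_step(params, grads, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place and return the parameter list."""
    state["t"] += 1
    t = state["t"]
    for i, g in enumerate(grads):
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


params = [0.0]
state = {"t": 0, "m": [0.0], "v": [0.0]}
for _ in range(500):
    grads = [2 * (params[0] - 3.0)]  # gradient of the toy loss (p - 3)^2
    adam_step(params, grads, state)
```

After the loop, `params[0]` should sit close to the minimizer 3.0 of the toy loss.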

Based on the same inventive concept, according to the system performance prediction model training method provided by the above embodiment of the present application, correspondingly, another embodiment of the present application further provides a system performance prediction model training device, a structural schematic diagram of which is shown in fig. 6, which specifically includes:

A sample data obtaining module 61, configured to obtain sample feature data for training a system performance prediction model, where the sample feature data includes cluster feature data of a known computing power cluster, and system performance data obtained by performing a benchmark test on the known computing power cluster;

a system performance prediction module 62, configured to input the cluster feature data into the system performance prediction model to obtain output system performance prediction data, where a model structure of the system performance prediction model has a residual block stack layer, and the residual block stack layer includes a residual block;

a convergence judging module 63, configured to determine whether the current model training meets a convergence condition based on the system performance prediction data and the corresponding system performance data;

the model training module 64 is configured to determine that the training of the system performance prediction model is completed if the convergence condition is satisfied, adjust model parameters of the system performance prediction model if the convergence condition is not satisfied, and perform the next model training.

As in the training method, each of the residual blocks has two inputs and one output, and the output layer is used for outputting the system performance prediction data.

Based on the same inventive concept, according to the method for predicting performance of a computing power cluster system provided by the above embodiment of the present application, correspondingly, another embodiment of the present application further provides a device for predicting performance of a computing power cluster system, where a schematic structural diagram of the device is shown in fig. 7, and the device specifically includes:

A cluster data acquisition module 71, configured to acquire cluster feature data of a computing power cluster to be predicted;

The system performance prediction module 72 is configured to predict the system performance of the to-be-predicted computing power cluster by using the system performance prediction model obtained by training the system performance prediction model training device based on the cluster feature data, so as to obtain system performance prediction data.

The functions of the above modules correspond to the processing steps in the flows shown in fig. 1 and fig. 2, and are not described again here.

The system performance prediction model training device and the computing power cluster system performance prediction device provided by the embodiment of the application can be realized through a computer program. It should be understood by those skilled in the art that the above module division is only one of many possible module divisions; whether the devices are divided into other modules or not divided into modules at all, they fall within the scope of the present application as long as the system performance prediction model training device and the computing power cluster system performance prediction device have the above functions.

An embodiment of the present application further provides an electronic device, as shown in fig. 8, including a processor 81 and a machine-readable storage medium 82, where the machine-readable storage medium 82 stores machine-executable instructions executable by the processor 81, and the machine-executable instructions cause the processor 81 to implement any one of the above-mentioned system performance prediction model training methods, or to implement the above-mentioned computing power cluster system performance prediction method.

The machine-readable storage medium in the electronic device may include random access memory (Random Access Memory, RAM) or non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc., or may be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus, electronic device, computer readable storage medium and computer program product embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant points, refer to the description of the method embodiments.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A system performance prediction model training method, comprising:

2. The method of claim 1, wherein the cluster characteristic data comprises cluster quantitative characteristic data and cluster qualitative characteristic data;

each of the residual blocks has two inputs and one output;

3. The method of claim 2, wherein the operations performed in the residual block comprise the operations of:

4. The method of claim 2, wherein the model structure of the system performance prediction model has an input layer and a feature pre-processing layer;

5. The method of claim 2, wherein the model structure of the system performance prediction model has a summary layer and an output layer;

the output layer is used for outputting the system performance prediction data.

6. A method for predicting performance of a computing power cluster system, comprising:

predicting, based on the cluster characteristic data, the system performance of the computing power cluster to be predicted by using the system performance prediction model trained by the method according to any one of claims 1-5, so as to obtain system performance prediction data.

7. A system performance prediction model training apparatus, comprising:

8. A computing power cluster system performance prediction apparatus, comprising:

a system performance prediction module, configured to predict, based on the cluster characteristic data, the system performance of the computing power cluster to be predicted by using the system performance prediction model trained by the apparatus of claim 7, so as to obtain system performance prediction data.

9. An electronic device comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor to cause the processor to perform the method of any one of claims 1-5 or to perform the method of claim 6.

10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-5 or implements the method of claim 6.