
CN114861907A - Data computing method, apparatus, storage medium and device


Info

Publication number
CN114861907A
Authority
CN
China
Prior art keywords: target, quantization, weight, data, activation value
Prior art date
Legal status
Granted
Application number
CN202210424489.XA
Other languages
Chinese (zh)
Other versions
CN114861907B (en)
Inventor
李恭政
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202210424489.XA
Publication of CN114861907A
Application granted
Publication of CN114861907B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks


Abstract

The application relates to the field of data computing and provides a data computing method, apparatus, storage medium, and device. The method comprises the following steps: acquiring a floating-point target activation value and target weight that are to take part in a target matrix multiplication; quantizing the target activation value and the target weight respectively to obtain a fixed-point quantized activation value corresponding to the target activation value and a fixed-point quantized weight corresponding to the target weight; performing the target matrix multiplication with the quantized activation value and the quantized weight; and dequantizing the result of the target matrix multiplication according to the quantization scheme of the target activation value or the target weight to obtain floating-point target output data. By quantizing only the limited number of target activation values and target weights that meet the conditions, the method and the device reduce the resources and time consumed during computation and save GPU memory without lowering computation precision.

Description

Data Computing Method, Apparatus, Storage Medium, and Device

Technical Field

The present application relates to the field of data computing, and more particularly to a data computing method, apparatus, storage medium, and device.

Background

In some data computing scenarios, data is typically input into a model corresponding to the scenario, and the result corresponding to the data is obtained through the computation performed by each layer of the model. The activation values taking part in the computation of each layer of the model are usually floating-point data.

For example, when generating images or text with a generative model, the generation target and latent variables are input into the model to obtain the target text. The activation values taking part in the computation of each layer of the model are usually the corresponding 32-bit floating-point data, and the target text is obtained after the model performs its computation. Here, an activation value may refer to the input data or the output data of a layer of the model.

Existing data computing approaches often quantize all floating-point data involved in the computation into fixed-point data in order to speed up computation and improve efficiency. However, computing with floating-point data that has been quantized into fixed-point data sacrifices precision: if all floating-point data are quantized into fixed-point data before computation, the precision of the final result will deviate considerably.

Summary of the Invention

In this context, embodiments of the present application aim to provide a data computing method, apparatus, storage medium, and device that quantize a limited number of target activation values and target weights that are to take part in matrix multiplications, i.e., convert a limited amount of floating-point data into fixed-point data rather than converting all floating-point data into fixed-point data, thereby improving computation efficiency while preserving computation precision.

In a first aspect of the present application, a data computing method is provided, comprising:

acquiring a target activation value and a target weight, wherein the target activation value and the target weight are to take part in a target matrix multiplication and are both floating-point data within a first preset threshold range;

quantizing the target activation value and the target weight respectively to obtain a quantized activation value corresponding to the target activation value and a quantized weight corresponding to the target weight, wherein the quantized activation value and the quantized weight are both fixed-point data within a second preset threshold range;

performing the target matrix multiplication with the quantized activation value and the quantized weight; and

dequantizing the result of the target matrix multiplication according to the quantization scheme of the target activation value or the target weight to obtain target output data, the target output data being floating-point data within the first preset threshold range.
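For illustration only, the following is a minimal NumPy sketch of the four steps above, assuming 8-bit symmetric quantization with max-based per-tensor scales; the variable names, the int8 choice, and the scale computation are assumptions rather than part of the claimed method.

```python
import numpy as np

def quantize(x, scale, qmin=-127, qmax=127):
    # Floating-point -> fixed-point (int8) using a per-tensor scale.
    return np.clip(np.round(x / scale), qmin, qmax).astype(np.int8)

def dequantize(y, scale):
    # Fixed-point -> floating-point using the same scale.
    return y.astype(np.float32) * scale

# Step 1: floating-point target activation value and target weight.
act = np.random.randn(4, 8).astype(np.float32)
wgt = np.random.randn(8, 16).astype(np.float32)

# Step 2: quantize both operands (max-based symmetric scales, an assumption).
act_scale = np.abs(act).max() / 127.0
wgt_scale = np.abs(wgt).max() / 127.0
q_act, q_wgt = quantize(act, act_scale), quantize(wgt, wgt_scale)

# Step 3: target matrix multiplication on fixed-point data (accumulate in int32).
q_out = q_act.astype(np.int32) @ q_wgt.astype(np.int32)

# Step 4: dequantize the result back to floating point.
out = dequantize(q_out, act_scale * wgt_scale)
```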

In an embodiment of the present application, the method is applied in a neural network model;

wherein the neural network model includes at least one of the target matrix multiplications;

the target matrix multiplication is one of the following matrix multiplications:

the QKV matrix multiplication of an attention computation layer;

the mapping matrix multiplication of a mapping layer;

the first fully connected matrix multiplication of a feedforward neural network layer;

the second fully connected matrix multiplication of the feedforward neural network layer.

In an embodiment of the present application, in the attention computation layer, the target activation value is the input data of the attention computation layer, and the target weights include the query weight, key weight, and value weight of the attention computation layer;

quantizing the target weights includes:

quantizing the query weight, the key weight, and the value weight separately along their respective channel dimensions to obtain a quantized query weight corresponding to the query weight, a quantized key weight corresponding to the key weight, and a quantized value weight corresponding to the value weight;

performing the target matrix multiplication with the quantized activation value and the quantized weight includes:

performing matrix multiplications of the quantized activation value with the quantized query weight, the quantized key weight, and the quantized value weight respectively;

dequantizing the result of the target matrix multiplication according to the quantization scheme of the target activation value or the target weight to obtain target output data includes:

dequantizing the results of the three matrix multiplications respectively according to the quantization scheme of the target activation value or the target weight;

computing attention from the three dequantized matrix multiplication results according to a preset rule, as the output data of the attention computation layer.

In an embodiment of the present application, when the attention computation layer is a masked multi-head attention computation layer, the channel dimension of a weight is the attention-head dimension of that weight.

In an embodiment of the present application, the quantization and dequantization are performed by preset fusion operators;

the fusion operators include quantization fusion operators and dequantization fusion operators;

a quantization fusion operator fuses the data computation preceding the quantization of a target activation value with the quantization of that target activation value;

a dequantization fusion operator fuses the dequantization with the data computation following the dequantization.

In an embodiment of the present application, each target matrix multiplication corresponds to one quantization fusion operator and/or one dequantization fusion operator;

wherein the quantization fusion operator corresponding to the QKV matrix multiplication fuses the normalization with the quantization of the normalized activation value, and the dequantization fusion operator corresponding to the QKV matrix multiplication fuses the dequantization of the result of the QKV matrix multiplication with the bias addition;

the quantization fusion operator corresponding to the mapping matrix multiplication fuses the permutation (transpose) with the quantization of the permuted activation value, and the dequantization fusion operator corresponding to the mapping matrix multiplication fuses the dequantization of the result of the mapping matrix multiplication with the bias addition and the residual addition;

the quantization fusion operator corresponding to the first fully connected matrix multiplication fuses the normalization with the quantization of the normalized activation value, and the dequantization fusion operator corresponding to the first fully connected matrix multiplication fuses the dequantization of the result of the first fully connected matrix multiplication with the bias addition and the activation operation;

the dequantization fusion operator corresponding to the second fully connected matrix multiplication fuses the dequantization of the result of the second fully connected matrix multiplication with the bias addition and the normalization.

In an embodiment of the present application, if the neural network model is in a parallel training state, the target activation value and the target weight are quantized respectively according to the parallel training mode of the neural network model;

when the parallel training mode of the neural network model is data parallelism, channel-dimension quantization is applied to the target weights that meet the condition, and tensor-dimension quantization is applied to the target activation values and to the target weights that do not meet the condition;

when the parallel training mode of the neural network model is model parallelism, data-block-dimension quantization is applied to the target activation value and the target weight respectively, where the data-block-dimension quantization schemes of the target activation value and the target weight differ.

In an embodiment of the present application, applying channel-dimension quantization to the target weights that meet the condition includes:

obtaining the attention-head dimension of the target weight;

partitioning the target weight into data blocks according to the attention-head dimension to obtain target weight sub-blocks;

quantizing each target weight sub-block separately.

In an embodiment of the present application, applying data-block-dimension quantization to the target activation value and the target weight respectively includes:

obtaining the model-parallel scale and the attention-head dimension of the target weight;

partitioning the target activation value into data blocks according to the model-parallel scale to obtain target activation value sub-blocks; and partitioning the target weight into data blocks according to the model-parallel scale and the attention-head dimension to obtain target weight sub-blocks;

quantizing each target activation value sub-block and each target weight sub-block separately.

In an embodiment of the present application, partitioning the target weight into data blocks according to the model-parallel scale and the attention-head dimension includes:

using the product of the attention-head dimension of the target weight and the model-parallel scale as the divisor for the data partitioning.
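A minimal sketch of what this per-block partitioning might look like in NumPy, assuming the weight's last axis is evenly divisible; the function name, axis choice, and max-based per-block scales are illustrative assumptions, not the claimed implementation.

```python
import numpy as np

def split_and_scale(weight, head_num, mp_size):
    # Divisor for partitioning: attention-head dimension times the model-parallel scale.
    blocks = np.split(weight, head_num * mp_size, axis=-1)
    # One max-based symmetric scale per data block (per-block quantization).
    scales = [np.abs(b).max() / 127.0 for b in blocks]
    return blocks, scales

# Example: a (1024, 1024) weight, 8 attention heads, model-parallel scale 2.
w = np.random.randn(1024, 1024).astype(np.float32)
blocks, block_scales = split_and_scale(w, head_num=8, mp_size=2)
```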

In a second aspect of the present application, a data computing apparatus is provided, comprising:

an acquisition module configured to acquire a target activation value and a target weight, wherein the target activation value and the target weight are to take part in a target matrix multiplication and are both floating-point data within a first preset threshold range;

a quantization module configured to quantize the target activation value and the target weight respectively to obtain a quantized activation value corresponding to the target activation value and a quantized weight corresponding to the target weight, wherein the quantized activation value and the quantized weight are both fixed-point data within a second preset threshold range;

a computation module configured to perform the target matrix multiplication with the quantized activation value and the quantized weight;

a dequantization module configured to dequantize the result of the target matrix multiplication according to the quantization scheme of the target activation value or the target weight to obtain target output data, the target output data being floating-point data within the first preset threshold range.

In a third aspect of the present application, a computer-readable storage medium is provided, comprising instructions which, when run on a computer, cause the computer to perform the method according to the first aspect.

In a fourth aspect of the present application, a computing device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to the first aspect when executing the computer program.

Compared with the prior art, the data computing method, apparatus, storage medium, and device according to the embodiments of the present application quantize a limited number of target activation values and target weights that are to take part in matrix multiplications, i.e., convert a limited amount of floating-point data into fixed-point data rather than converting all floating-point data into fixed-point data, so that the resources and time consumed during computation are reduced and GPU memory is saved without lowering computation precision, bringing users a better experience.

Brief Description of the Drawings

The above and other objects, features, and advantages of exemplary embodiments of the present application will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present application are shown by way of example and not limitation, in which:

FIG. 1 is a schematic diagram of an application scenario of a data computing method according to some embodiments of the present application;

FIG. 2 is a schematic flowchart of a data computing method according to an embodiment of the present application;

FIG. 3 is a quantization computation graph of a GPT model according to an embodiment of the present application;

FIG. 4 is a dequantization computation graph of a GPT model according to an embodiment of the present application;

FIG. 5 is a schematic diagram of quantization partitioning during model-parallel training according to another embodiment of the present application;

FIG. 6 is a schematic structural diagram of a data computing apparatus according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a computing device according to an embodiment of the present application.

In the drawings, the same or corresponding reference numerals denote the same or corresponding parts.

Detailed Description

The principles and spirit of the present application are described below with reference to several exemplary embodiments. It should be understood that these embodiments are given only to enable those skilled in the art to better understand and implement the present application, and do not limit the scope of the present application in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Those skilled in the art will appreciate that the embodiments of the present application may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied entirely in hardware, entirely in software (including firmware, resident software, microcode, etc.), or in a combination of hardware and software.

It should be noted that the following terms are used in the embodiments and drawings of the present application:

float: floating-point data;

FP16/FP32: 16-bit/32-bit floating-point data;

Per-tensor: per tensor;

Per-channel: per channel;

Per-block: per data block.

At present, data computation for models built on neural networks usually has to be performed on 32-bit data, which puts heavy bandwidth pressure on the computation and degrades computing performance.

To this end, an embodiment of the present application provides a data computing method that, with little impact on the computation results, quantizes a limited number of target activation values and target weights that are to take part in matrix multiplications, i.e., converts a limited amount of floating-point data into fixed-point data rather than converting all floating-point data into fixed-point data, thereby reducing the bandwidth pressure of data computation, improving the computing capability of the data computing device, and preserving the precision of the computation results.

The data computing method provided by the embodiments of the present application may be applied in a neural network model implemented on the basis of artificial intelligence. Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning, and decision-making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

In the embodiments of the present application, the AI software technologies mainly involved include the above-mentioned natural language processing and deep learning.

For example, the embodiments may involve Deep Learning within Machine Learning (ML), including various artificial neural networks.

First, the execution entity of the embodiments of the present application is introduced. The data computing method provided by the present application may be executed by a data computing device, on which a neural network model to which the data computing method applies may be deployed. The data computing device may be a server, where the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and AI platforms. The terminal may be a smartphone, a tablet computer, a laptop, a desktop computer, a smart speaker, a smart watch, and the like, but is not limited thereto. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in the present application.

The data computing device may be capable of implementing automatic sentence generation, translation, and other natural language processing technologies.

The data computing device may have Machine Learning (ML) capabilities. ML is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers simulate or realize human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications span all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks.

In the embodiments of the present application, the neural network model to which the above data computing method is applied mainly involves the application of various artificial neural networks, for example, sequence generation through a neural network model.

It should be noted that the embodiments of the present application do not limit the type of model that performs data computation by this method; the model may be of any type. In one possible implementation, the model may be a Recurrent Neural Network (RNN) model.

Next, taking the server as the execution entity and combining an actual application scenario, the data computing method provided by the embodiments of the present application is introduced.

Referring to FIG. 1, which shows a schematic diagram of an application scenario of a data computing method provided by an embodiment of the present application, the application scenario includes a server 101, and the server 101 executes the data computing method provided by the embodiment of the present application.

In the embodiment of the present application, when data computation needs to be performed on input data, the server 101 may input the input data into the model used for data computation, so as to determine, through the model, the output data corresponding to the input data.

The application scenarios of the present application include speech analysis, speech noise reduction, speech translation, text translation, text recognition, sequence generation, and other scenarios.

In a speech analysis scenario, the model may be a speech analysis model, the input data may be speech data to be analyzed, and the determined output data corresponding to the input data may be the data resulting from the speech analysis. In a speech noise reduction scenario, the model may be a speech noise reduction model, the input data may be speech data to be denoised, and the determined output data corresponding to the input data may be the speech data obtained after denoising the speech data to be denoised.

In a sentence translation scenario, the model may be a sentence translation model, the input data may correspond to sentence data in a first language to be translated, and the determined output data corresponding to the input data may be the translated sentence data in a second language. When applied to a sequence generation scenario, the model may be a sequence generation model, the input data may be data including a sequence to be generated, and the determined output data corresponding to the input data is the sequence obtained by performing data computation on the sequence to be generated; further scenarios are not enumerated here.

After the input data is fed into the model, the model processes the input data and obtains the activation values flowing through each layer. An activation value may be an activation value input into a neural network layer or an activation value output by a neural network layer, and a weight may be a weight inherent to a neural network layer. While the activation values flow, the target activation value and the target weight can be obtained and quantized, and the quantized target activation value and target weight then continue into the matrix multiplication they are to take part in.

The principles and spirit of the present application are explained in detail below with reference to several representative embodiments of the present application.

For example, referring to FIG. 1, the server 101 may take the input activation values of the linear superposition layer in the model as the target activation values and obtain the target activation values, which form the one-dimensional sequence [-0.127, 0.126, 0.08, -0.07].

After obtaining the target activation values, the server 101 may quantize them and determine the quantized activation value corresponding to each target activation value. The quantized activation values are fixed-point data; specifically, in this example the target activation values are quantized into integer values.

In this embodiment, the target activation values may be enlarged by a factor of 1000 so as to be quantized into the data range [-127, 127]. The quantized activation values so determined are -127, 126, 8, and -7, giving the one-dimensional sequence [-127, 126, 8, -7].

Forward computation is then performed on the quantized activation values and the quantized weights.

After the computation is completed, the forward computation result may be dequantized according to the quantization scheme of the target activation values or the target weights to obtain the corresponding output result, so as to reduce the effect that quantizing the activation values and weights has on the output of the forward computation. That is, the difference between the result computed from the activation values and the target weights and the result computed from the quantized activation values and the quantized weights is within an allowable range.

In this embodiment, the forward computation result may be scaled back by a factor of 1000 according to the quantization scheme to perform the dequantization. As shown in FIG. 1, after the forward computation is performed on the quantized activation values and the quantized weights, the one-dimensional sequence [102, 103, -76, 8] is obtained; the data in the sequence are dequantized by scaling back by a factor of 1000 to obtain the sequence [0.102, 0.103, -0.76, 0.08], which is recorded as the corresponding output result.
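As a minimal sketch of the scale-up-and-clamp quantization and the matching scale-back dequantization used in this example (the factor of 1000 and the [-127, 127] range come from the text above; the function names are illustrative assumptions):

```python
import numpy as np

SCALE = 1000.0  # enlargement factor from the example above

def quantize_activation(x):
    # Enlarge by the scale factor, round, and clamp into the fixed-point range [-127, 127].
    return np.clip(np.round(np.asarray(x, dtype=np.float32) * SCALE), -127, 127).astype(np.int32)

def dequantize_output(y):
    # Scale the integer forward-computation result back down by the same factor.
    return np.asarray(y, dtype=np.float32) / SCALE

quantized = quantize_activation([-0.127, 0.126])   # -> [-127, 126]
restored = dequantize_output([102, 103])           # -> [0.102, 0.103]
```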

After the model completes the computation on the input data, the output data corresponding to the input data is determined.

In this example, on the premise of affecting the data computation results as little as possible, quantizing the target activation values and the target weights into integer values with fewer bits reduces the bandwidth pressure of data computation and improves the computing capability of the data computing device.

It should be noted that, although the above example only describes in detail how the target activation values are quantized and how the output data is dequantized, in some embodiments of the present application the target weights are also quantized, and the quantized weights and quantized activation values are processed in the same way as the original activation values and weights. For example, if an input target activation value is to be matrix-multiplied with the target weight, then, when computing with the data computing method of the present application, the target activation value and the target weight are quantized separately, the quantized activation value and the quantized weight are matrix-multiplied, and the computation result is used as the corresponding output data, or the computation result is dequantized and then used as the corresponding output data.

It can be understood that the quantization of the corresponding weight is similar to the quantization of the corresponding activation value, i.e., a quantization coefficient is used to process the corresponding weight to obtain the quantized corresponding weight.

The technical solutions of the present application are described below with reference to several specific embodiments.

A method for data computation according to an exemplary embodiment of the present application is described below with reference to FIG. 2 in conjunction with the application scenario of FIG. 1. It should be noted that the above application scenario is shown only for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect. On the contrary, the embodiments of the present application may be applied to any applicable scenario.

Next, taking the server as the above data computing device and a natural language generation scenario as an example, the data computing method provided by the embodiments of the present application is introduced. The above model is deployed on the server; the model may be a model obtained after training is completed, and the data computing method corresponds to the inference process for the model.

Referring to FIG. 2, the data computing method includes:

Step S110: acquiring a target activation value and a target weight, wherein the target activation value and the target weight are to take part in a target matrix multiplication and are both floating-point data within a first preset threshold range.

The data computing method provided by the embodiments of the present application quantizes the target activation values and target weights in the neural network model. The target activation values and target weights to be quantized may be divided into multiple groups, and a group of target activation values and target weights may include multiple target activation values and target weights.

It should be noted that a group of target activation values and target weights should take part in the same matrix multiplication. That is, this embodiment determines whether an activation value is a target activation value according to whether the activation value is to be matrix-multiplied with a weight; similarly, whether a weight is a target weight is determined according to whether the weight is to be matrix-multiplied with an activation value. In addition, if an activation value and a weight are to be matrix-multiplied, the activation value and the weight are determined as a group consisting of a target activation value and a target weight. For example, if activation value 1 is to take part in matrix multiplication 1 with weight 1, then activation value 1 and weight 1 form a group consisting of a target activation value and a target weight.

It can be understood that, since the native activation values and weights in a neural network model are themselves floating-point data, this embodiment does not judge whether the data types of the activation values and weights in the neural network model are floating-point; that is, the activation values and weights in the neural network model are automatically regarded as floating-point data. Those skilled in the art may, according to the actual application scenario, further judge whether the activation values and weights that are to take part in the matrix multiplication are floating-point data, which is not limited in this embodiment.

Floating-point data are real numbers used to represent decimals, and the first preset threshold range may be the threshold range of floating-point data accepted in the prior art, for example 1.8E-308 to 1.8E+308.

It should be noted that floating-point data can express more information, i.e., results computed from floating-point data are more precise; however, computing with floating-point data costs more time and computing resources, which is why floating-point data need to be quantized. Considering that quantizing all floating-point data in the neural network model would greatly degrade the precision of the model's final output, in this embodiment only the target activation values and target weights that meet certain conditions are obtained and quantized.

The reason is that matrix multiplication on floating-point data costs extremely high computing resources and time, i.e., it has very high computational complexity, whereas other multiplications and additions cost comparatively few computing resources. Therefore, quantizing only the weights and activation values that need to take part in matrix multiplications already achieves a good resource-saving effect: after quantization, the model's computation time and resource consumption drop significantly, while the loss of precision remains small.

Based on the above principle, the inventor found that the neural network model includes at least one matrix multiplication in which a weight and an activation value take part. Thus, in one embodiment, the target weight and the target activation value can be determined on the basis of the matrix multiplications, i.e., a matrix multiplication in which an activation value and a weight take part is identified as a target matrix multiplication. Specifically, the target matrix multiplication is one of the following matrix multiplications:

the QKV matrix multiplication of the attention computation layer;

the mapping matrix multiplication of the mapping layer;

the first fully connected matrix multiplication of the feedforward neural network layer;

the second fully connected matrix multiplication of the feedforward neural network layer.

In one embodiment, the generative unsupervised pre-trained model, i.e., the Generative Pre-Training (GPT) model, includes all of the target matrix multiplications listed in the above embodiment.

It should be noted that the GPT model includes multiple stacked transformer decoders (transformer-decoder), each of which has the same structure, for example including a masked multi-head attention layer, a mapping layer, a feedforward neural network layer, a residual connection layer, and a normalization layer; thus, each transformer-decoder includes each of the target matrix multiplications listed in the above embodiment.

Step S120: quantizing the target activation value and the target weight respectively to obtain a quantized activation value corresponding to the target activation value and a quantized weight corresponding to the target weight, wherein the quantized activation value and the quantized weight are both fixed-point data within a second preset threshold range.

As described in the application scenario section, quantizing floating-point data means converting floating-point data into fixed-point data. The conversion formula between floating-point data and fixed-point data is as follows:

x_out = (clamp(round(x / scale + zero_point), quant_min, quant_max) - zero_point) * scale, where x_out denotes the quantized data (the fixed-point value mapped back to the floating-point scale by the final multiplication by scale), x is the floating-point data before quantization, quant_min and quant_max denote the minimum and maximum values that fixed-point data of a given bit width can represent, and scale is the scaling factor (for example, the factor-of-1000 enlargement in the application scenario above).

In the quantization process, the most important step is computing the scale coefficient. The scale coefficient is the ratio of the statistical range of the tensor to the fixed-point range; therefore, how to estimate the range of the tensor is the key to quantization precision: the range must cover as many values as possible while still distinguishing, as precisely as possible, values that are close to each other. Commonly used methods for computing the scale include the max method, the percentile method, and others.

In addition, quantization methods can be divided into symmetric quantization and asymmetric quantization according to whether zero_point is 0. zero_point is the fixed-point value to which the floating-point 0 is mapped; if the floating-point 0 is mapped to the fixed-point 0, the quantization is symmetric, otherwise it is asymmetric. In deep learning, since activation values and weights basically follow a normal distribution with zero mean, the embodiments of the present application adopt symmetric quantization to save computation.
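Putting the conversion formula, the max-based scale, and the symmetric (zero_point = 0) choice together, a quantization step might look like the following sketch; the 8-bit bounds [-127, 127] and the NumPy implementation are assumptions for illustration.

```python
import numpy as np

def compute_scale_max(x, quant_max=127):
    # Max-based scale: ratio of the tensor's statistical range to the fixed-point range.
    return np.abs(x).max() / quant_max

def fake_quant(x, scale, zero_point=0, quant_min=-127, quant_max=127):
    # x_out = (clamp(round(x / scale + zero_point), quant_min, quant_max) - zero_point) * scale
    q = np.clip(np.round(x / scale + zero_point), quant_min, quant_max)
    return (q - zero_point) * scale

# Symmetric quantization (zero_point = 0), as adopted in this application.
x = np.random.randn(4, 4).astype(np.float32)
x_out = fake_quant(x, compute_scale_max(x))
```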

Fixed-point data are data in which the decimal point is agreed to be implicitly located at a certain fixed position, and the second preset threshold range may be the threshold range of fixed-point data accepted in the prior art, for example also 1.8E-308 to 1.8E+308.

To describe in more detail which activation values and weights in the neural network model need to be quantized, how each target activation value and target weight is quantized in the GPT model is introduced below as an example with reference to FIG. 3. Here, "float input" is the floating-point input, i.e., the output activation values of the embedding layer. Since the embedding layer does not involve a matrix multiplication of weights and activation values, quantizing it would bring no performance gain, and it occupies relatively little GPU memory; therefore, this embodiment does not quantize the activation values and weights of the embedding layer, so as to preserve the precision of the model's data computation.

In the attention computation layer, the target activation value is the input data of the attention computation layer, and the target weights include the query weight, key weight, and value weight of the attention computation layer;

quantizing the target weights includes:

quantizing the query weight, the key weight, and the value weight separately along their respective channel dimensions to obtain a quantized query weight corresponding to the query weight, a quantized key weight corresponding to the key weight, and a quantized value weight corresponding to the value weight;

performing the target matrix multiplication with the quantized activation value and the quantized weight includes:

performing matrix multiplications of the quantized activation value with the quantized query weight, the quantized key weight, and the quantized value weight respectively;

dequantizing the result of the target matrix multiplication according to the quantization scheme of the target activation value or the target weight to obtain target output data includes:

dequantizing the results of the three matrix multiplications respectively according to the quantization scheme of the target activation value or the target weight;

computing attention from the three dequantized matrix multiplication results according to a preset rule, as the output data of the attention computation layer.

Referring to FIG. 3, query-scale corresponds to the query quantization coefficient in the transformer-decoder structure. The activation value (query) input to the attention computation layer is matrix-multiplied with each of the three weights of the attention computation layer, namely the query weight Q, the key weight K, and the value weight V (i.e., the QKV multiplication of the attention computation layer; Gemm in FIG. 3 denotes matrix multiplication). As can be seen from FIG. 3, the inputs of these three matrix multiplications are in essence the same activation value; therefore, this embodiment merges the quantization operations on the inputs of the three matrix multiplications into one, reducing the number of quantization steps and preserving quantization precision.

In addition, in some embodiments the attention computation layer may be a masked multi-head attention computation layer, i.e., the attention computation layer includes multiple attention heads. Therefore, to refine the quantization granularity and reduce the quantization error, in one embodiment, if the dimensions of the target weight are (head_num, size_per_head), the attention-head dimension, i.e., size_per_head, is treated as the channel dimension and the target weight is quantized along the channel dimension; that is, the target weight is first divided into target weight sub-blocks according to the attention-head dimension and each sub-block is then quantized separately.
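A minimal sketch of this per-head weight quantization, assuming a weight laid out as (head_num * size_per_head, hidden) with one max-based symmetric scale per head; the layout, axis, and 8-bit range are illustrative assumptions:

```python
import numpy as np

def quantize_per_head(weight, head_num):
    # Split the weight into one sub-block per attention head, then quantize each
    # sub-block with its own max-based symmetric scale.
    blocks = np.split(weight, head_num, axis=0)
    q_blocks, scales = [], []
    for block in blocks:
        scale = np.abs(block).max() / 127.0
        q_blocks.append(np.clip(np.round(block / scale), -127, 127).astype(np.int8))
        scales.append(scale)
    return q_blocks, scales

# Example: 8 heads of size 64 over a hidden size of 512.
w = np.random.randn(8 * 64, 512).astype(np.float32)
q_w, head_scales = quantize_per_head(w, head_num=8)
```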

Since the two matrix multiplications Q*K and attn*V involve no weights (both multiply activation values), quantizing them would not reduce video memory usage; therefore, this embodiment does not quantize the data involved in these two matrix multiplications.

Continuing to refer to FIG. 3, Out_scale, fc1_scale, and fc2_scale correspond, respectively, to the scaling factor of the project matrix multiplication of the decoder structure (that is, the mapping matrix multiplication of the mapping layer), the scaling factor of the fc1 matrix multiplication (that is, the first fully connected matrix multiplication of the feedforward neural network layer), and the scaling factor of the fc2 matrix multiplication (that is, the second fully connected matrix multiplication of the feedforward neural network layer); this embodiment quantizes all of them in the normal way in order to save computing resources and time.

Having described how to determine the target activation value and the target weight, and how to quantize each of them, step S130 is performed next: performing the target matrix multiplication operation using the quantized activation value and the quantized weight.

To further improve computational efficiency, in one embodiment of the present application, quantization processing is performed by a preset fusion operator; the fusion operator includes a quantization fusion operator.

The quantization fusion operator is used to fuse the data calculation that precedes the quantization of the target activation value with the quantization of the target activation value into a single calculation.

Specifically, each target matrix multiplication may correspond to one quantization fusion operator.

The quantization fusion operator corresponding to the QKV matrix multiplication fuses the normalization processing with the quantization of the normalized activation value, for example the layer normalization (layernorm) and the input query scaling factor query-scale in the dashed box in FIG. 3 (corresponding to Quant in FIG. 4).

The quantization fusion operator corresponding to the mapping matrix multiplication fuses the permutation conversion processing with the quantization of the activation value after the permutation conversion, for example the Transpose permutation conversion and the activation-value quantization Quant in the dashed box in FIG. 4.

The quantization fusion operator corresponding to the first fully connected matrix multiplication fuses the normalization processing with the quantization of the normalized activation value, for example the layer normalization (LayerNorm) and the activation-value quantization Quant in the dashed box before the first fully connected matrix multiplication FC1 Gemm in FIG. 4.

The quantization fusion operator corresponding to the second fully connected matrix multiplication fuses the activation processing with the quantization of the activation value after the activation processing, for example the GELU activation before the second fully connected matrix multiplication and the quantization of the activation value according to fc2_scale in the dashed box in FIG. 4.
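For illustration, the arithmetic of such a "LayerNorm + quantize" fusion can be mirrored in NumPy as follows; in an actual deployment this would be one GPU kernel, the scale argument stands in for a calibrated coefficient such as query-scale, and the function name and symmetric int8 scheme are assumptions.

```python
# Minimal NumPy sketch (illustration only) of a "LayerNorm + quantize" fusion:
# the normalization and the quantization of its output are computed in one pass,
# so the fp32 intermediate never has to be written back to memory. In practice
# this would be a single GPU kernel; the function below only mirrors the math.
import numpy as np

def fused_layernorm_quant(x, gamma, beta, scale, eps=1e-5, qmax=127):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    y = (x - mean) / np.sqrt(var + eps) * gamma + beta                # LayerNorm
    return np.clip(np.round(y / scale), -qmax, qmax).astype(np.int8)  # Quant

hidden = 16
x = np.random.randn(4, hidden).astype(np.float32)
gamma, beta = np.ones(hidden, np.float32), np.zeros(hidden, np.float32)
x_q = fused_layernorm_quant(x, gamma, beta, scale=0.05)  # scale plays the role of query-scale
```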

After the target matrix multiplication is performed with the quantized activation value and the quantized weight, in order to ensure the accuracy of the final calculation result, step S140 also needs to be performed: performing inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, the target output data being floating-point data within the first preset threshold range.

The inverse quantization processing is the inverse operation of the quantization processing; for example, if the quantization processing enlarges the data by a factor of 1000, the inverse quantization processing shrinks the data by a factor of 1000.
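A minimal sketch of this inverse relationship, using the factor of 1000 from the example above (a real system would use a calibrated scale coefficient instead):

```python
# Minimal sketch of the inverse relationship between quantization and
# inverse quantization, using the illustrative factor of 1000 from the text.
def quantize(x, factor=1000):
    return round(x * factor)      # floating point -> fixed point

def dequantize(q, factor=1000):
    return q / factor             # fixed point -> floating point

q = quantize(3.14159)             # 3142
x = dequantize(q)                 # 3.142, close to the original value
```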

In addition, since some target weights are quantized along the channel dimension, in some embodiments the quantization mode of the target weight differs from that of the target activation value. In that case, the result of the target matrix multiplication can be inverse-quantized according to either the quantization mode of the target activation value or that of the target weight, as actually required.

Similar to the quantization embodiments, in one embodiment of the present application, inverse quantization processing is also performed through a preset fusion operator;

the fusion operator is an inverse quantization fusion operator;

the inverse quantization fusion operator is used to fuse the inverse quantization processing with the data calculation that follows it into a single calculation.

In one embodiment of the present application, each target matrix multiplication also corresponds to one inverse quantization fusion operator.

The inverse quantization fusion operator corresponding to the QKV matrix multiplication fuses the inverse quantization of the result of the QKV matrix multiplication with the bias-term addition. Specifically, at the QKV multiplication step, once the QKV matrix multiplication has finished, this embodiment fuses the inverse quantization process with the addition of the bias term, for example the inverse quantization deQuant and the query-weight bias addition Qbias in FIG. 4.

The inverse quantization fusion operator corresponding to the mapping matrix multiplication fuses the inverse quantization of the result of the mapping matrix multiplication with the bias-term addition and the residual addition, for example the inverse quantization deQuant, the mapping bias addition Proj Bias, and the residual addition Add input in FIG. 4.

The inverse quantization fusion operator corresponding to the first fully connected matrix multiplication fuses the inverse quantization of the result of the first fully connected matrix multiplication with the bias-term addition and the activation operation, for example the inverse quantization deQuant, the bias addition FC1 Bias&act in FIG. 4, and the GELU activation operation (shown in FIG. 3).

The inverse quantization fusion operator corresponding to the second fully connected matrix multiplication fuses the inverse quantization of the result of the second fully connected matrix multiplication with the bias-term addition and the normalization processing, for example the inverse quantization deQuant, the bias addition FC2 Bias&act, and the normalization Add&Norm in FIG. 4.
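As an illustration of such a dequantization fusion, the sketch below folds inverse quantization, bias addition, and residual addition into one function applied to an int32 Gemm accumulator; the names, scales, and shapes are assumptions, and on a GPU this would be a single epilogue kernel rather than separate passes.

```python
# Minimal NumPy sketch (illustration only) of a dequantization fusion: inverse
# quantization, bias addition and residual addition applied together to the int32
# Gemm accumulator, as a single epilogue rather than separate passes.
import numpy as np

def fused_dequant_bias_residual(acc_i32, x_scale, w_scale, bias, residual):
    y = acc_i32.astype(np.float32) * (x_scale * w_scale)   # deQuant
    return y + bias + residual                             # Proj Bias + Add input

acc = np.random.randint(-1000, 1000, size=(4, 16)).astype(np.int32)
bias = np.zeros(16, dtype=np.float32)
residual = np.random.randn(4, 16).astype(np.float32)
out = fused_dequant_bias_residual(acc, x_scale=0.05, w_scale=0.01,
                                  bias=bias, residual=residual)
```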

In addition, the quantization fusion operator may be obtained by fusing the quantization processing into the original kernel function of the calculation that precedes it; the inverse quantization fusion operator may be obtained by fusing the inverse quantization processing into the original kernel function of the calculation that follows it.

When performing quantized inference, the embodiments of the present application fold the quantization and inverse quantization processes into the preceding and following operators, thereby reducing accesses to video memory and improving performance. Since the weights and their quantization coefficients are known in advance, this embodiment can quantize the weights ahead of time.

Some neural network models, such as large GPT models, have so many parameters that the memory of a single graphics card is insufficient to hold the whole model. To solve this problem of insufficient video memory, the weights can be split into tensors, and the resulting tensor shards can be stored on different graphics cards, so that a larger model can be loaded.
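For illustration, a minimal sketch of such a tensor split follows, assuming K graphics cards and a split along the second axis; the axis and shard count are arbitrary choices made only for the example.

```python
# Minimal NumPy sketch (illustration only): splitting a weight tensor into K shards
# so that each shard can be stored on a different graphics card.
import numpy as np

K = 4                                            # number of graphics cards
w = np.random.randn(1024, 4096).astype(np.float32)
shards = np.split(w, K, axis=1)                  # K tensors of shape (1024, 1024)
# shards[i] would live on device i; each device quantizes and multiplies only its
# own shard, and the partial results are combined afterwards.
```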

Specifically, in one embodiment, if the neural network model is in a parallel training state, the target activation value and the target weight are each quantized according to the parallel training mode of the neural network model.

When the parallel training mode of the neural network model is data parallelism, channel-dimension quantization is performed on the target weights that meet a condition, and tensor-dimension quantization is performed on the target activation values and on the target weights that do not meet the condition. The condition may be that the target weight is a weight output by multiple attention heads.

In one embodiment, performing channel-dimension quantization on the target weights that meet the condition includes:

obtaining the attention-head dimension of the target weight;

dividing the target weight into data blocks according to the attention-head dimension to obtain target weight sub-blocks;

quantizing each target weight sub-block separately.

During model-parallel training, for weight quantization, the data inside a weight channel is usually distributed across different graphics cards, so the statistical range of the data within the channel cannot be obtained normally and the scale coefficient of the weight cannot be computed; for activation values, since different machines compute different parts of the activation value, the overall range of the activation value cannot be collected either, so the activation-value scale coefficient cannot be computed.

To solve the problem that the scales of activation values and weights cannot be collected across machines during model-parallel training, the embodiments of the present application propose a per-block quantization method at data-block granularity.

When the parallel training mode of the neural network model is model parallelism, the target activation value and the target weight are each quantized at the data-block dimension, where the data-block quantization modes of the target activation value and the target weight are different.

In one embodiment, quantizing the target activation value and the target weight at the data-block dimension respectively includes:

obtaining the model-parallel scale and the attention-head dimension of the target weight;

dividing the target activation value into data blocks according to the model-parallel scale to obtain target activation value sub-blocks; and dividing the target weight into data blocks according to the model-parallel scale and the attention-head dimension to obtain target weight sub-blocks;

quantizing each target activation value sub-block and each target weight sub-block separately.

In one embodiment, dividing the target weight into data blocks according to the model-parallel scale and the attention-head dimension includes:

using the product of the attention-head dimension of the target weight and the model-parallel scale as the divisor for data blocking.

Per-block quantization is finer-grained than per-channel quantization. For activation values, the blocking follows the model-parallel scale: assuming the model-parallel size is K and the original activation value contains n data elements, the per-tensor quantization granularity is n and the per-block quantization granularity is n/K. For weights, the blocking likewise follows the model-parallel scale: assuming the model-parallel size is K, the original weight contains n data elements, and the weight has h attention heads, the per-channel quantization granularity is n/h and the per-block quantization granularity is n/(h*K).
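The group sizes above can be restated as a short sketch; the values of n, h, and K are assumptions chosen only to make the arithmetic concrete.

```python
# Group sizes implied by the text, i.e. how many values share one quantization
# scale under each scheme; n, h and K are values assumed only for the arithmetic.
n = 8192          # number of elements in the tensor
h = 16            # number of attention heads (channels) in the weight
K = 4             # model-parallel size

per_tensor_activation = n            # one scale for the whole activation
per_block_activation = n // K        # activation split by the model-parallel size
per_channel_weight = n // h          # one scale per attention head
per_block_weight = n // (h * K)      # weight split by head AND model-parallel size
print(per_tensor_activation, per_block_activation, per_channel_weight, per_block_weight)
# -> 8192 2048 512 128  (a smaller group means a finer quantization granularity)
```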

Referring to FIG. 5, in this figure one rectangle represents the per-tensor quantization process (one rectangle represents one tensor), a rectangle split in two horizontally represents the per-channel quantization process, and a rectangle split in four both horizontally and vertically represents the per-block quantization process.

In FIG. 5, parameters whose names start with quant represent the quantization of activation values, and parameters whose names end with weight represent the quantization of weights.

The choice among the three quantization processes depends on the training method. Under data parallelism, this embodiment quantizes activation values per-tensor and weights per-channel; therefore, under data parallelism, the parameters starting with quant are whole rectangles and the parameters ending with weight are rectangles split in two.

Under model parallelism, the first stage is the model-parallel process of attention. Because of model parallelism, the QKV weights undergo a tensor-split operation, and this split is not along the channel dimension, so it is depicted as two perpendicular dividing lines, turning the rectangle into one divided into four. Quant_out is the attention result computed from the outputs of the three QKV matrix multiplications; in the tensor-split case it is also cut vertically. Likewise, out_weight also needs to be cut vertically, until the attention calculation is completed (that is, the node represented by quant_fc1).

The calculation process of the feedforward neural network layer ffn mirrors the model-parallel process of attention. Under the combined effect of model parallelism and per-channel quantization, FC1_weight and FC2_weight become a per-block quantization process divided into four. The fc1 matrix multiplication result quant_fc2 changes from the original per-tensor quantization to a per-block quantization split vertically into two, until the ffn calculation finishes, that is, the output of the fc2 matrix multiplication, which reverts to the per-tensor quantization process.

For example, for a matrix [[1,2,3,4],[5,6,7,8]], the per-tensor quantization mode quantizes the 8 values of the matrix together; the per-channel quantization mode quantizes [1,2,3,4] as one group and [5,6,7,8] as another; the per-block quantization mode first splits [1,2,3,4] into two data blocks, [1,2] and [3,4], then quantizes [1,2] and [3,4] separately, and then repeats the same operation for [5,6,7,8].
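As a concrete illustration of the three granularities on this example matrix, the following sketch groups the same eight values in the three ways described; the symmetric int8 scheme is an assumption, since the text only specifies the grouping.

```python
# Minimal NumPy sketch (illustration only): the same eight values grouped in the
# three ways described above.
import numpy as np

def quant_group(g, qmax=127):
    scale = float(np.abs(g).max()) / qmax
    return np.round(g / scale).astype(np.int8), scale

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.float32)

q_tensor = quant_group(m)                              # per-tensor: 8 values, 1 scale
q_channel = [quant_group(row) for row in m]            # per-channel: 2 groups, 2 scales
q_block = [quant_group(blk) for row in m               # per-block: 4 groups, 4 scales
           for blk in np.split(row, 2)]
```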

According to the data calculation method of the embodiments of the present application, a limited number of target activation values and target weights that are to participate in matrix multiplication are quantized; that is, a limited amount of floating-point data is converted into fixed-point data rather than all floating-point data being converted into fixed-point data. This reduces the resources and time consumed by the calculation and saves video memory without reducing calculation accuracy, bringing a better experience to users. In addition, in some embodiments, quantization and inverse quantization are fused with the preceding and following operators, further improving computational efficiency. In addition, in some embodiments, data-block-granularity quantization is performed on a model in the parallel training state; because data is quantized at a finer granularity, the quantization error is effectively reduced, cross-machine communication overhead is avoided, and training efficiency is improved.

Having described the method of the exemplary embodiments of the present application, the apparatus for data calculation of the exemplary embodiments of the present application is described next with reference to FIG. 6. The apparatus 60 includes:

an acquisition module 610, configured to acquire a target activation value and a target weight, where the target activation value and the target weight are to be subjected to a target matrix multiplication operation and are both floating-point data within a first preset threshold range;

a quantization module 620, configured to quantize the target activation value and the target weight respectively, to obtain a quantized activation value corresponding to the target activation value and a quantized weight corresponding to the target weight, where the quantized activation value and the quantized weight are both fixed-point data within a second preset threshold range;

a calculation module 630, configured to perform the target matrix multiplication operation using the quantized activation value and the quantized weight;

an inverse quantization module 640, configured to perform inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, the target output data being floating-point data within the first preset threshold range.

In one embodiment of the present application, the apparatus 60 is applied in a neural network model;

where the neural network model includes at least one of the target matrix multiplication operations;

the target matrix multiplication operation is one of the following matrix multiplications:

the QKV matrix multiplication of the attention calculation layer;

the mapping matrix multiplication of the mapping layer;

the first fully connected matrix multiplication of the feedforward neural network layer;

the second fully connected matrix multiplication of the feedforward neural network layer.

In one embodiment of the present application, in the attention calculation layer, the target activation value is the input data of the attention calculation layer, and the target weight includes the query weight, the keyword weight, and the value weight of the attention calculation layer;

the quantization module 620 is further configured to perform quantization processing according to the respective channel dimensions of the query weight, the keyword weight, and the value weight, to obtain a quantized query weight corresponding to the query weight, a quantized keyword weight corresponding to the keyword weight, and a quantized value weight corresponding to the value weight;

the calculation module 630 is further configured to perform matrix multiplication operations with the quantized activation value and each of the quantized query weight, the quantized keyword weight, and the quantized value weight;

the inverse quantization module 640 is further configured to perform inverse quantization processing on the results of the three matrix multiplication operations according to the quantization mode of the target activation value or the target weight, respectively; and

to calculate attention from the three inverse-quantized matrix multiplication results according to a preset rule, as the output data of the attention calculation layer.

In one embodiment of the present application, when the attention calculation layer is a masked multi-head attention calculation layer, the channel dimension of a weight is the attention-head dimension of that weight.

In one embodiment of the present application, quantization processing or inverse quantization processing is performed through a preset fusion operator;

the fusion operator includes a quantization fusion operator and an inverse quantization fusion operator;

the quantization fusion operator is used to fuse the data calculation that precedes the quantization of the target activation value with the quantization of the target activation value;

the inverse quantization fusion operator is used to fuse the inverse quantization processing with the data calculation that follows it.

In one embodiment of the present application, each target matrix multiplication corresponds to one quantization fusion operator and/or one inverse quantization fusion operator;

where the quantization fusion operator corresponding to the QKV matrix multiplication fuses the normalization processing with the quantization of the normalized activation value, and the inverse quantization fusion operator corresponding to the QKV matrix multiplication fuses the inverse quantization of the result of the QKV matrix multiplication with the bias-term addition;

the quantization fusion operator corresponding to the mapping matrix multiplication fuses the permutation conversion processing with the quantization of the activation value after the permutation conversion, and the inverse quantization fusion operator corresponding to the mapping matrix multiplication fuses the inverse quantization of the result of the mapping matrix multiplication with the bias-term addition and the residual addition;

the quantization fusion operator corresponding to the first fully connected matrix multiplication fuses the normalization processing with the quantization of the normalized activation value, and the inverse quantization fusion operator corresponding to the first fully connected matrix multiplication fuses the inverse quantization of the result of the first fully connected matrix multiplication with the bias-term addition and the activation operation;

the inverse quantization fusion operator corresponding to the second fully connected matrix multiplication fuses the inverse quantization of the result of the second fully connected matrix multiplication with the bias-term addition and the normalization processing.

In one embodiment of the present application, if the neural network model is in a parallel training state, the target activation value and the target weight are each quantized according to the parallel training mode of the neural network model;

when the parallel training mode of the neural network model is data parallelism, channel-dimension quantization is performed on the target weights that meet a condition, and tensor-dimension quantization is performed on the target activation values and on the target weights that do not meet the condition;

when the parallel training mode of the neural network model is model parallelism, the target activation value and the target weight are each quantized at the data-block dimension, where the data-block quantization modes of the target activation value and the target weight are different.

In one embodiment of the present application, the quantization module 620 is further configured to obtain the attention-head dimension of the target weight, divide the target weight into data blocks according to the attention-head dimension to obtain target weight sub-blocks, and quantize each target weight sub-block separately.

In one embodiment of the present application, the quantization module 620 is further configured to obtain the model-parallel scale and the attention-head dimension of the target weight; divide the target activation value into data blocks according to the model-parallel scale to obtain target activation value sub-blocks; divide the target weight into data blocks according to the model-parallel scale and the attention-head dimension to obtain target weight sub-blocks; and quantize each target activation value sub-block and each target weight sub-block separately.

In one embodiment of the present application, dividing the target weight into data blocks according to the model-parallel scale and the attention-head dimension includes:

using the product of the attention-head dimension of the target weight and the model-parallel scale as the divisor for data blocking.

According to the data calculation apparatus of the embodiments of the present application, a limited number of target activation values and target weights that are to participate in matrix multiplication are quantized; that is, a limited amount of floating-point data is converted into fixed-point data rather than all floating-point data being converted into fixed-point data. This reduces the resources and time consumed by the calculation and saves video memory without reducing calculation accuracy, bringing a better experience to users. In addition, in some embodiments, quantization and inverse quantization are fused with the preceding and following operators, further improving computational efficiency. In addition, in some embodiments, data-block-granularity quantization is performed on a model in the parallel training state; because data is quantized at a finer granularity, the quantization error is effectively reduced, cross-machine communication overhead is avoided, and training efficiency is improved.

Having described the method and apparatus of the exemplary embodiments of the present application, the computer-readable storage medium of the exemplary embodiments of the present application is described next with reference to FIG. 7, which shows the computer-readable storage medium as an optical disc 70 on which a computer program (that is, a program product) is stored. When run by a processor, the computer program implements the steps recorded in the above method embodiments, for example: acquiring a target activation value and a target weight, where the target activation value and the target weight are to be subjected to a target matrix multiplication operation and are both floating-point data within a first preset threshold range; quantizing the target activation value and the target weight respectively, to obtain a quantized activation value corresponding to the target activation value and a quantized weight corresponding to the target weight, where the quantized activation value and the quantized weight are both fixed-point data within a second preset threshold range; performing the target matrix multiplication operation using the quantized activation value and the quantized weight; and performing inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, the target output data being floating-point data within the first preset threshold range. The specific implementation of each step is not repeated here.

It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or other optical or magnetic storage media, which are not described one by one here.

Having described the method, apparatus, and storage medium of the exemplary embodiments of the present application, the device for data calculation of the exemplary embodiments of the present application is described next with reference to FIG. 8.

FIG. 8 shows a block diagram of an exemplary computing device 80 suitable for implementing the embodiments of the present application; the computing device 80 may be a computer system or a server. The computing device 80 shown in FIG. 8 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.

As shown in FIG. 8, the components of the computing device 80 may include, but are not limited to: one or more processors or processing units 801, a system memory 802, and a bus 803 connecting the different system components (including the system memory 802 and the processing unit 801).

The computing device 80 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computing device 80, including volatile and non-volatile media as well as removable and non-removable media.

The system memory 802 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 8021 and/or a cache memory 8022. The computing device 80 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, a ROM 8023 may be used for reading from and writing to a non-removable, non-volatile magnetic medium (not shown in FIG. 8, commonly referred to as a "hard disk drive"). Although not shown in FIG. 8, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, a DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 803 through one or more data-media interfaces. The system memory 802 may include at least one program product having a set (for example, at least one) of program modules configured to perform the functions of the embodiments of the present application.

A program/utility 8025 having a set (at least one) of program modules 8024 may be stored, for example, in the system memory 802. Such program modules 8024 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 8024 generally perform the functions and/or methods of the embodiments described in the present application.

The computing device 80 may also communicate with one or more external devices 804 (such as a keyboard, a pointing device, and a display). Such communication may take place through an input/output (I/O) interface 805. Moreover, the computing device 80 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 806. As shown in FIG. 8, the network adapter 806 communicates with the other modules of the computing device 80 (such as the processing unit 801) through the bus 803. It should be understood that, although not shown in FIG. 8, other hardware and/or software modules may be used in conjunction with the computing device 80.

The processing unit 801 executes various functional applications and data calculations by running the programs stored in the system memory 802, for example: acquiring a target activation value and a target weight, where the target activation value and the target weight are to be subjected to a target matrix multiplication operation and are both floating-point data within a first preset threshold range; quantizing the target activation value and the target weight respectively, to obtain a quantized activation value corresponding to the target activation value and a quantized weight corresponding to the target weight, where the quantized activation value and the quantized weight are both fixed-point data within a second preset threshold range; performing the target matrix multiplication operation using the quantized activation value and the quantized weight; and performing inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, the target output data being floating-point data within the first preset threshold range. The specific implementation of each step is not repeated here.

It should be noted that although several units/modules or sub-units/modules of the data calculation apparatus are mentioned in the detailed description above, this division is merely exemplary and not mandatory. Indeed, according to the embodiments of the present application, the features and functions of two or more units/modules described above may be embodied in one unit/module; conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by multiple units/modules.

Furthermore, although the operations of the method of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution.

Although the spirit and principles of the present application have been described with reference to several specific embodiments, it should be understood that the present application is not limited to the specific embodiments disclosed, and the division into aspects does not mean that features of these aspects cannot be combined to advantage; such division is merely for convenience of presentation. The present application is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (13)

1. A method of data computation, comprising:
acquiring a target activation value and a target weight, wherein the target activation value and the target weight are to be subjected to target matrix multiplication and are floating point type data within a first preset threshold range;
respectively carrying out quantization processing on the target activation value and the target weight to obtain a quantization activation value corresponding to the target activation value and a quantization weight corresponding to the target weight, wherein the quantization activation value and the quantization weight are both fixed point type data within a second preset threshold range;
performing the target matrix multiplication operation by using the quantization activation value and the quantization weight;
and performing inverse quantization processing on the result of the target matrix multiplication operation according to the quantization mode of the target activation value or the target weight to obtain target output data, wherein the target output data is floating point type data within a first preset threshold range.
2. The data computation method of claim 1, applied in a neural network model;
wherein the neural network model comprises at least one of the target matrix multiplication operations;
the target matrix multiplication operation is one of the following matrix multiplications:
QKV matrix multiplication of the attention calculation layer;
mapping matrix multiplication of a mapping layer;
a first fully-connected matrix multiplication of the feedforward neural network layer;
a second fully-connected matrix multiplication of the feedforward neural network layer.
3. The data calculation method according to claim 2, wherein in the attention calculation layer, the target activation value is input data of the attention calculation layer, and the target weight includes a query weight, a keyword weight, and a value weight of the attention calculation layer;
performing quantization processing on the target weight, including:
respectively carrying out quantization processing according to the channel dimensions of the query weight, the keyword weight and the value weight to obtain a quantization query weight corresponding to the query weight, a quantization keyword weight corresponding to the keyword weight and a quantization value weight corresponding to the value weight;
the performing the target matrix multiplication operation by using the quantization activation value and the quantization weight includes:
performing matrix multiplication operations on the quantization activation value with the quantization query weight, the quantization keyword weight and the quantization value weight, respectively;
performing inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, including:
performing inverse quantization processing on the results of the three matrix multiplication operations according to the quantization modes of the target activation values or the target weights respectively;
and calculating attention according to a preset rule by adopting three matrix multiplication results subjected to inverse quantization, and taking the attention as output data of the attention calculation layer.
4. The data calculation method of claim 3, wherein when the attention calculation layer is a mask multi-head attention calculation layer, a channel dimension of a weight is an attention-head dimension of the weight.
5. The data calculation method according to claim 2, wherein quantization processing or inverse quantization processing is performed by a preset fusion operator;
the fusion operator comprises a quantitative fusion operator and an inverse quantitative fusion operator;
the quantization fusion operator is used for performing fusion calculation on data calculation before quantization processing of a target activation value and quantization processing of the target activation value;
and the inverse quantization fusion operator is used for performing fusion calculation on the data calculation after inverse quantization processing and the inverse quantization processing.
6. The data calculation method of claim 5, wherein each target matrix multiplication corresponds to one quantized fusion operator and/or one inverse quantized fusion operator;
the quantization fusion operator corresponding to the QKV matrix multiplication is used for performing fusion calculation on normalization processing and quantization processing of the activation value after the normalization processing; the inverse quantization fusion operator corresponding to the QKV matrix multiplication is used for performing fusion calculation by adding an offset term and inverse quantization processing of the operation result of the QKV matrix multiplication;
the quantization fusion operator corresponding to the mapping matrix multiplication is used for performing fusion calculation on permutation conversion processing and the quantization processing of the activation value after the permutation conversion processing; the inverse quantization fusion operator corresponding to the mapping matrix multiplication is used for performing fusion calculation on inverse quantization processing, bias term addition and residual addition of the operation result of the mapping matrix multiplication;
the quantization fusion operator corresponding to the first full-connection matrix multiplication is used for performing fusion calculation on normalization processing and the quantization processing of the activation value after the normalization processing; the inverse quantization fusion operator corresponding to the first full-connection matrix multiplication is used for performing fusion calculation on inverse quantization processing, bias term addition and activation operation of the operation result of the first full-connection matrix multiplication;
and the inverse quantization fusion operator corresponding to the second full-connection matrix multiplication is used for performing fusion calculation on the inverse quantization processing, the addition of the bias terms and the normalization processing of the operation result of the second full-connection matrix multiplication.
7. The data calculation method according to any one of claims 2 to 5, wherein if the neural network model is in a parallel training state, the target activation value and the target weight are respectively subjected to quantization processing according to a parallel training mode of the neural network model;
when the parallel training mode of the neural network model is data parallel, carrying out channel dimension quantization processing on the target weight meeting the condition, and carrying out tensor dimension quantization processing on the target activation value and the target weight not meeting the condition;
and when the parallel training mode of the neural network model is model parallel, respectively carrying out data block dimension quantization processing on the target activation value and the target weight, wherein the data-block-dimension quantization modes of the target activation value and the target weight are different.
8. The data computing method of claim 7, wherein the performing quantization processing of channel dimensions on the target weights that meet conditions comprises:
obtaining the attention head dimension of the target weight;
partitioning the target weight into blocks according to the attention head dimension to obtain each target weight sub-block;
and respectively carrying out quantization processing on each target weight subblock.
9. The data calculation method of claim 7, wherein the performing quantization processing on the target activation values and the target weights respectively for data block dimensions comprises:
acquiring the parallel scale of the model and the attention head dimension of the target weight;
performing data partitioning on the target activation value according to the parallel scale of the model to obtain each target activation value sub-block; performing data blocking on the target weight according to the parallel scale of the model and the dimension of the attention head to obtain each target weight sub-block;
and respectively carrying out quantization processing on each target activation value sub-block and each target weight sub-block.
10. The data computing method of claim 7, wherein the data partitioning of the target weights according to the parallel scale of the model and the attention-head dimension comprises:
and taking the product of the attention-head dimension of the target weight and the parallel scale of the model as a divisor for partitioning data.
11. A data computing apparatus, comprising:
the acquisition module is configured to acquire a target activation value and a target weight, wherein the target activation value and the target weight are to be subjected to target matrix multiplication and are floating point type data within a first preset threshold range;
the quantization module is configured to perform quantization processing on the target activation value and the target weight respectively to obtain a quantization activation value corresponding to the target activation value and a quantization weight corresponding to the target weight, wherein the quantization activation value and the quantization weight are both fixed point type data within a second preset threshold range;
a calculation module configured to perform the target matrix multiplication operation using the quantization activation value and the quantization weight;
and the inverse quantization module is configured to perform inverse quantization processing on the result of the target matrix multiplication according to the quantization mode of the target activation value or the target weight to obtain target output data, wherein the target output data is floating point type data within a first preset threshold range.
12. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1-10.
13. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-10 when executing the computer program.
CN202210424489.XA 2022-04-22 2022-04-22 Data computing method, device, storage medium and equipment Active CN114861907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210424489.XA CN114861907B (en) 2022-04-22 2022-04-22 Data computing method, device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN114861907A true CN114861907A (en) 2022-08-05
CN114861907B CN114861907B (en) 2025-08-12

Family

ID=82633868


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116029346A (en) * 2023-02-01 2023-04-28 北京百度网讯科技有限公司 Method, apparatus, device and medium for deep learning model inference
CN117035123A (en) * 2023-10-09 2023-11-10 之江实验室 Node communication method, storage medium and device in parallel training
WO2024067563A1 (en) * 2022-09-27 2024-04-04 杭州海康威视数字技术股份有限公司 Task processing method and apparatus based on model quantization, and device and storage medium
CN118349537A (en) * 2024-04-18 2024-07-16 深圳云天励飞技术股份有限公司 Data processing method, device and equipment
CN118860963A (en) * 2024-09-24 2024-10-29 北京壁仞科技开发有限公司 A cache resource reuse method, device, storage medium and program product
WO2025039718A1 (en) * 2023-08-24 2025-02-27 华为技术有限公司 Processor, floating-point operation unit, and operation method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705696A (en) * 2019-10-11 2020-01-17 百度在线网络技术(北京)有限公司 Quantization and fixed-point fusion method and device for neural network
CN112541159A (en) * 2020-09-30 2021-03-23 华为技术有限公司 Model training method and related equipment
CN112748899A (en) * 2020-06-08 2021-05-04 腾讯科技(深圳)有限公司 Data processing method and related equipment
US20210350210A1 (en) * 2018-07-30 2021-11-11 Intel Corporation Method and apparatus for keeping statistical inference accuracy with 8-bit winograd convolution
CN114065900A (en) * 2020-07-30 2022-02-18 华为技术有限公司 Data processing method and data processing device
CN114169330A (en) * 2021-11-24 2022-03-11 匀熵教育科技(无锡)有限公司 Chinese named entity identification method fusing time sequence convolution and Transformer encoder

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210350210A1 (en) * 2018-07-30 2021-11-11 Intel Corporation Method and apparatus for keeping statistical inference accuracy with 8-bit winograd convolution
CN110705696A (en) * 2019-10-11 2020-01-17 百度在线网络技术(北京)有限公司 Quantization and fixed-point fusion method and device for neural network
CN112748899A (en) * 2020-06-08 2021-05-04 腾讯科技(深圳)有限公司 Data processing method and related equipment
CN114065900A (en) * 2020-07-30 2022-02-18 华为技术有限公司 Data processing method and data processing device
CN112541159A (en) * 2020-09-30 2021-03-23 华为技术有限公司 Model training method and related equipment
CN114169330A (en) * 2021-11-24 2022-03-11 匀熵教育科技(无锡)有限公司 Chinese named entity identification method fusing time sequence convolution and Transformer encoder

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FRESNAIS M, et al.: "Important Requirements for the Selection of Internal Standards during the Development of Desorption/Ionization Assays for Drug Quantification in Biological Matrices - A Practical Example", MOLECULES, vol. 27, no. 3, 21 January 2022 (2022-01-21), pages 1-13 *
QI, Di: "Research on FPGA Acceleration of Convolutional Neural Networks Based on Low-Precision Quantization", Information Science and Technology (信息科技辑), no. 03, 31 March 2020 (2020-03-31), pages 135-416 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024067563A1 (en) * 2022-09-27 2024-04-04 杭州海康威视数字技术股份有限公司 Task processing method and apparatus based on model quantization, and device and storage medium
CN116029346A (en) * 2023-02-01 2023-04-28 北京百度网讯科技有限公司 Method, apparatus, device and medium for deep learning model inference
WO2025039718A1 (en) * 2023-08-24 2025-02-27 华为技术有限公司 Processor, floating-point operation unit, and operation method
CN117035123A (en) * 2023-10-09 2023-11-10 之江实验室 Node communication method, storage medium and device in parallel training
CN117035123B (en) * 2023-10-09 2024-01-09 之江实验室 A node communication method, storage medium and equipment in parallel training
CN118349537A (en) * 2024-04-18 2024-07-16 深圳云天励飞技术股份有限公司 Data processing method, device and equipment
CN118860963A (en) * 2024-09-24 2024-10-29 北京壁仞科技开发有限公司 A cache resource reuse method, device, storage medium and program product

Also Published As

Publication number Publication date
CN114861907B (en) 2025-08-12

Similar Documents

Publication Publication Date Title
CN114861907B (en) Data computing method, device, storage medium and equipment
CN112633419B (en) Small sample learning method and device, electronic equipment and storage medium
US11373117B1 (en) Artificial intelligence service for scalable classification using features of unlabeled data and class descriptors
EP4336378A1 (en) Data processing method and related device
EP4379603A1 (en) Model distillation method and related device
CN110781686B (en) Statement similarity calculation method and device and computer equipment
WO2019154411A1 (en) Word vector retrofitting method and device
CN113239176B (en) Semantic matching model training method, device, equipment and storage medium
US20230196128A1 (en) Information processing method, apparatus, electronic device, storage medium and program product
CN113761868A (en) Text processing method and device, electronic equipment and readable storage medium
CN114722833B (en) A semantic classification method and device
KR20240174856A (en) One-step diffusion distillation via deep equilibrium models
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
CN118504643A (en) Compression method, device, storage medium and program product of neural network model
CN115393633A (en) Data processing method, electronic device, storage medium and program product
CN117237774A (en) A multi-modal emotion analysis method and system based on large model video description
CN116842153A (en) A multimodal sentiment analysis method and system based on feedback feature learning
CN111832699A (en) Computationally Efficient and Expressive Output Layers for Neural Networks
CN115169548A (en) Tensor-based continuous learning method and device
CN119580034A (en) Training method for generating picture description model, picture description generation method, device, equipment, medium and program product
CN118247799B (en) A method for phrase-level localization using text-to-image diffusion model
JP2024519265A (en) Neural network with feedforward spatial transformation units
CN112749557B (en) Text processing model construction method and text processing method
CN114842246B (en) Social media pressure type detection method and device
CN113849592B (en) Text emotion classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant