CN112070162A

CN112070162A - Multi-class processing task training sample construction method, device and medium

Info

Publication number: CN112070162A
Application number: CN202010936484.6A
Authority: CN
Inventors: 张超; 吴海山; 殷磊
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2020-09-08
Filing date: 2020-09-08
Publication date: 2020-12-11

Abstract

The invention discloses a method, device and medium for constructing training samples for multi-class processing tasks. The method includes: acquiring the real probability distribution of sample data over multiple preset categories, and determining its predicted probability distribution; determining, according to the real probability distribution and the predicted probability distribution, a loss list of the sample data over the multiple preset categories; determining a mask list according to the loss list and the real probability distribution, and determining a positive category loss and multiple negative category losses of the sample data according to the mask list and the loss list; then determining, from these losses, the positive category and the multiple negative categories to which the sample data belongs among the multiple preset categories, and constructing positive and negative samples for each of the multiple preset categories according to the positive category and the multiple negative categories. By constructing positive and negative samples over the multiple preset categories from the positive category loss and multiple negative category losses of multiple pieces of sample data, the invention helps balance the sample data across the preset categories.

Description

Method, equipment and medium for constructing training samples for multi-class processing tasks

Technical Field

The present invention relates to the technical field of financial technology (Fintech), and in particular to a method, device and medium for constructing training samples for multi-class processing tasks.

Background

With the continuous development of financial technology (Fintech), especially Internet-based financial technology, more and more technologies (such as artificial intelligence, big data and cloud storage) are being applied in the financial field. The financial field in turn places higher requirements on these technologies, for example requiring more balanced sample data in artificial intelligence.

At present, for multi-class classification tasks, the data collected for each category is used directly as the sample data of the task. However, the amount of data collected for different categories is usually hard to balance: little data is collected for categories that involve privacy or are unpopular, while much more is collected for popular categories. As a result, when a multi-class classification task is performed on unbalanced samples, accuracy is low for the categories with few samples.

Therefore, how to construct balanced samples in a multi-class classification task, so as to ensure the accuracy of the task, is a technical problem that urgently needs to be solved.

Summary of the Invention

The main purpose of the present invention is to provide a method, device and medium for constructing training samples for multi-class processing tasks, aiming to solve the technical problem in the prior art of how to construct balanced samples in multi-class classification tasks.

To achieve the above object, the present invention provides a method for constructing training samples for multi-class processing tasks, the method including the following steps:

acquiring the real probability distribution of sample data over multiple preset categories, and determining the predicted probability distribution of the sample data over the multiple preset categories;

determining, according to the real probability distribution and the predicted probability distribution, a loss list of the sample data over the multiple preset categories;

determining a mask list according to the loss list and the real probability distribution, and determining a positive category loss and multiple negative category losses of the sample data according to the mask list and the loss list;

determining, according to the positive category loss and the multiple negative category losses, the positive category and the multiple negative categories to which the sample data belongs among the multiple preset categories, and constructing positive and negative samples for each of the multiple preset categories according to the positive category and the multiple negative categories, so that a multi-class classification model is generated from the positive and negative samples of the multiple preset categories for classification.

Optionally, after the step of constructing positive and negative samples for each of the multiple preset categories according to the positive category and the multiple negative categories, the method further includes:

training a preset multi-class model based on the positive and negative samples of the multiple preset categories to generate a multi-class classification model;

when data to be classified is received, classifying the data to be classified based on the multi-class classification model, and determining the category to which the data to be classified belongs.

Optionally, the step of determining the mask list according to the loss list and the real probability distribution includes:

sorting the values in the loss list to obtain a probability sequence, and selecting from the probability sequence the target probabilities ranked in the first preset number of positions;

determining the position of each target probability in the loss list, and updating the loss list according to those positions;

summing the updated loss list and the real probability distribution to generate the mask list.
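The three mask-list steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the text does not fix the sort direction, the form of the update, or how the true category is treated, so the sketch assumes a descending sort (largest losses first), an update that writes 1 at the selected positions and 0 elsewhere, and exclusion of the true category from the ranking. The names `build_mask` and `k` (the preset number of positions) are hypothetical.

```python
def build_mask(loss_list, true_dist, k):
    """Return (mask, top_positions) for one piece of sample data."""
    # Assumption: the true category (value 1 in true_dist) is not ranked,
    # so no position can receive a mask value of 2 after the sum.
    candidates = [i for i, t in enumerate(true_dist) if t != 1]

    # Assumption: sort by loss in descending order and keep the first k positions.
    ranked = sorted(candidates, key=lambda i: loss_list[i], reverse=True)
    top_positions = ranked[:k]

    # Assumption: "updating" the loss list means writing 1 at the selected
    # positions and 0 everywhere else.
    updated = [0.0] * len(loss_list)
    for i in top_positions:
        updated[i] = 1.0

    # Sum the updated loss list and the real probability distribution.
    mask = [u + t for u, t in zip(updated, true_dist)]
    return mask, top_positions
```

With the loss list `[0.1, 2.0, 0.7, 0.05, 1.2]`, the real probability distribution `[0, 1, 0, 0, 0]` and `k = 2`, the sketch returns the mask `[0.0, 1.0, 1.0, 0.0, 1.0]` and the positions `[4, 2]`.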

Optionally, the step of determining the positive category loss and the multiple negative category losses of the sample data according to the mask list and the loss list includes:

multiplying the mask list by the loss list as it was before updating to generate a product result list;

determining the positive category loss from the product result list according to the position of the real probability in the real probability distribution;

determining the multiple negative category losses from the product result list according to the positions of the target probabilities.
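As a minimal sketch of the product and selection steps above, the following Python function takes a mask list, the loss list before updating, the real probability distribution and the selected positions, and returns the positive category loss and the negative category losses. The function name `split_losses` and its parameter names are hypothetical, and the mask is assumed to hold 1 at the true position and at the selected positions and 0 elsewhere.

```python
def split_losses(mask, loss_list, true_dist, negative_positions):
    """Return (positive_loss, negative_losses) for one piece of sample data."""
    # Element-wise product of the mask list and the loss list before updating.
    product = [m * l for m, l in zip(mask, loss_list)]

    # The positive category loss sits at the position of the real probability,
    # taken here as the largest entry of the real probability distribution.
    true_position = true_dist.index(max(true_dist))
    positive_loss = product[true_position]

    # The negative category losses sit at the previously selected positions.
    negative_losses = [product[i] for i in negative_positions]
    return positive_loss, negative_losses
```

For the mask `[0, 1, 1, 0, 1]`, the loss list `[0.1, 2.0, 0.7, 0.05, 1.2]`, the real probability distribution `[0, 1, 0, 0, 0]` and the positions `[4, 2]`, the positive category loss is `2.0` and the negative category losses are `[1.2, 0.7]`; the losses at the unselected positions are zeroed out by the mask.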

Optionally, the step of determining the loss list of the sample data over the multiple preset categories according to the real probability distribution and the predicted probability distribution includes:

determining, according to the real probability distribution and the predicted probability distribution, the real probability value and the predicted probability value of the sample data for each preset category;

calculating, based on a preset loss formula, the loss value of the sample data for each preset category from its real probability value and predicted probability value;

arranging the loss values according to the order of the preset categories among the multiple preset categories to obtain the loss list.
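The per-category loss calculation above can be illustrated with a short Python sketch. The preset loss formula is not fixed at this point in the text, so the sketch substitutes a per-category binary cross-entropy as an assumption; the function name `per_category_losses` and the clipping constant `eps` are likewise hypothetical.

```python
import math


def per_category_losses(true_dist, pred_dist, eps=1e-12):
    """Return one loss value per preset category, in category order."""
    losses = []
    for y, p in zip(true_dist, pred_dist):
        # Clip the predicted probability so the logarithm stays finite.
        p = min(max(p, eps), 1.0 - eps)
        # Assumed formula: binary cross-entropy for this single category.
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return losses
```

Because the two distributions are traversed in the same order, the resulting list keeps the order of the preset categories. A per-category form is chosen here so that categories other than the true one also receive a non-zero loss, which the later ranking of the loss list relies on.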

Optionally, the step of determining the predicted probability distribution of the sample data over the multiple preset categories includes:

acquiring the output values of the sample data corresponding to each of the multiple preset categories, and mapping each of the output values according to a preset function to obtain the predicted probability distribution of the sample data over the multiple preset categories.

Optionally, before the step of acquiring the real probability distribution of the sample data over the multiple preset categories, the method further includes:

acquiring sample data and the sample label corresponding to the sample data, and encoding the sample label based on a preset encoding method to generate the real probability distribution of the sample data over the multiple preset categories.

Optionally, the step of determining, according to the positive category loss and the multiple negative category losses, the positive category and the multiple negative categories to which the sample data belongs among the multiple preset categories includes:

training an initial model according to the positive category loss and the multiple negative category losses, and acquiring the gradient corresponding to the initial model;

when the gradient is less than a preset threshold, completing the training of the initial model, and determining the positive category and the multiple negative categories according to the mapping attributes of the sample data over the multiple preset categories at the time the training of the initial model is completed.

Further, to achieve the above object, the present invention also provides an apparatus for constructing training samples for multi-class processing tasks, the apparatus including:

an acquisition module, configured to acquire the real probability distribution of sample data over multiple preset categories, and determine the predicted probability distribution of the sample data over the multiple preset categories;

a first determination module, configured to determine, according to the real probability distribution and the predicted probability distribution, a loss list of the sample data over the multiple preset categories;

a second determination module, configured to determine a mask list according to the loss list and the real probability distribution, and determine a positive category loss and multiple negative category losses of the sample data according to the mask list and the loss list;

a construction module, configured to determine, according to the positive category loss and the multiple negative category losses, the positive category and the multiple negative categories to which the sample data belongs among the multiple preset categories, and construct positive and negative samples for each of the multiple preset categories according to the positive category and the multiple negative categories, so that a multi-class classification model is generated from the positive and negative samples of the multiple preset categories for classification.

Further, to achieve the above object, the present invention also provides a device for constructing training samples for multi-class processing tasks. The device includes a memory, a processor, and a multi-class processing task training sample construction program stored in the memory and executable on the processor; when executed by the processor, the program implements the steps of the multi-class processing task training sample construction method described above.

Further, to achieve the above object, the present invention also provides a medium on which a multi-class processing task training sample construction program is stored; when executed by a processor, the program implements the steps of the multi-class processing task training sample construction method described above.

In the prior art, unbalanced sample data across categories makes the execution of multi-class classification tasks inaccurate. By contrast, the method, device and medium of the present invention first acquire the real probability distribution of the sample data over multiple preset categories and determine the predicted probability distribution of the sample data over those categories; then determine, according to the real probability distribution and the predicted probability distribution, the loss list of the sample data over the multiple preset categories; then determine the mask list according to the loss list and the real probability distribution; and determine, from the mask list and the loss list, the positive category loss and the multiple negative category losses of the sample data. From the positive category loss and the multiple negative category losses, the positive category and the multiple negative categories to which the sample data belongs among the multiple preset categories are determined, and positive and negative samples of the multiple preset categories are constructed accordingly, so that a multi-class classification model can be generated from them for classification. The mask list reflects the positive probability that the sample data may belong to one of the preset categories and the negative probabilities that it may not belong to certain categories. Filtering the loss list with the mask list yields the positive category loss and the multiple negative category losses, which are used to determine the positive category and the multiple negative categories to which the sample data belongs. Different sample data belong to different positive and negative categories, so each preset category has its own positive samples and multiple negative samples. Constructing positive and negative samples over the multiple preset categories in this way means that every preset category contains positive samples and multiple negative samples, which balances the sample data across the preset categories. This overcomes the imbalance of sample data between categories in prior-art multi-class classification tasks and improves the accuracy of performing such tasks.

Description of Drawings

FIG. 1 is a schematic structural diagram of the device hardware operating environment involved in the embodiments of the multi-class processing task training sample construction device of the present invention;

FIG. 2 is a schematic flowchart of a first embodiment of the multi-class processing task training sample construction method of the present invention;

FIG. 3 is a schematic diagram of the functional modules of a preferred embodiment of the multi-class processing task training sample construction apparatus of the present invention.

The realization of the object, the functional characteristics and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are only intended to explain the present invention, not to limit it.

The present invention provides a multi-class processing task training sample construction device, which includes a risk-control device and at least one consumer device communicatively connected to the risk-control device. Referring to FIG. 1, FIG. 1 is a schematic structural diagram of the device hardware operating environment involved in the embodiments of the multi-class processing task training sample construction device of the present invention.

As shown in FIG. 1, the multi-class processing task training sample construction device may include a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a network interface 1004 and a memory 1005. The communication bus 1002 is used to realize connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as a disk memory. Optionally, the memory 1005 may also be a storage device independent of the processor 1001.

Those skilled in the art will understand that the hardware structure shown in FIG. 1 does not limit the multi-class processing task training sample construction device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

As shown in FIG. 1, the memory 1005, as a medium, may include an operating system, a network communication module, a user interface module and a multi-class processing task training sample construction program. The operating system is a program that manages and controls the device and its software resources, and supports the running of the network communication module, the user interface module, the multi-class processing task training sample construction program and other programs or software. The network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.

In the hardware structure of the multi-class processing task training sample construction device shown in FIG. 1, the network interface 1004 is mainly used to connect to a background server and exchange data with it; the user interface 1003 is mainly used to connect to a client and exchange data with it; and the processor 1001 can call the multi-class processing task training sample construction program stored in the memory 1005 to perform the steps of the construction method set out above.

Further, the processor 1001 can call the same program stored in the memory 1005 to perform the operations that follow the step of constructing positive and negative samples for each of the multiple preset categories according to the positive category and the multiple negative categories.

Further, the processor 1001 can perform the step of determining the mask list according to the loss list and the real probability distribution.

Further, the processor 1001 can perform the step of determining the positive category loss and the multiple negative category losses of the sample data according to the mask list and the loss list.

Further, the processor 1001 can perform the step of determining the loss list of the sample data over the multiple preset categories according to the real probability distribution and the predicted probability distribution.

Further, the processor 1001 can perform the step of determining the predicted probability distribution of the sample data over the multiple preset categories.

Further, the processor 1001 can call the program stored in the memory 1005 to perform the operations that precede the step of acquiring the real probability distribution of the sample data over the multiple preset categories.

Further, the processor 1001 can perform the step of determining, according to the positive category loss and the multiple negative category losses, the positive category and the multiple negative categories to which the sample data belongs among the multiple preset categories.

The specific implementation of the multi-class processing task training sample construction device of the present invention is basically the same as the embodiments of the multi-class processing task training sample construction method described below, and is not repeated here.

The present invention also provides a method for constructing training samples for multi-class processing tasks.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a first embodiment of the multi-class processing task training sample construction method of the present invention.

This embodiment of the present invention provides an embodiment of the multi-class processing task training sample construction method. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order. Specifically, the multi-class processing task training sample construction method in this embodiment includes:

Step S10, acquiring the real probability distribution of sample data over multiple preset categories, and determining the predicted probability distribution of the sample data over the multiple preset categories;

The multi-class processing task training sample construction method in this embodiment constructs, for each category of a multi-class classification task, positive and negative samples comprising positive samples and negative samples. In a multi-class classification task, each sample belongs to exactly one of the multiple categories. The categories involved in the task are set in advance as preset categories, and each piece of sample data is given in advance a label indicating the category to which it belongs. Across the preset categories, the category of the sample data is represented by a real probability distribution, which contains multiple probability values. The category that the label identifies as the category of the sample data takes one probability value, and every other category takes another probability value; together these values form the real probability distribution, so that the probability values indicate the category of the sample data. For example, for C categories [1...i...C], if the sample data belongs to category i, and the category it belongs to is represented by the probability value 1 while the other categories are represented by the probability value 0, the resulting real probability distribution is [0...1...0].

Understandably, the sample data indicates its category in the form of a label, whereas the real probability distribution consists of numerical values, so the label needs to be converted into numerical values to form the real probability distribution. Specifically, before the step of acquiring the real probability distribution of the sample data over the multiple preset categories, the method further includes:

Step a, acquiring sample data and the sample label corresponding to the sample data, and encoding the sample label based on a preset encoding method to generate the real probability distribution of the sample data over the multiple preset categories.

Further, the sample data and the sample label it carries are acquired, and the real sample label of the acquired sample data over the multiple preset categories is encoded using the preset encoding method to obtain the real probability distribution of the sample data over the multiple preset categories. The preset encoding method is preferably the one-hot method; the real probability distribution vector obtained by the one-hot method may be called a one-hot vector. For the above C preset categories, the real probability distribution is a vector of length C in which only the i-th position, corresponding to the real category i, has the value 1, and all other positions have the value 0.
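A minimal Python sketch of the one-hot encoding described above follows. The function name `one_hot` and the use of a zero-based category index are assumptions made for illustration; the text itself numbers the categories 1 to C.

```python
def one_hot(label_index, num_classes):
    """Encode a zero-based category index as a one-hot vector of length num_classes."""
    if not 0 <= label_index < num_classes:
        raise ValueError("label_index must lie in [0, num_classes)")
    # Only the position of the real category holds 1; every other position holds 0.
    return [1.0 if i == label_index else 0.0 for i in range(num_classes)]
```

For example, `one_hot(2, 5)` returns `[0.0, 0.0, 1.0, 0.0, 0.0]`, which corresponds to the third of five preset categories.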

Further, the generated real probability distribution of the sample data over the multiple preset categories is acquired, and the predicted probability distribution of the sample data over the multiple preset categories is determined. The predicted probability distribution represents a prediction for the sample data, giving the likelihood that the sample data belongs to each preset category. Specifically, the step of determining the predicted probability distribution of the sample data over the multiple preset categories includes:

Step b, acquiring the output values of the sample data corresponding to each of the multiple preset categories, and mapping each of the output values according to a preset function to obtain the predicted probability distribution of the sample data over the multiple preset categories.

Furthermore, a model for prediction is set in advance, and the category to which the sample data belongs is predicted by this model to obtain output values corresponding to each of the multiple preset categories. A preset function for mapping the output values numerically is also set in advance. The generated output values are acquired and each is mapped by the preset function to a real number between 0 and 1, and these real numbers form the predicted probability distribution of the sample data over the multiple preset categories. The preset function is preferably the softmax function, which maps each output value to a real number between 0 and 1 to form the predicted probability distribution. The mapped real numbers are normalized so that they sum to 1, which ensures that the probabilities of the sample data over the multiple categories also sum to 1, indicating that the sample data belongs to one of the multiple categories. For example, for the above C preset categories, the resulting predicted probability distribution can be written as [y1, y2...yi...yc].
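The softmax mapping named above can be sketched in a few lines of Python. The function name `softmax` and the subtraction of the maximum output value are illustrative choices; the subtraction is a common numerical safeguard that does not change the result and is not stated in the text.

```python
import math


def softmax(outputs):
    """Map raw model output values to probabilities that sum to 1."""
    # Shifting by the maximum keeps exp() from overflowing; the ratios are unchanged.
    shift = max(outputs)
    exps = [math.exp(v - shift) for v in outputs]
    total = sum(exps)
    return [e / total for e in exps]
```

Each returned value lies between 0 and 1, the values sum to 1, and a larger output value always maps to a larger probability, so the category with the largest output value is also the most probable category.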

步骤S20，根据所述真实概率分布和所述预测概率分布，确定所述样本数据在多个预设类别上的损失列表；Step S20, according to the real probability distribution and the predicted probability distribution, determine the loss list of the sample data in multiple preset categories;

进一步地，在得到样本数据在各个预设类别上的真实概率分布和预测概率分布之后，则可依据该真实概率分布和预测概率分布，来确定样本数据在多个预设类别上的损失列表。通过损失列表体现对样本数据在多个预设类别上预测的误差。具体地，根据真实概率分布和预测概率分布，确定样本数据在多个预设类别上的损失列表的步骤包括：Further, after obtaining the actual probability distribution and the predicted probability distribution of the sample data on each preset category, the loss list of the sample data on the multiple preset categories can be determined according to the actual probability distribution and the predicted probability distribution. The error of predicting sample data on multiple preset categories is represented by the loss list. Specifically, according to the real probability distribution and the predicted probability distribution, the step of determining the loss list of the sample data on multiple preset categories includes:

Step S21: determining, according to the real probability distribution and the predicted probability distribution, the real probability value and the predicted probability value of the sample data on each preset category;

Step S22: calculating, based on a preset loss formula, on the real probability value and the predicted probability value of the sample data on each preset category to obtain the loss value of the sample data on each preset category;

Step S23: arranging the loss values according to the order of each preset category among the plurality of preset categories to obtain the loss list.

Understandably, the real probability distribution contains the real probability value of the sample data on each preset category, and the predicted probability distribution contains the predicted probability value of the sample data on each preset category. For each preset category, the real probability value is read from the real probability distribution and the predicted probability value from the predicted probability distribution. A preset loss formula, such as the cross-entropy formula, is then applied to the real and predicted probability values on each preset category to obtain the loss value of the sample data on that category. For example, for the i-th of the C preset categories, if the value at the i-th position of the real probability distribution is y and the value at the i-th position of the predicted probability distribution is y', then the real probability value of the sample data on the i-th category is y and the predicted probability value is y', and the preset loss formula is applied to y and y' to obtain the loss value of the sample data on the i-th category. The preset loss formula comprises a basic formula and variants of the basic formula; the basic formula is shown as formula (1):

where L denotes the loss value, y denotes the real probability value, and y' denotes the predicted probability value.

Transforming formula (1) yields the variant formulas (2), (3) and (4):

CE(p_t) = -log(p_t) (4);

where CE(p_t) denotes the loss value, expressed in terms of p_t, and p denotes the predicted probability value.
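Formulas (1) to (3) are not reproduced in this text. Given the symbol definitions above (L, y, y', p) and the surviving formula (4), one reading consistent with the standard binary cross-entropy is sketched below. The forms marked as assumed are a reconstruction for orientation only and are not confirmed by the source.

```latex
\begin{align}
L &= -\bigl[\, y \log y' + (1 - y)\log(1 - y') \,\bigr]
  && \text{(1), assumed form} \\
\mathrm{CE}(p, y) &=
  \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases}
  && \text{(2), assumed form} \\
p_t &=
  \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases}
  && \text{(3), assumed form} \\
\mathrm{CE}(p_t) &= -\log(p_t)
  && \text{(4), as given}
\end{align}
```

Under this reading, substituting (3) into (2) gives (4) directly, and (2) is (1) with y' written as p and y restricted to 0 or 1.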

With the above variant formulas, the loss value of the sample data on the i-th preset category is expressed by formula (5):

CE(p_i,t) = -log(p_i,t) (5);

Further, the loss values calculated for the preset categories are arranged in the order of the preset categories, and the resulting sequence is the loss list of the sample data over the plurality of preset categories. The loss list is expressed by formula (6):

SoftmaxCrossEntropyLossList = [-log(p_1,t), …, -log(p_i,t), …, -log(p_c,t)] (6);
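Steps S21 to S23 and formula (6) can be sketched as below. The sketch assumes that p_i,t equals the predicted probability for the labelled category and one minus the predicted probability otherwise; that definition is not given in the surviving text. The function name `loss_list`, the `eps` guard, and the example distributions are likewise illustrative.

```python
import math

def loss_list(true_dist, pred_dist, eps=1e-12):
    """Per-category losses [-log(p_1,t), ..., -log(p_C,t)] in category order."""
    losses = []
    for y, p in zip(true_dist, pred_dist):
        # Assumption: p_t = p for the labelled category, 1 - p otherwise.
        p_t = p if y == 1 else 1.0 - p
        # eps avoids log(0); it is an illustrative choice, not from the source.
        losses.append(-math.log(max(p_t, eps)))
    return losses

true_dist = [0, 0, 1, 0]            # one-hot real probability distribution
pred_dist = [0.1, 0.2, 0.6, 0.1]    # hypothetical predicted distribution
losses = loss_list(true_dist, pred_dist)
```

Because the losses are appended while iterating over the categories in order, position i of the result corresponds to the i-th preset category, as step S23 requires.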

Step S30: determining a mask list according to the loss list and the real probability distribution, and determining a positive category loss and a plurality of negative category losses of the sample data according to the mask list and the loss list;

Understandably, the loss list reflects the size of the error: the larger the error on a preset category, the less likely the sample data is to belong to that category. Negative samples of a preset category can therefore be identified by this likelihood, and the sample data becomes a negative sample of the preset categories to which it is unlikely to belong. After the loss list is generated, it is combined with the real probability distribution to generate a mask list. The mask list contains values derived from the loss list that mark negative samples, and values derived from the real probability distribution that mark the positive sample. Once the mask list is determined, it is used to further process the loss list and produce the positive category loss and the plurality of negative category losses of the sample data. The positive category loss represents the real category to which the sample data belongs in the multi-category classification task, that is, the positive sample; the negative category losses represent the categories to which the sample data cannot belong, that is, the negative samples.

Step S40: determining, according to the positive category loss and the plurality of negative category losses, the positive category and the plurality of negative categories to which the sample data belongs among the plurality of preset categories, and constructing positive and negative samples of the plurality of preset categories according to the positive category and the plurality of negative categories, so that a multi-category classification model can be generated from those positive and negative samples for category classification.

Further, positive and negative samples of the preset categories are constructed from the preset category indicated by the positive category loss and the preset categories indicated by the negative category losses. According to the positive category loss, the category to which the sample data belongs among the preset categories is determined; this is the positive category, and the sample data becomes a positive sample of it. Likewise, according to the negative category losses, the categories to which the sample data cannot belong are determined; these are the negative categories, and the sample data becomes a negative sample of each of them. After a large amount of sample data has been processed in this way, the positive category of each piece of sample data is known, forming the positive samples of each positive category, and the negative categories of each piece of sample data are known, forming the negative samples of each negative category. Each preset category then contains positive samples and multiple negative samples, so the positive and negative samples of every preset category are built from the positive and negative categories, which keeps samples balanced among the preset categories. The preset multi-category model is then trained with the positive and negative samples constructed for each preset category to obtain a multi-category classification model that performs the multi-category classification task; because samples are balanced among the preset categories, the classification is accurate.
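The grouping of many pieces of sample data into per-category positive and negative samples can be sketched as follows. The function name `build_samples`, the tuple layout, and the sample identifiers are hypothetical; the source does not specify a data structure.

```python
from collections import defaultdict

def build_samples(assignments):
    """Group sample ids into positive and negative sets per category.

    `assignments` is a list of
    (sample_id, positive_category, negative_categories) tuples.
    """
    positives = defaultdict(list)
    negatives = defaultdict(list)
    for sample_id, pos_cat, neg_cats in assignments:
        # The sample is a positive sample of the category it belongs to ...
        positives[pos_cat].append(sample_id)
        # ... and a negative sample of every category it cannot belong to.
        for cat in neg_cats:
            negatives[cat].append(sample_id)
    return positives, negatives

# Hypothetical assignments for three samples over categories 0..3.
assignments = [
    ("s1", 0, [1, 2]),
    ("s2", 1, [0, 3]),
    ("s3", 2, [0, 1]),
]
positives, negatives = build_samples(assignments)
```

Each sample contributes one positive entry and several negative entries, which is how every preset category ends up with both kinds of samples.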

Furthermore, the step of determining, according to the positive category loss and the plurality of negative category losses, the positive category and the plurality of negative categories to which the sample data belongs among the plurality of preset categories includes:

Step c1: training an initial model according to the positive category loss and the plurality of negative category losses, and obtaining the gradient corresponding to the initial model;

Step c2: when the gradient is less than a preset threshold, completing the training of the initial model, and determining the positive category and the plurality of negative categories according to the mapping attributes of the sample data over the plurality of preset categories at the time the initial model completes training.

Further, in this embodiment the training samples for the multi-category processing task are constructed dynamically during model training. An initial model preset for the multi-category classification task is trained with the determined positive category loss and negative category losses, and the gradient of the initial model after training is obtained. The gradient is then compared with a preset threshold. If the gradient is less than the threshold, training of the initial model is complete; otherwise training continues until the gradient falls below the threshold. When training completes, the training samples in the initial model have brought it to convergence, so it performs the multi-category classification task well. Each training sample is therefore given a mapping attribute consisting of the category to which it belongs and the categories to which it does not belong, and the positive category and negative categories are determined from this attribute: the category to which the sample data belongs when training completes is the positive category, and the categories to which it does not belong are the negative categories. Positive and negative samples are then divided according to the positive category and the negative categories to construct the positive and negative samples of each preset category, which helps balance samples among the preset categories.
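The stopping rule in steps c1 and c2 can be illustrated with a generic gradient-descent loop. This is only a sketch: a toy two-parameter quadratic stands in for the initial model, the gradient is compared with the threshold through its Euclidean norm (an assumption, since the source does not say how the gradient is measured), and the names `train_until_converged`, `toy_grad`, `lr`, `threshold` and `max_steps` are hypothetical.

```python
def train_until_converged(grad_fn, params, lr=0.1, threshold=1e-4,
                          max_steps=10_000):
    """Gradient descent that stops once the gradient norm is below threshold."""
    for step in range(max_steps):
        grads = grad_fn(params)
        norm = sum(g * g for g in grads) ** 0.5
        if norm < threshold:
            # Gradient is small enough: training is considered complete.
            return params, step, True
        # Otherwise keep training with a plain gradient step.
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, max_steps, False

def toy_grad(params):
    """Gradient of the toy loss (w0 - 3)^2 + (w1 + 1)^2."""
    return [2 * (params[0] - 3.0), 2 * (params[1] + 1.0)]

params, steps, converged = train_until_converged(toy_grad, [0.0, 0.0])
```

In a real setting `grad_fn` would compute the gradient of the combined positive and negative category losses with respect to the model parameters; the `max_steps` cap is a safety measure added here and is not part of the described method.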

Furthermore, after the step of constructing positive and negative samples of the plurality of preset categories according to the positive category and the plurality of negative categories, the method further includes:

Step d1: training a preset multi-category model based on the positive and negative samples of the plurality of preset categories to generate a multi-category classification model;

Step d2: when data to be classified is received, classifying the data to be classified based on the multi-category classification model and determining the category to which the data to be classified belongs.

Furthermore, a preset multi-category model is provided for training, and the multi-category classification model is generated by training it. Specifically, the positive and negative samples constructed for each preset category are passed to the preset multi-category model to train it. The preset multi-category model has a training end condition; when the trained model meets that condition, training ends and the multi-category classification model is generated for category classification.

Further, when data to be classified is received, indicating that a multi-category classification task is required, the multi-category classification model is invoked to classify the data, determine the category to which it belongs, and assign it to that category, so that the data to be classified is classified accurately.

In the prior art, the sample data of each category in a multi-category classification task is unbalanced, which makes the task inaccurate. By contrast, the multi-category processing task training sample construction method of the present invention first acquires the real probability distribution of the sample data over the plurality of preset categories and determines the predicted probability distribution of the sample data over those categories. It then determines the loss list of the sample data over the preset categories from the real and predicted probability distributions, and determines the mask list from the loss list and the real probability distribution. From the mask list and the loss list it determines the positive category loss and the plurality of negative category losses of the sample data, and from these it determines the positive category and the plurality of negative categories to which the sample data belongs among the preset categories. Positive and negative samples of the preset categories are constructed according to the positive category and the negative categories, so that a multi-category classification model can be generated from them for category classification. The mask list reflects the positive probability that the sample data belongs to one of the preset categories and the negative probability that it does not belong to certain categories. Filtering the loss list with the mask list yields the positive category loss and the negative category losses, which determine the positive category and the negative categories of the sample data. Different sample data belong to different positive and negative categories, so each preset category has its own positive samples and multiple negative samples, and sample data is balanced among the preset categories. This overcomes the prior-art defect of unbalanced sample data among categories in multi-category classification tasks and improves the accuracy of performing such tasks.

Further, a second embodiment of the multi-category processing task training sample construction method of the present invention is proposed on the basis of the first embodiment.

The second embodiment of the multi-category processing task training sample construction method differs from the first embodiment in that the step of determining the mask list according to the loss list and the real probability distribution includes:

Step S31: sorting the values in the loss list to obtain a probability sequence, and selecting from the probability sequence the target probabilities arranged in the first preset number of positions;

Step S32: determining the arrangement position of each target probability in the loss list, and updating the loss list according to the arrangement positions;

Step S33: summing the updated loss list and the real probability distribution to generate the mask list.

This embodiment determines the mask list from the loss list and the real probability distribution. Specifically, a sorted operation arranges the values in the loss list in descending order to obtain a probability sequence. A topn operation is then applied to the probability sequence; it returns the first n entries of the sequence, where n is a hyperparameter set in advance according to requirements, serves as the preset number of positions, and is usually set to a value less than or equal to 5. The topn operation thus selects the values in the first preset positions of the probability sequence as the target probabilities. If n is set to 3, the first three values are selected as the target probabilities and form the sub-list [-log(p_j,t), -log(p_k,t), -log(p_l,t)], where j, k and l denote the positions in the loss list from which the three values come. These positions are taken as the arrangement positions, and the loss list is updated according to them. The update sets the value at each arrangement position to one value, such as 1, and the value at every other position to another value, such as 0. For example, with C equal to 10 preset categories and j, k and l at the 3rd, 4th and 8th positions, the loss values at those three positions are the largest in the loss list; setting those three positions to 1 and all others to 0 gives the updated loss list [0, 0, 1, 1, 0, 0, 0, 1, 0, 0].
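The sorted and topn operations followed by the 0/1 update can be sketched as below. The function name `top_n_indicator` and the ten loss values are hypothetical; the values are chosen so that the 3rd, 4th and 8th positions hold the largest losses, matching the example in the text.

```python
def top_n_indicator(losses, n=3):
    """Return a 0/1 list marking the positions of the n largest losses."""
    # Sort positions by loss, largest first, and keep the first n.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    top_positions = set(order[:n])
    # Arrangement positions become 1, every other position becomes 0.
    return [1 if i in top_positions else 0 for i in range(len(losses))]

# Hypothetical losses for C = 10 preset categories.
losses = [0.1, 0.2, 2.5, 1.9, 0.3, 0.7, 0.05, 1.4, 0.15, 0.25]
updated = top_n_indicator(losses, n=3)
# updated == [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
```

Sorting positions rather than values keeps track of where each selected loss came from, which is what step S32 needs.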

It should be noted that the sorting and selection operations performed on the loss list in this embodiment, that is, the sorted operation and the topn operation, can be expressed by formula (7):

SoftmaxCrossEntropyLossList = topn(sorted(SoftmaxCrossEntropyLossList)) (7);

Further, the updated loss list and the real probability distribution are summed to generate the mask list. Besides the values of the updated loss list, the mask list therefore also contains the real probability value from the real probability distribution. For example, for the updated loss list [0, 0, 1, 1, 0, 0, 0, 1, 0, 0] above, if the real probability is at the 6th position of the real probability distribution, then the real probability distribution [0, 0, 0, 0, 0, 1, 0, 0, 0, 0] and the updated loss list [0, 0, 1, 1, 0, 0, 0, 1, 0, 0] are summed to give the mask list [0, 0, 1, 1, 0, 1, 0, 1, 0, 0].
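The summation in step S33 is element-wise, as the following sketch shows using the values from the example above. The function name `build_mask` is hypothetical.

```python
def build_mask(updated_losses, true_dist):
    """Element-wise sum of the updated loss list and the real distribution."""
    return [u + y for u, y in zip(updated_losses, true_dist)]

updated = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]      # updated loss list
true_dist = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]    # real probability distribution
mask = build_mask(updated, true_dist)
# mask == [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
```

Note that if the labelled category itself were among the n largest losses, a plain sum would give 2 at that position; the text does not address that case, so any handling of it would be a separate design decision.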

Furthermore, the step of determining the positive category loss and the plurality of negative category losses of the sample data according to the mask list and the loss list includes:

Step S34: multiplying the mask list by the loss list before updating to generate a product result list;

Step S35: determining the positive category loss from the product result list according to the position of the real probability in the real probability distribution;

Step S36: determining the plurality of negative category losses from the product result list according to the arrangement positions.

Further, the mask list is multiplied by the loss list before updating, and the result is the product result list. The product is element-wise: values at the same position in the two lists are multiplied, for example the first value of the mask list by the first value of the loss list before updating. Because every value in the mask list is either 1 or 0, multiplication by 0 filters out the invalid negative samples in the loss list, while multiplication by 1 keeps the positive sample and the valid negative samples. The product operation can be expressed by formula (8):

FinalLossList = MaskedLossList * SoftmaxCrossEntropyLossList (8);

where FinalLossList denotes the product result list and MaskedLossList denotes the mask list.

Furthermore, after the filtered product result list is generated, the positive category loss is determined from it according to the position of the real probability in the real probability distribution. In the example above, where the real probability is at the 6th position, the value at the 6th position of the product result list is taken as the positive category loss. Likewise, the negative category losses are determined from the product result list according to the arrangement positions of the target probabilities. In the example above, where the target probabilities are at the 3rd, 4th and 8th positions, the values at the 3rd, 4th and 8th positions of the product result list are taken as the negative category losses.
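Steps S34 to S36 can be sketched as follows, continuing the ten-category example. The function name `split_losses` and the loss values are hypothetical, and positions are zero-based in the code, so the 3rd, 4th and 8th positions are indices 2, 3 and 7.

```python
def split_losses(mask, losses, true_dist, top_positions):
    """Filter losses with the mask, then pick positive and negative losses."""
    # Formula (8): element-wise product of mask list and original loss list.
    final = [m * l for m, l in zip(mask, losses)]
    # Positive category loss: the entry at the position of the real probability.
    positive_loss = final[true_dist.index(1)]
    # Negative category losses: the entries at the arrangement positions.
    negative_losses = [final[i] for i in top_positions]
    return final, positive_loss, negative_losses

losses = [0.1, 0.2, 2.5, 1.9, 0.3, 0.7, 0.05, 1.4, 0.15, 0.25]
mask = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
true_dist = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
final, pos, negs = split_losses(mask, losses, true_dist, [2, 3, 7])
# pos == 0.7 and negs == [2.5, 1.9, 1.4]
```

All positions masked with 0 contribute nothing to the product result list, so only the positive category loss and the selected negative category losses remain.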

In this embodiment, the loss list is updated by sorting and selection, and the updated loss list is combined with the real probability distribution to generate the mask list. The mask list then filters the loss list before updating to generate the product result list, from which the positive category loss and the plurality of negative category losses are determined. Positive and negative samples are constructed from the positive category loss and the negative category losses; by processing many pieces of sample data in this way, every preset category comes to contain multiple positive and negative samples, which helps balance the samples of the preset categories.

The present invention also provides a multi-category processing task training sample construction apparatus.

Referring to FIG. 3, FIG. 3 is a schematic diagram of the functional modules of a first embodiment of the multi-category processing task training sample construction apparatus of the present invention. The apparatus includes:

an acquisition module 10, configured to acquire the real probability distribution of sample data over a plurality of preset categories and to determine the predicted probability distribution of the sample data over the plurality of preset categories;

a first determination module 20, configured to determine a loss list of the sample data over the plurality of preset categories according to the real probability distribution and the predicted probability distribution;

a second determination module 30, configured to determine a mask list according to the loss list and the real probability distribution, and to determine a positive category loss and a plurality of negative category losses of the sample data according to the mask list and the loss list;

a construction module 40, configured to determine, according to the positive category loss and the plurality of negative category losses, the positive category and the plurality of negative categories to which the sample data belongs among the plurality of preset categories, and to construct positive and negative samples of the plurality of preset categories according to the positive category and the plurality of negative categories, so that a multi-category classification model can be generated from those positive and negative samples for category classification.

Further, the multi-category processing task training sample construction apparatus includes:

a training module, configured to train a preset multi-category model based on the positive and negative samples of the plurality of preset categories to generate a multi-category classification model;

a classification module, configured to classify data to be classified based on the multi-category classification model when the data is received, and to determine the category to which the data to be classified belongs.

Further, the second determination module 30 also includes:

a sorting unit, configured to sort the values in the loss list to obtain a probability sequence, and to select from the probability sequence the target probabilities arranged in the first preset number of positions;

an updating unit, configured to determine the arrangement position of each target probability in the loss list and to update the loss list according to the arrangement positions;

a summation unit, configured to sum the updated loss list and the real probability distribution to generate the mask list;

a product unit, configured to multiply the mask list by the loss list before updating to generate a product result list;

a first determination unit, configured to determine the positive category loss from the product result list according to the position of the real probability in the real probability distribution;

the first determination unit being further configured to determine the plurality of negative category losses from the product result list according to the arrangement positions.

Further, the first determination module 20 also includes:

a second determination unit, configured to determine, according to the real probability distribution and the predicted probability distribution, the real probability value and the predicted probability value of the sample data on each preset category;

a calculation unit, configured to calculate, based on a preset loss formula, on the real probability value and the predicted probability value of the sample data on each preset category to obtain the loss value of the sample data on each preset category;

an arrangement unit, configured to arrange the loss values according to the order of each preset category among the plurality of preset categories to obtain the loss list.

Further, the acquisition module 10 also includes:

an acquisition unit, configured to obtain the output values of the sample data corresponding to the plurality of preset categories, and to map the output values according to a preset function to obtain the predicted probability distribution of the sample data over the plurality of preset categories.

进一步地，所述多类别处理任务训练样本构建装置还包括：Further, the multi-category processing task training sample construction device further includes:

an encoding module, configured to obtain sample data and a sample label corresponding to the sample data, and to encode the sample label based on a preset encoding method, to generate the true probability distribution of the sample data on the plurality of preset categories.
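The encoding step can be illustrated with a short sketch that assumes the preset encoding method is one-hot encoding; the function name `one_hot` and the example labels are hypothetical.

```python
def one_hot(label, categories):
    """Encode a sample label as a true probability distribution.

    Assumption: one-hot encoding stands in for the "preset encoding method".
    The result has 1 at the label's category and 0 at every other category.
    """
    if label not in categories:
        raise ValueError(f"unknown label: {label!r}")
    return [1 if c == label else 0 for c in categories]
```

For example, with the preset categories `["a", "b", "c"]`, the label `"b"` is encoded as `[0, 1, 0]`.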

Further, the construction module 40 further includes:

a training unit, configured to train an initial model according to the positive category loss and the plurality of negative category losses, and to obtain the gradient corresponding to the initial model;

a third determining unit, configured to complete the training of the initial model when the gradient is less than a preset threshold, and to determine the positive category and the plurality of negative categories according to the mapping attributes of the sample data on the plurality of preset categories when the initial model completes training.
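The stopping rule used by the training unit and the third determining unit can be sketched generically. The sketch assumes plain gradient descent and compares the Euclidean norm of the gradient with the threshold; the optimizer, the learning rate `lr`, the step limit `max_steps` and the callback `grad_fn` are hypothetical and are not taken from the embodiments.

```python
import math


def train_until_converged(grad_fn, params, lr=0.1, threshold=1e-6,
                          max_steps=10_000):
    """Run gradient descent until the gradient norm drops below threshold.

    grad_fn(params) must return the gradient of the training loss (here,
    the positive plus negative category losses) with respect to params.
    Returns the final parameters and the number of update steps taken.
    """
    for step in range(max_steps):
        grads = grad_fn(params)
        norm = math.sqrt(sum(g * g for g in grads))
        if norm < threshold:
            # Gradient is below the preset threshold: training is complete.
            return params, step
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, max_steps
```

As a usage example, minimizing `(w - 3) ** 2` with `grad_fn = lambda p: [2 * (p[0] - 3)]` from the starting point `[0.0]` stops once the parameter is close to 3.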

The specific implementations of the multi-class processing task training sample construction apparatus of the present invention are substantially the same as the embodiments of the multi-class processing task training sample construction method described above, and are not repeated here.

In addition, an embodiment of the present invention further provides a medium.

A multi-class processing task training sample construction program is stored on the medium, and when executed by a processor, the program implements the steps of the multi-class processing task training sample construction method described above.

The medium of the present invention may be a computer medium, and its specific implementations are substantially the same as the embodiments of the multi-class processing task training sample construction method described above, and are not repeated here.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the specific embodiments described above, which are merely illustrative and not restrictive. Guided by the present invention, and without departing from its spirit or the scope protected by the claims, those of ordinary skill in the art may devise many further forms. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, falls within the protection of the present invention.

Claims

1. A multi-category processing task training sample construction method, wherein the multi-category processing task training sample construction method comprises the following steps:

acquiring the true probability distribution of sample data on a plurality of preset categories, and determining the predicted probability distribution of the sample data on the plurality of preset categories;

determining a loss list of the sample data on the plurality of preset categories according to the true probability distribution and the predicted probability distribution;

determining a mask list according to the loss list and the true probability distribution, and determining a positive category loss and a plurality of negative category losses of the sample data according to the mask list and the loss list;

determining, according to the positive category loss and the plurality of negative category losses, a positive category and a plurality of negative categories to which the sample data belongs among the plurality of preset categories, and constructing, according to the positive category and the plurality of negative categories, positive and negative samples of the plurality of preset categories respectively, so as to generate, based on the positive and negative samples of the plurality of preset categories, a multi-category classification model for category classification.

2. The method for constructing training samples for multi-category processing tasks according to claim 1, wherein after the step of constructing positive and negative samples of the plurality of preset categories respectively according to the positive category and the plurality of negative categories, the method further comprises:

training a preset multi-category model based on the positive and negative samples of the plurality of preset categories to generate the multi-category classification model;

when data to be classified is received, classifying the data to be classified based on the multi-category classification model, and determining the category to which the data to be classified belongs.

3. The multi-class processing task training sample construction method according to claim 1, wherein the step of determining the mask list according to the loss list and the true probability distribution comprises:

sorting the values in the loss list to obtain a probability sequence, and selecting from the probability sequence the target probabilities ranked in the first preset number of positions;

determining the arrangement position of each target probability in the loss list, and updating the loss list according to the arrangement positions;

summing the updated loss list and the true probability distribution to generate the mask list.

4. The method for constructing training samples for multi-class processing tasks according to claim 3, wherein the step of determining the positive class loss and the plurality of negative class losses of the sample data according to the mask list and the loss list comprises:

performing a product operation on the mask list and the loss list before updating to generate a product result list;

determining the positive class loss from the product result list according to the position of the true probability in the true probability distribution;

determining the plurality of negative class losses from the product result list according to the arrangement positions.

5. The method for constructing training samples for multi-class processing tasks according to claim 1, wherein the step of determining the loss list of the sample data on the plurality of preset classes according to the true probability distribution and the predicted probability distribution comprises:

determining, according to the true probability distribution and the predicted probability distribution, the true probability value and the predicted probability value of the sample data on each preset category;

applying a preset loss formula to the true probability value and the predicted probability value of the sample data on each preset category, to obtain the loss value of the sample data on each preset category;

arranging the loss values according to the order of the corresponding preset categories among the plurality of preset categories, to obtain the loss list.

6. The multi-class processing task training sample construction method according to any one of claims 1-5, wherein the step of determining the predicted probability distribution of the sample data on the plurality of preset classes comprises:

obtaining the output values of the sample data corresponding to each of the plurality of preset categories, and mapping each of the output values according to a preset function, to obtain the predicted probability distribution of the sample data on the plurality of preset categories.

7. The method for constructing training samples for multi-category processing tasks according to any one of claims 1 to 5, wherein before the step of acquiring the true probability distribution of sample data on a plurality of preset categories, the method further comprises:

obtaining sample data and a sample label corresponding to the sample data, and encoding the sample label based on a preset encoding method, to generate the true probability distribution of the sample data on the plurality of preset categories.

8. The multi-class processing task training sample construction method according to any one of claims 1-5, wherein the step of determining, according to the positive class loss and the plurality of negative class losses, the positive category and the plurality of negative categories to which the sample data belongs among the plurality of preset categories comprises:

training an initial model according to the positive class loss and the plurality of negative class losses, and obtaining the gradient corresponding to the initial model;

when the gradient is less than a preset threshold, completing the training of the initial model, and determining the positive category and the plurality of negative categories according to the mapping attributes of the sample data on the plurality of preset categories when the training of the initial model is completed.

9. A multi-class processing task training sample construction device, characterized in that the multi-class processing task training sample construction device comprises a memory, a processor, and a multi-class processing task training sample construction program, wherein the multi-class processing task training sample construction program, when executed by the processor, implements the steps of the multi-class processing task training sample construction method according to any one of claims 1-8.

10. A medium, characterized in that a multi-category processing task training sample construction program is stored on the medium, and the multi-category processing task training sample construction program, when executed by a processor, implements the steps of the multi-category processing task training sample construction method according to any one of claims 1-8.