CN115761851A - Optimization Method of Cosine Optimal Loss Function Based on Global Information

Publication number: CN115761851A (granted as CN115761851B)
Application number: CN202211442334.5A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Active (granted)
Inventors: 魏欣, 毛日强, 张远来, 万欢, 晏斐, 徐健锋
Applicant/Assignee: Nanchang University
Filing/priority date: 2022-11-16
Prior art keywords: class, cosine, loss function, optimal, range
Classification: Image Analysis

Abstract

The present invention proposes an optimization method for a cosine optimal loss function based on global information, comprising: S1. combining the advantages of existing loss functions with several important new properties and applying L2 weight normalization; S2. explicitly following the two objectives of minimizing intra-class variation and maximizing inter-class variation, relying on a new algorithm to learn the cosine similarity between class centers and class edges, and proposing two lightweight versions of the cosine optimal loss function; S3. integrating the two lightweight versions to create the standard version of the cosine optimal loss function. The invention addresses the problem that existing loss functions either do not apply weight and feature normalization or do not explicitly follow the objectives of minimizing intra-class variation and maximizing inter-class variation. Using global information as feedback for face recognition, the proposed cosine optimal loss function is more effective than existing loss functions and achieves state-of-the-art performance.

Description

Optimization Method of Cosine Optimal Loss Function Based on Global Information

Technical Field

The invention relates to the technical fields of artificial intelligence, machine learning, and face recognition, and in particular to an optimization method for a cosine optimal loss function based on global information that can be applied to face recognition.

Background

Convolutional neural networks (CNNs) have shown impressive performance in face recognition, and the loss function plays an important role in this process. To learn highly discriminative features, many different loss functions have been proposed in recent years. The best-performing loss functions for face recognition currently fall into two categories: those based on Euclidean distance and those based on cosine similarity.

The softmax loss can be expressed as:

$$L_S = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}f_i+b_{y_i}}}{\sum_{j=1}^{P}e^{W_j^{T}f_i+b_j}}$$

where $N$ is the batch size, $P$ is the number of classes in the entire training set, $f_i \in \mathbb{R}^d$ is the feature vector of the $i$-th sample, which belongs to class $y_i$, $W_j \in \mathbb{R}^d$ is the $j$-th column of the weight matrix $W$ of the last fully connected layer, and $b_j$ is the bias term of the $j$-th class. Typical Euclidean-distance-based losses include center loss, margin loss, and range loss. All of them add extra penalties for joint supervision with the softmax loss, and all are designed around two objectives: minimizing intra-class variation and maximizing inter-class variation. Both objectives contribute to performance gains. Cosine-similarity-based losses include L-Softmax, A-Softmax, and AM-Softmax; they are derived from the softmax loss by adding extra margin constraints. L2 weight normalization improves performance, although the improvement is quite limited. Feature normalization brings better performance and a cleaner geometric interpretation.

The loss functions proposed so far either do not apply weight and feature normalization (e.g., contrastive loss, triplet loss, center loss, range loss, and margin loss) or do not explicitly follow the two objectives that improve discriminative ability (e.g., L-Softmax, A-Softmax, AM-Softmax, and ArcFace).

Deep neural networks are currently trained by iteratively updating network parameters based on feedback from each mini-batch. This is a practical compromise imposed by two constraints: the computing power and the memory size of GPUs, TPUs, or similar processing units. Without the computing-power constraint, a deep neural network could be trained with the entire training set as the source of feedback, directly optimizing the sample distribution of the whole training set. Without the memory constraint, it could load the entire training set into memory instead of processing it mini-batch by mini-batch. Perhaps precisely because of these two constraints, no existing loss uses the entire dataset as a source of feedback to optimize CNNs for face recognition.

We propose a new loss function: the cosine optimal loss function based on global information. It possesses all four desirable properties, namely optimizing intra-class variation, optimizing inter-class variation, weight normalization, and feature normalization. Moreover, it is guided by the distribution information of the entire training set. Compared with previously proposed loss functions, the cosine optimal loss function is more effective and achieves more advanced performance.

Summary of the Invention

(1) Technical Problem to Be Solved

The loss function plays an important role in convolutional neural networks (CNNs). However, existing loss functions either do not apply weight and feature normalization, or do not explicitly follow the two objectives that improve discriminative ability: minimizing intra-class variation and maximizing inter-class variation. Moreover, all of these functions consider only the feedback from each mini-batch, not the distribution information of the whole training set.

(2) Technical Solution

The cosine optimal loss function based on global information, applicable to face recognition, is constructed through the following steps:

a) The softmax loss is the most commonly used loss function in deep learning and can be expressed as:

$$L_S = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}f_i+b_{y_i}}}{\sum_{j=1}^{P}e^{W_j^{T}f_i+b_j}}$$

where $N$ is the batch size, $P$ is the number of classes in the entire training set, $f_i \in \mathbb{R}^d$ is the feature vector of the $i$-th sample, which belongs to class $y_i$, $W_j \in \mathbb{R}^d$ is the $j$-th column of the weight matrix $W$ of the last fully connected layer, and $b_j$ is the bias term of the $j$-th class.

Fix $b_j = 0$ and $\|W_j\| = 1$ in the softmax loss to apply L2 weight normalization. At the same time, apply L2 normalization to the feature vector $f_i$ and rescale $\|f_i\|$ to $S$, then combine with the AM-Softmax loss. The resulting total loss is $L = L_{AM} + \lambda L_G$, where $S$ is a specified constant, $L_G$ is the proposed cosine optimal loss function, $\lambda$ is a hyperparameter that adjusts the influence of the two losses, and $L_{AM}$ is the AM-Softmax loss.

b) To minimize intra-class variation, a first lightweight version of the cosine optimal loss function (denoted here $L_{G1}$) is proposed. Its defining formula appears only as an image in the source; it is built on the per-class cosine range

$$R(j) = \cos(c_j, e_j)$$

where $P$ is the number of classes in the entire training set, $c_j$ is the center of class $j$, and $e_j$ is the edge of class $j$ (i.e., the sample of class $j$ farthest from its center). $R(j)$ is the cosine range of class $j$, that is, the cosine similarity between the class center and the class edge. We use $W_j$ as an approximate substitute for $c_j$ and propose an algorithm that recursively updates the range of each class.

c) In the algorithm mentioned in step b), $R(j)$ is first initialized to 1 and then updated iteratively for $j = 1, 2, \ldots, P$. The update equations appear only as images in the source; in them, $\varphi(y_i, j) = 1$ when $y_i = j$ and $\varphi(y_i, j) = 0$ otherwise, and $\beta$ is the shrinkage rate, which adjusts how quickly the learned class range shrinks.

The basic idea of the learning algorithm proposed in step b) involves two cases: ① if the cosine similarity between an input sample and its corresponding class center is smaller than the recorded class range, the class range is replaced directly by that cosine similarity; ② conversely, if the cosine similarity between an input sample and its corresponding class center is not smaller than the recorded class range, the class range is shrunk by scaling that cosine similarity with $\beta$. Case ① keeps the learned class range up to date; as training progresses, the true class range becomes smaller and smaller. Case ② helps the learned class range converge to the true value.

d) To maximize inter-class variation, a second lightweight version of the cosine optimal loss function (denoted here $L_{G2}$) is proposed. Its defining formula appears only as an image in the source; in it, $\sum\mathrm{Top}(A, K)$ denotes the sum of the $K$ largest elements of set $A$, and $W_a$ and $W_b$ are the approximate substitutes for the class centers of any two different classes. The purpose of $L_{G2}$ is to find the $K$ pairs of nearest class centers in the entire training set and compute the sum of their distances. Compared with non-adjacent class centers, the classes of adjacent centers are far more likely to have small margins or to overlap; if all adjacent classes are properly separated, non-adjacent classes are separated by an even larger margin. It is therefore unnecessary to consider all center pairs: the most effective approach is to optimize the distances of all adjacent centers. Here the value of $K$ is set to $P$, the number of classes, because when all class centers are arranged in a circle on the hypersphere, the minimum number of adjacent center pairs is $P$.

e) The two lightweight versions proposed in steps b) and d) are integrated to create the standard version of the cosine optimal loss function $L_G$; the combined formula appears only as an image in the source.

(3) Beneficial Effects

The cosine optimal loss function of the present invention combines the advantages of the best loss functions proposed for face recognition in recent years, and makes the first attempt to use global information as feedback for face recognition. It employs a new algorithm to learn the cosine similarity between class centers and class edges. Extensive experiments on the LFW, SLLFW, and YTF datasets demonstrate its effectiveness and show that the cosine optimal loss function achieves state-of-the-art performance.

Detailed Description

The present invention is further described below.

The design method of the cosine optimal loss function based on global information, applicable to face recognition, comprises the following steps:

a) Fix $b_j = 0$ and $\|W_j\| = 1$ in the softmax loss to apply L2 weight normalization. At the same time, apply L2 normalization to the feature vector $f_i$ and rescale $\|f_i\|$ to $S$, then combine with the AM-Softmax loss. The resulting total loss is $L = L_{AM} + \lambda L_G$, where $S$ is a specified constant, $L_G$ is the proposed cosine optimal loss function, $\lambda$ is a hyperparameter that adjusts the influence of the two losses, and $L_{AM}$ is the AM-Softmax loss.

The softmax loss is the most commonly used loss function in deep learning and can be expressed as:

$$L_S = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}f_i+b_{y_i}}}{\sum_{j=1}^{P}e^{W_j^{T}f_i+b_j}}$$

where $N$ is the batch size, $P$ is the number of classes in the entire training set, $f_i \in \mathbb{R}^d$ is the feature vector of the $i$-th sample, which belongs to class $y_i$, $W_j \in \mathbb{R}^d$ is the $j$-th column of the weight matrix $W$ of the last fully connected layer, and $b_j$ is the bias term of the $j$-th class. A code sketch of this normalization scheme is given below.
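For concreteness, the following is a minimal PyTorch sketch of one way to realize step a). It is an illustration, not the patent's implementation: the class and function names are ours, and the scale s = 30 and margin m = 0.35 are common AM-Softmax defaults assumed here, not values fixed by the source.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NormalizedLinear(nn.Module):
        """Last fully connected layer with b_j = 0 and ||W_j|| = 1; with the
        feature f_i also L2-normalized, the logit for class j is cos(W_j, f_i)."""
        def __init__(self, feat_dim, num_classes):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))

        def forward(self, features):
            w = F.normalize(self.weight, dim=1)   # ||W_j|| = 1 (L2 weight norm)
            f = F.normalize(features, dim=1)      # ||f_i|| = 1 (L2 feature norm)
            return f @ w.t()                      # cosine logits, rescaled by s later

    def am_softmax_loss(cos_logits, labels, s=30.0, m=0.35):
        """AM-Softmax: subtract the margin m from the target-class cosine and
        rescale all logits by s (the values of s and m here are illustrative)."""
        target = F.one_hot(labels, num_classes=cos_logits.size(1)).bool()
        logits = s * torch.where(target, cos_logits - m, cos_logits)
        return F.cross_entropy(logits, labels)

    # Total loss, as in the text: L = L_AM + lambda * L_G
    # (L_G is built up in the sketches for the later steps).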

b) To minimize intra-class variation, a first lightweight version of the cosine optimal loss function (denoted here $L_{G1}$) is proposed. Its defining formula appears only as an image in the source; it is built on the per-class cosine range

$$R(j) = \cos(c_j, e_j)$$

where $P$ is the number of classes in the entire training set, $c_j$ is the center of class $j$, and $e_j$ is the edge of class $j$ (i.e., the sample of class $j$ farthest from its center). $R(j)$ is the cosine range of class $j$, that is, the cosine similarity between the class center and the class edge. We use $W_j$ as an approximate substitute for $c_j$ and propose an algorithm that recursively updates the range of each class.

c) In the algorithm mentioned in step b), $R(j)$ is first initialized to 1 and then updated iteratively for $j = 1, 2, \ldots, P$. The update equations appear only as images in the source; in them, $\varphi(y_i, j) = 1$ when $y_i = j$ and $\varphi(y_i, j) = 0$ otherwise, and $\beta$ is the shrinkage rate, which adjusts how quickly the learned class range shrinks.

The basic idea of the learning algorithm proposed in step b) involves two cases: ① if the cosine similarity between an input sample and its corresponding class center is smaller than the recorded class range, the class range is replaced directly by that cosine similarity; ② conversely, if the cosine similarity between an input sample and its corresponding class center is not smaller than the recorded class range, the class range is shrunk by scaling that cosine similarity with $\beta$. Case ① keeps the learned class range up to date; as training progresses, the true class range becomes smaller and smaller. Case ② helps the learned class range converge to the true value. A sketch of this update rule follows.
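Because the exact update equations are given only as images in the source, the sketch below implements the two cases exactly as described above. The blending rule used for case ②, $R(j) \leftarrow (1-\beta)R(j) + \beta\cos(W_j, f_i)$, is one possible reading of "scaling the cosine similarity with β" and is an assumption, as is the illustrative value of β.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def update_class_ranges(R, features, weights, labels, beta=0.01):
        """Recursively update the learned cosine range R(j) of each class.

        R        : tensor of shape (P,), initialized to all ones
        features : (N, d) mini-batch features f_i
        weights  : (P, d) rows W_j of the last FC layer (proxy for the
                   class centers c_j, as in the text)
        labels   : (N,) class indices y_i
        beta     : shrinkage rate (illustrative value, not from the patent)
        """
        cos = F.normalize(features, dim=1) @ F.normalize(weights, dim=1).t()
        for i, j in enumerate(labels.tolist()):
            c = cos[i, j].item()                  # cos(W_{y_i}, f_i)
            if c < R[j]:
                # Case 1: the sample falls outside the recorded range, so the
                # range is replaced directly by its cosine similarity.
                R[j] = c
            else:
                # Case 2 (assumed reading of "scaling with beta"): blend the
                # recorded range toward the observed cosine at rate beta.
                R[j] = (1 - beta) * R[j] + beta * c
        return R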

d) To maximize inter-class variation, a second lightweight version of the cosine optimal loss function (denoted here $L_{G2}$) is proposed. Its defining formula appears only as an image in the source; in it, $\sum\mathrm{Top}(A, K)$ denotes the sum of the $K$ largest elements of set $A$, and $W_a$ and $W_b$ are the approximate substitutes for the class centers of any two different classes. The purpose of $L_{G2}$ is to find the $K$ pairs of nearest class centers in the entire training set and compute the sum of their distances. Compared with non-adjacent class centers, the classes of adjacent centers are far more likely to have small margins or to overlap; if all adjacent classes are properly separated, non-adjacent classes are separated by an even larger margin. It is therefore unnecessary to consider all center pairs: the most effective approach is to optimize the distances of all adjacent centers. Here the value of $K$ is set to $P$, the number of classes, because when all class centers are arranged in a circle on the hypersphere, the minimum number of adjacent center pairs is $P$. A sketch of this term follows.

e) The two lightweight versions proposed in steps b) and d) are integrated to create the standard version of the cosine optimal loss function $L_G$; the combined formula appears only as an image in the source. A sketch of one possible combination follows.
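Since the combined formula appears only as an image, the sketch below assumes a simple additive combination. The intra-class term shown, a hinge pulling each sample's cosine to its class center up toward the tracked range $R(y_i)$, is our differentiable reading of $L_{G1}$, not the patent's exact formula; inter_class_loss is the sketch from step d), and the 1 : w_inter weighting is likewise assumed.

    import torch
    import torch.nn.functional as F

    def cosine_optimal_loss(R, cos_logits, labels, weights, w_inter=1.0):
        """A sketch of the standard version L_G as an additive combination.
        Intra term: hinge that penalizes samples whose cosine to their class
        center falls below the tracked global range R(y_i) (assumed reading
        of L_G1). Inter term: inter_class_loss() from the step d) sketch."""
        n = labels.size(0)
        cos_to_center = cos_logits[torch.arange(n), labels]  # cos(W_{y_i}, f_i)
        intra = F.relu(R[labels] - cos_to_center).mean()
        inter = inter_class_loss(weights)
        return intra + w_inter * inter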

The cosine optimal loss function combines the advantages of the best loss functions proposed for face recognition in recent years, and makes the first attempt to use global information as feedback for face recognition. It employs a new algorithm to learn the cosine similarity between class centers and class edges. Extensive experiments on the LFW, SLLFW, and YTF datasets demonstrate its effectiveness and show that the cosine optimal loss function achieves state-of-the-art performance.
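To show how the pieces fit together, here is a hypothetical training step on random data, relying on the functions defined in the sketches above being in scope. The toy backbone, the tensor sizes, the learning rate, and λ are all illustrative assumptions, not values from the patent.

    import torch
    import torch.nn as nn
    import torch.optim as optim

    torch.manual_seed(0)
    P, d, N, lam = 10, 64, 32, 0.1                 # classes, feat dim, batch, lambda
    backbone = nn.Sequential(nn.Linear(128, d), nn.ReLU(), nn.Linear(d, d))
    head = NormalizedLinear(d, P)                  # from the step a) sketch
    params = list(backbone.parameters()) + list(head.parameters())
    optimizer = optim.SGD(params, lr=0.1)
    R = torch.ones(P)                              # learned class ranges, init to 1

    for step in range(100):
        x = torch.randn(N, 128)                    # stand-in for face images
        labels = torch.randint(0, P, (N,))
        feats = backbone(x)
        cos = head(feats)                          # cosine logits cos(W_j, f_i)
        R = update_class_ranges(R, feats, head.weight, labels, beta=0.01)
        loss = am_softmax_loss(cos, labels) \
             + lam * cosine_optimal_loss(R, cos, labels, head.weight)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()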

Claims (6)

1. An optimization method of a cosine optimal loss function based on global information, characterized by comprising the following steps:
S1. combining the advantages of existing loss functions with several new properties, and applying L2 weight normalization to obtain a total loss function;
S2. explicitly following the two objectives of minimizing intra-class variation and maximizing inter-class variation, learning the cosine similarity between class centers and class edges by means of a new algorithm, and providing two lightweight versions of the cosine optimal loss function;
S3. integrating the two lightweight versions to create the standard version of the cosine optimal loss function.
2. The method for optimizing a cosine optimal loss function based on global information according to claim 1, wherein step S1 specifically comprises:
expressing the softmax loss function as:

$$L_S = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}f_i+b_{y_i}}}{\sum_{j=1}^{P}e^{W_j^{T}f_i+b_j}}$$

where $N$ denotes the batch size, $P$ denotes the number of classes in the entire training set, $f_i \in \mathbb{R}^d$ is the feature vector of the $i$-th sample, which belongs to class $y_i$, $W_j \in \mathbb{R}^d$ is the $j$-th column of the weight matrix $W$ of the last fully connected layer, and $b_j$ is the bias term of the $j$-th class;
fixing $b_j = 0$ and $\|W_j\| = 1$ in the softmax loss to apply L2 weight normalization; simultaneously applying L2 normalization to the feature vector $f_i$ and rescaling $\|f_i\|$ to $S$, $S$ being a specified constant, and combining with the AM-Softmax loss to obtain a total loss $L = L_{AM} + \lambda L_G$;
in the formula, $L_G$ is the proposed cosine optimal loss function, $\lambda$ is the hyperparameter for adjusting the influence of the two losses, and $L_{AM}$ is the AM-Softmax loss.
3. The method for optimizing a cosine optimal loss function based on global information according to claim 1, wherein step S2 comprises:
S21. to minimize intra-class variation, providing a first lightweight version of the cosine optimal loss function (its formula appears only as an image in the source), built on

$$R(j) = \cos(c_j, e_j)$$

where $P$ is the number of classes in the entire training set, $c_j$ is the center of class $j$, $e_j$ denotes the edge of class $j$, and $R(j)$ denotes the cosine range of class $j$, namely the cosine similarity between the class center and the edge of class $j$; using $W_j$ as an approximate substitute for $c_j$, and employing a learning algorithm to recursively update the range of each class;
S22. to maximize inter-class variation, providing a second lightweight version of the cosine optimal loss function (its formula appears only as an image in the source), in which $\sum\mathrm{Top}(A, K)$ denotes the sum of the $K$ largest elements in set $A$, and $W_a$ and $W_b$ are approximate substitutes for the class centers of any two different classes;
the aim of this second version is to find the $K$ pairs of nearest class centers in the entire training set and compute the sum of their distances; the distances of all adjacent centers are optimized, with the value of $K$ set to $P$, where $P$ is the number of classes, since the minimum number of adjacent center pairs is $P$ when all class centers are arranged in a circle on the hypersphere.
4. The method for optimizing a cosine optimal loss function based on global information according to claim 3, wherein step S3 specifically comprises: integrating the two lightweight versions provided in step S2 to create the standard version of the cosine optimal loss function (the combined formula appears only as an image in the source).
5. The method for optimizing a cosine optimal loss function based on global information according to claim 3, wherein the learning algorithm of step S21 is: $R(j)$ is initialized to 1 and then updated iteratively (the update equations appear only as images in the source) for $j = 1, 2, \ldots, P$, where $\varphi(y_i, j) = 1$ when $y_i = j$ and $\varphi(y_i, j) = 0$ otherwise, and $\beta$ is a shrinkage rate for adjusting the shrinkage speed of the learned class range.
6. The method for optimizing a cosine optimal loss function based on global information according to claim 3, wherein the learning algorithm of step S21 involves two cases:
(1) if the cosine similarity between an input sample and its corresponding class center is smaller than the recorded class range, the class range is directly replaced by that cosine similarity;
(2) conversely, if the cosine similarity between an input sample and its corresponding class center is not smaller than the recorded class range, the class range is shrunk by scaling that cosine similarity by $\beta$;
case (1) keeps the learned class range up to date, and the true class range becomes smaller and smaller as training progresses; case (2) helps the learned class range converge to the true value.
CN202211442334.5A (filed 2022-11-16): Optimization method of cosine optimal loss function based on global information. Status: Active; granted as CN115761851B.

Priority Applications (1)

Application Number: CN202211442334.5A; Priority/Filing Date: 2022-11-16; Title: Optimization method of cosine optimal loss function based on global information

Publications (2)

CN115761851A: published 2023-03-07
CN115761851B (grant): published 2025-07-11

Family

Family ID: 85372857

Family Applications (1)

Application Number: CN202211442334.5A (Active); Filing Date: 2022-11-16; Title: Optimization method of cosine optimal loss function based on global information

Country Status (1)

Country: CN; Link: CN115761851B


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190279091A1 (en) * 2018-03-12 2019-09-12 Carnegie Mellon University Discriminative Cosine Embedding in Machine Learning
CN110598603A (en) * 2019-09-02 2019-12-20 深圳力维智联技术有限公司 Face recognition model acquisition method, device, equipment and medium
CN113052261A (en) * 2021-04-22 2021-06-29 东南大学 Image classification loss function design method based on cosine space optimization
CN114627533A (en) * 2022-03-10 2022-06-14 厦门熵基科技有限公司 Face recognition method, face recognition device, face recognition equipment and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐健锋; 何宇凡; 刘斓: "三支决策代价目标函数的关系及推理研究" (Research on the relationship and reasoning of three-way decision cost objective functions), 计算机科学 (Computer Science), 9 July 2018 *


Similar Documents

Publication Publication Date Title
Sucholutsky et al. Soft-label dataset distillation and text dataset distillation
Wang et al. Unsupervised deep clustering via adaptive GMM modeling and optimization
Xu et al. Weighted multi-view clustering with feature selection
CN110032646A (en) The cross-domain texts sensibility classification method of combination learning is adapted to based on multi-source field
CN114022693A (en) A method for clustering single-cell RNA-seq data based on dual self-supervision
Shukla et al. Semi-supervised clustering with neural networks
Liu et al. A comparable study on model averaging, ensembling and reranking in nmt
Oskouei et al. RDEIC-LFW-DSS: ResNet-based deep embedded image clustering using local feature weighting and dynamic sample selection mechanism
CN114444600A (en) A Small-Sample Image Classification Method Based on Memory Augmented Prototype Network
Song et al. Real-world cross-modal retrieval via sequential learning
Cao et al. CircSSNN: circRNA-binding site prediction via sequence self-attention neural networks with pre-normalization
Zhang et al. Effectiveness of scaled exponentially-regularized linear units (SERLUs)
Shi et al. Federated learning with ℓ1 regularization
Wu et al. Robust deep fuzzy K-means clustering for image data
Shi et al. Efficient federated learning with enhanced privacy via lottery ticket pruning in edge computing
Teng et al. Cluster ensemble framework based on the group method of data handling
CN111507263B (en) Face multi-attribute recognition method based on multi-source data
Li et al. Learning from crowds with robust logistic regression
Zhang et al. Transformer-based dynamic fusion clustering network
CN115761851A (en) Optimization Method of Cosine Optimal Loss Function Based on Global Information
Wu et al. Exponential discriminative metric embedding in deep learning
Yang et al. Modulation recognition based on incremental deep learning
Min et al. Bidirectional domain transfer knowledge distillation for catastrophic forgetting in federated learning with heterogeneous data
CN116662834B (en) Fuzzy hyperplane clustering method and device based on sample style characteristics
Nguyen et al. Model fusion of heterogeneous neural networks via cross-layer alignment

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant