CN111126278B

CN111126278B - A Method of Optimizing and Accelerating Object Detection Models for Few-Category Scenes

Info

Publication number: CN111126278B
Application number: CN201911350732.2A
Authority: CN
Inventors: 王洪波; 陈岩
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2019-12-24
Filing date: 2019-12-24
Publication date: 2023-06-20
Anticipated expiration: 2039-12-24
Also published as: CN111126278A

Abstract

The invention relates to the technical field of computer vision, in particular to a method for optimizing and accelerating a target detection model for few types of scenes, which comprises the following steps: acquiring and labeling a picture to be detected; adding the marked pictures into a feature extraction device to perform feature extraction to obtain three groups of feature images with fixed sizes; sending the three groups of feature images with fixed sizes to a predicting device with a FocalLoss loss function for result prediction; filtering the prediction result by using a detection frame, so that only one detection frame outputs for the same object to be detected, wherein the characteristic extraction process comprises the following steps: compressing network DarkNet, alexNet, resNet, VGG, googLeNet, SENet, denseNet, adjusting the size of the picture to N x N, wherein N is a multiple of 32 between 320 and 1280, and extracting the characteristics of the picture by adopting the compressed characteristic extraction network. The invention can simultaneously have higher detection speed and higher accuracy.

Description

A Method of Optimizing and Accelerating Object Detection Models for Few-Category Scenes

技术领域technical field

本发明涉及计算机视觉技术领域，尤其涉及一种少类别场景的目标检测模型优化与加速的方法。The invention relates to the technical field of computer vision, in particular to a method for optimizing and accelerating an object detection model in a scene with few categories.

背景技术Background technique

目标检测任务作为计算机视觉领域中的一项重要应用，其主要目标是从图像中找到所有的目标物体，并且给出其分类信息与位置信息。The target detection task is an important application in the field of computer vision. Its main goal is to find all target objects from the image and give their classification information and location information.

基于传统的模式识别方法通常需要数十秒的时间来处理一张图片，而基于卷积神经网络的目标检测算法已经可以在几百甚至是几十毫秒完成对一张高分辨率图片的目标检测工作，而且在精度方面也有了很大的提升。Based on traditional pattern recognition methods, it usually takes tens of seconds to process a picture, while the target detection algorithm based on convolutional neural network can already complete the target detection of a high-resolution picture in hundreds or even tens of milliseconds. Work, but also has a great improvement in accuracy.

神经网络的迅速发展，使得人们脱离了手工设计特征的方法，转向了通过使用大量的数据集训练深度卷积神经网络的方法，让网络自适应学习特征。The rapid development of neural networks has made people move away from the method of manually designing features, and turned to the method of training deep convolutional neural networks by using a large number of data sets, so that the network can learn features adaptively.

现有基于卷积神经网络的目标检测技术方案主要分为以下三类：Existing target detection technology solutions based on convolutional neural networks are mainly divided into the following three categories:

第一类：双阶段检测器，其主要特点是分两个步骤来确定物体的位置信息和分类信息。首先挑选出感兴趣区域，在早期的RCNN和Fast-RCNN网络中，通过选择性搜索来确定感兴趣区域，而发展到Faster-RCNN的时候，发明了区域生成网络以确定感兴趣区域，使得整个目标检测网络成为了端到端的结构。在第二个阶段，卷积神经网络对第一阶段提取出的感兴趣区域的坐标位置进行进一步的纠正，将所有预测结果通过非极大值抑制算法后，输出最终结果，得到图片中物体的精确位置和坐标，可以看出此类方法的特点是结果更加准确，但是其相应的时间开销也比较大。The first type: two-stage detector, its main feature is to determine the position information and classification information of the object in two steps. First select the region of interest. In the early RCNN and Fast-RCNN networks, the region of interest was determined by selective search. When developing to Faster-RCNN, the region generation network was invented to determine the region of interest, so that the entire The object detection network becomes an end-to-end structure. In the second stage, the convolutional neural network further corrects the coordinate position of the region of interest extracted in the first stage, and after passing all the prediction results through the non-maximum value suppression algorithm, the final result is output, and the object in the picture is obtained. Precise position and coordinates, it can be seen that this type of method is characterized by more accurate results, but its corresponding time overhead is also relatively large.

第二类：单阶段检测器，如YOLO，SSD，其主要特点只需要一个步骤就可以直接得到物体的位置信息和分类信息，区别于双阶段检测器，网络直接在最后一层的特征图中输出最终的预测结果，对于最后一层特征图的每个点，输出相对于预先设置好的锚点的位置偏移，直接得出每个物体的位置信息，同样的，最后的结果也需要通过非极大值抑制算法，由于此类方法只有一个阶段，所以速度要比双阶段检测器快很多，但是准确率相对较低。The second category: single-stage detectors, such as YOLO, SSD, its main feature is that it only needs one step to directly obtain the location information and classification information of the object, different from the two-stage detector, the network is directly in the feature map of the last layer Output the final prediction result. For each point of the last layer feature map, output the position offset relative to the preset anchor point, and directly obtain the position information of each object. Similarly, the final result also needs to pass The non-maximum suppression algorithm, since this type of method has only one stage, is much faster than the two-stage detector, but the accuracy is relatively low.

第三类：无锚点的方法，此类方法的特点是不需要预先设置锚点，直接对每个物体的四个边界进行回归，目前在精度上可以达到和双阶段检测器相当的水平，但是速度普遍较慢，实用性较差。The third category: methods without anchor points. The characteristic of this type of method is that it does not need to set anchor points in advance, and directly regresses the four boundaries of each object. At present, the accuracy can reach a level equivalent to that of a two-stage detector. However, the speed is generally slow and the practicability is poor.

发明内容Contents of the invention

本发明就是针对现有技术要么检测速度快但准确率低、要么准确率高但检测速度慢的缺陷，提供一种针对少类别场景的目标检测模型优化与加速的方法，同时具有较高的检测速度和较高的准确率。The present invention aims at the defects of the prior art that the detection speed is fast but the accuracy rate is low, or the accuracy rate is high but the detection speed is slow, and provides a method for optimizing and accelerating the target detection model for a few-category scene, and has a high detection rate. speed and high accuracy.

本发明是采用以下技术方案实现的：The present invention is realized by adopting the following technical solutions:

一种针对少类别场景的目标检测模型优化与加速的方法，包括以下步骤：A method for optimizing and accelerating a target detection model for a scene with few categories, comprising the following steps:

获取并标注待检测图片；Obtain and label the image to be detected;

将标注好的图片，加入到特征提取装置进行特征提取，得到的三组固定大小的特征图；The marked pictures are added to the feature extraction device for feature extraction, and three sets of fixed-size feature maps are obtained;

将三组固定大小的特征图送入到带有FocalLoss损失函数的预测装置进行结果预测；Send three sets of fixed-size feature maps to the prediction device with FocalLoss loss function for result prediction;

对预测结果进行检测框的过滤，使得对于同一待检测物体，只有一个检测框输出；Filter the detection frame of the prediction result so that only one detection frame is output for the same object to be detected;

所述的进行特征提取过程包括：对特征提取网络进行压缩，将图片大小调整至分辨率为N×N，N的可选取值为320-1280之间的32的倍数，采用压缩后的特征提取网络进行图片的特征提取。The feature extraction process includes: compressing the feature extraction network, adjusting the size of the picture to a resolution of N×N, the selectable value of N is a multiple of 32 between 320-1280, and using the compressed feature The extraction network performs feature extraction of pictures.

优选的，所述的对特征提取网络进行压缩是针对所有的卷积层，将其卷积核的数量缩减为原来的X倍，X的可选取值为X＝{2,4,6,8}。Preferably, the compression of the feature extraction network is to reduce the number of convolution kernels to the original X times for all convolution layers, and the optional value of X is X={2,4,6, 8}.

优选的，所述的特征提取网络是DarkNet网络、AlexNet网络、ResNet网络、VGG网络、GoogLeNet网络、SENet网络或DenseNet网络中的一种。Preferably, the feature extraction network is one of DarkNet network, AlexNet network, ResNet network, VGG network, GoogLeNet network, SENet network or DenseNet network.

优选的，在所述的FocalLoss损失函数中增加一个自适应模块，具体为：Preferably, an adaptive module is added in the FocalLoss loss function, specifically:

对于每张图片的每一个预测结果，计算交叉熵损失，累加所有真实标注为前景的损失，记为M，累加所有真实标注为背景的损失，记为N；For each prediction result of each picture, calculate the cross entropy loss, accumulate all the losses that are actually marked as the foreground, denoted as M, and accumulate all the losses that are truly marked as the background, denoted as N;

计算log(N/M)，记为S，如果S小于零，则S置为零，如果是第一次迭代，令S₀的值为S；Calculate log(N/M), record it as S, if S is less than zero, then S is set to zero, if it is the first iteration, let the value of S ₀ be S;

根据S的值，计算α的值，α＝1/(2^S+0.2)；According to the value of S, calculate the value of α, α=1/(2 ^S +0.2);

根据S和S₀的值计算γ的值，计算公式为γ＝min(S₀-S,2)。The value of γ is calculated according to the values of S and S ₀ , and the calculation formula is γ=min(S ₀ −S,2).

优选的，所述的检测框的过滤包括以下步骤：Preferably, the filtering of the detection frame comprises the following steps:

步骤1：首先得到网络的输出检测框坐标信息集合B＝{b₁，b₂...b_n}，置信度集合为S＝[s₁，s₂...s_n}，其中b_t，s_t对应同一检测框坐标信息和置信度；Step 1: First obtain the network output detection frame coordinate information set B={b ₁ , b ₂ ...b _n }, and the confidence set is S=[s ₁ , s ₂ ...s _n }, where b _t , s _t corresponds to the same detection frame coordinate information and confidence;

步骤2：从S中取出置信度最大的元素，其下标记为i；Step 2: Take out the element with the highest confidence from S, and mark it as i;

步骤3：根据下标i得到s_i和b_i，并且将s_i和b_i分别从集合S和B中删除；Step 3: Get s _i and b _i according to subscript i, and delete s _i and b _i from sets S and B respectively;

步骤4：遍历集合B中剩余检测框，将所有与b_i的交并比大于0.5的检测框的下标加入到集合T中，如果T不为空集，则将b_i和s_i加入到结果集D和F中，同时从S和B集合中删除下标在集合T中的元素，如果T为空集，则直接从S和B集合中删除下标在集合T中的元素；Step 4: Traverse the remaining detection frames in the set B, and add the subscripts of all detection frames with an intersection ratio greater than 0.5 with _bi to the set T. If T is not an empty set, add bi _and s _i to In the result sets D and F, delete the elements whose subscripts are in the set T from the S and B sets at the same time, if T is an empty set, directly delete the elements whose subscripts are in the set T from the S and B sets;

步骤5：循环步骤2-4，直到集合B为空集；Step 5: Repeat steps 2-4 until set B is an empty set;

步骤6：返回最终结果集合D和F。Step 6: Return the final result sets D and F.

步骤1：首先得到网络的输出检测框坐标信息集合B＝{b₁，b₂...b_n}，置信度集合为S＝{s₁，s₂...s_n}，其中b_t，s_t对应同一检测框坐标信息和置信度；Step 1: First obtain the network output detection frame coordinate information set B={b ₁ , b ₂ ...b _n }, and the confidence set is S={s ₁ , s ₂ ...s _n }, where b _t , s _t corresponds to the same detection frame coordinate information and confidence;

步骤4：遍历集合B中剩余检测框，将所有与检测框b_i交并比大于0.5的检测框所对应的坐标信息和置信度分别加入到集合

和/>

中，对于集合/>

和集合/>

中的每个元素，以检测框的置信度为权重，使用检测框的坐标信息对检测框b_i进行加权，得到/>

然后从集合S中删除集合/>

中元素，从集合B中删除集合/>

中元素；Step 4: traverse the remaining detection frames in the set B, and add the coordinate information and confidence corresponding to all detection frames with an intersection ratio greater than 0.5 with the detection frame b _i to the set respectively

and />

in, for the set />

and collection />

For each element in , the confidence of the detection frame is used as the weight, and the coordinate information of the detection frame is used to weight the detection frame _bi to obtain />

Then delete the set /> from the set S

middle element, remove the set from set B />

middle element;

步骤5：将

和s_i加入到结果集D和F中；Step 5: Put

and s _i are added to the result sets D and F;

步骤6：循环步骤2-5，直到集合B为空集；Step 6: Repeat steps 2-5 until set B is an empty set;

步骤7：返回最终结果集合D和F。Step 7: Return the final result sets D and F.

由于本发明采用了压缩的特征提取网络、在FocalLoss损失函数中增加一个自适应模块，并对检测框的过滤进行了优化，因此可以同时具有较高的检测速度和较高的准确率。Since the present invention uses a compressed feature extraction network, adds an adaptive module to the FocalLoss loss function, and optimizes the filtering of the detection frame, it can simultaneously have higher detection speed and higher accuracy.

附图说明Description of drawings

图1为本发明面针对少类别场景的目标检测模型优化与加速的方法的流程示意图。FIG. 1 is a schematic flowchart of a method for optimizing and accelerating a target detection model for a scene with few categories according to the present invention.

具体实施方式Detailed ways

下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

参见图1表示了本发明实施例提供的一种针对少类别场景的目标检测模型优化与加速的方法和装置的流程示意图，该方法包括以下步骤：Referring to FIG. 1, it shows a schematic flowchart of a method and device for optimizing and accelerating a target detection model for a scene with few categories provided by an embodiment of the present invention. The method includes the following steps:

S100，获取并标注待检测图片。S100. Acquire and label the picture to be detected.

S110，将标注好的待检测图片加入到精简的特征提取装置进行特征提取，最终得到三组特定大小的特征图，作为预测装置的输入。S110, adding the marked picture to be detected to the streamlined feature extraction device for feature extraction, finally obtaining three sets of feature maps of a specific size as input to the prediction device.

S120，图片经过特征提取装置提取好特征之后，将得到的三组固定大小的特征图后，送入到具有自适应FocalLoss损失函数的预测装置进行结果预测。S120, after the features of the picture are extracted by the feature extraction device, the obtained three sets of fixed-size feature maps are sent to the prediction device with an adaptive FocalLoss loss function for result prediction.

S130，在得到网络的预测结果之后，不能直接将结果作为输出，因为结果中对于同一个待检测目标，往往会输出多个检测框，因此后处理装置主要作用是检测框的过滤，使得对于同一待检测物体，只有一个检测框输出。S130, after obtaining the prediction result of the network, the result cannot be directly used as an output, because in the result, for the same object to be detected, multiple detection frames are often output, so the main function of the post-processing device is to filter the detection frame, so that for the same object to be detected For the object to be detected, only one detection frame is output.

本发明实施例的装置设有三个组成部件：精简的特征提取装置，具有自适应FocalLoss损失函数的预测装置和后处理装置。The device of the embodiment of the present invention has three components: a simplified feature extraction device, a prediction device with an adaptive FocalLoss loss function, and a post-processing device.

下面针对上述每个步骤和装置分别进行详细描述：Each of the above-mentioned steps and devices is described in detail below:

待识别图片即包含少类别待检测目标的目标场景图片，首先确定好需要检测的类别，如行人，车辆，动物，桌子，凳子等。The picture to be recognized is the picture of the target scene containing a few categories of targets to be detected. First, determine the categories to be detected, such as pedestrians, vehicles, animals, tables, stools, etc.

在确定好类别之后，根据此来对图片进行标注，标注规则为：使用边平行于图片的最小的矩形框，将物体全部包含在内，一般来讲，每一个矩形框都可以用四个自由度的向量表示，一种表现形式为(x1，x2，y1，y2)其中x1，y1代表矩形标注框的左上角点坐标，x2，y2代表矩形标注框的右下角点坐标；另一种表现形式为(x，y，w，h)，其中x，y代表矩形标注框的中心点坐标，w和h分别代表矩形标注框的宽和高。显然这两种坐标表示形式可以很容易的进行转换。After the category is determined, the picture is marked according to this. The marking rule is: use the smallest rectangular frame whose sides are parallel to the picture to include all the objects. Generally speaking, each rectangular frame can be used with four free Degree vector representation, one form of expression is (x1, x2, y1, y2), where x1, y1 represent the coordinates of the upper left corner of the rectangular label box, x2, y2 represent the coordinates of the lower right corner of the rectangular label box; another expression The form is (x, y, w, h), where x, y represent the coordinates of the center point of the rectangular label box, and w and h represent the width and height of the rectangular label box, respectively. Obviously these two coordinate representations can be easily converted.

S110，将标注好的待检测图片，加入到精简的特征提取装置进行特征提取，最终得到三组特定大小的特征图，作为预测装置的输入。S110, adding the marked picture to be detected to the streamlined feature extraction device for feature extraction, and finally obtaining three groups of feature maps of specific sizes as the input of the prediction device.

精简的特征提取装置负责提供一个精简的图片特征提取网络，同时具有较少的模型参数，以及较快的运行速度。The streamlined feature extraction device is responsible for providing a streamlined image feature extraction network with fewer model parameters and faster operation speed.

特征提取网络主要由是卷积层和池化层组成，比较常见的特征提取网络如：AlexNet、VGG、GoogLeNet、ResNet、SENet、DenseNet、DarkNet等，对于精度要求较高的场景，多采用ResNet作为特征提取网络，而对于速度要求较高的网络，则可以采用DarkNet进行图片的特征提取，如YOLO单阶段检测器。本装置应用为现实场景，因此采用DarkNet53的网络结构作为本装置特征提取网络的基础结构。The feature extraction network is mainly composed of a convolutional layer and a pooling layer. Common feature extraction networks such as: AlexNet, VGG, GoogLeNet, ResNet, SENet, DenseNet, DarkNet, etc. For scenes with high precision requirements, ResNet is often used as the Feature extraction network, and for networks with high speed requirements, DarkNet can be used for feature extraction of pictures, such as YOLO single-stage detector. The application of this device is a real scene, so the network structure of DarkNet53 is used as the basic structure of the feature extraction network of this device.

DaekNet53具有53个卷积层，并且这些卷积层全部由1×1,3×3的卷积核构成，每一个卷积模块的后面都会跟接上一个Batch Normalization层，负责在训练中加快网络收敛速度；和一个LeakyReLU层作为激活函数，增加网络的非线性性，特征提取网络一共有5次下采样过程，每次下采样之后，都由1-8个残差模块组成，其中，残差模块是由一个卷积核大小为3×3大小的卷积模块和一个卷积核大小为1×1大小的卷积模块共同组成，并且加入跳跃连接后共同组成残差模块。使用残差模块的目的是即使网络深度加深，也可以很容易的收敛。DaekNet53 has 53 convolutional layers, and these convolutional layers are all composed of 1×1, 3×3 convolution kernels. Each convolution module will be followed by a Batch Normalization layer, which is responsible for speeding up the network during training. Convergence speed; and a LeakyReLU layer as an activation function to increase the nonlinearity of the network. The feature extraction network has a total of 5 downsampling processes. After each downsampling, it consists of 1-8 residual modules. Among them, the residual The module is composed of a convolution module with a convolution kernel size of 3×3 and a convolution module with a convolution kernel size of 1×1, and a residual module is formed after adding skip connections. The purpose of using the residual module is that even if the depth of the network is deepened, it can be easily converged.

由于普通的DarkNet网络结构在最后一个残差模块会输出通道数为1024的特征，这对于只需要预测较少类别的目标检测场景是存在冗余的，因此我们做出如下改动，在精度不变或略有损失的前提下，减少网络的计算量，提高特征提取装置的运行速度。Since the ordinary DarkNet network structure will output features with a channel number of 1024 in the last residual module, this is redundant for target detection scenarios that only need to predict fewer categories, so we make the following changes without changing the accuracy On the premise of a slight loss or a slight loss, the calculation amount of the network is reduced, and the operating speed of the feature extraction device is improved.

针对所有的卷积层，将其卷积核的数量缩减为原来的X倍，X的可选取值为X＝{1/2,1/4,1/6,1/8}，当X＝1/2时，缩减后的特征提取子网络参数量下降为原始的1/2，计算量下降为原始的1/4倍，可以看出，随着X取值的不断减小，特征提取子网络的计算量和参数量不断下降，网络的运行速度不断加快，根据装置对于实时性的不同要求，可以按需选择不同的X参数。For all convolution layers, the number of convolution kernels is reduced to X times the original, and the optional value of X is X={1/2,1/4,1/6,1/8}, when X = 1/2, the parameter quantity of the reduced feature extraction sub-network is reduced to 1/2 of the original, and the calculation amount is reduced to 1/4 of the original. It can be seen that as the value of X continues to decrease, the feature extraction The amount of calculations and parameters of the sub-network is continuously decreasing, and the operation speed of the network is continuously accelerating. According to the different requirements of the device for real-time performance, different X parameters can be selected as needed.

很明显可以看出，除了DarkNet，其他比较常见的特征提取网络如：AlexNet、VGG、GoogLeNet、ResNet、SENet、DenseNet等，也可以上述手段进行压缩，以加快网络运行速度。It can be clearly seen that in addition to DarkNet, other common feature extraction networks such as: AlexNet, VGG, GoogLeNet, ResNet, SENet, DenseNet, etc., can also be compressed by the above means to speed up network operation.

将图片大小调整至分辨率为N×N，N的可选取值为320-1280之间的32的倍数，如果输入图片的尺寸是416×416像素，在图片经过特征提取装置之后，会经过5次2倍的下采样，特征图的尺寸分别为：208×208、104×104、52×52、26×26、13×13，在后面接上特征金字塔结构进行多个尺度目标的预测，我们使用后三层的特征图进行多尺度预测，即在特征提取网络特征图大小为52×52、26×26、13×13三处后面分别连接小型的特征转换网络。除特征图大小的区别之外，该三处特征图后面所连接的三个分支结构类似。Adjust the size of the picture to a resolution of N×N, and the selectable value of N is a multiple of 32 between 320-1280. If the size of the input picture is 416×416 pixels, after the picture passes through the feature extraction device, it will pass through 5 downsampling by 2 times, the size of the feature map is: 208×208, 104×104, 52×52, 26×26, 13×13, followed by a feature pyramid structure to predict multiple scale targets, We use the feature maps of the last three layers for multi-scale prediction, that is, connect a small feature conversion network after the feature map sizes of 52×52, 26×26, and 13×13 in the feature extraction network. Except for the difference in the size of the feature maps, the three branch structures connected behind the three feature maps are similar.

S120，图片经过特征提取装置提取好特征之后，将得到的三组固定大小的特征图，送入到具有自适应FocalLoss损失函数的预测装置进行结果预测。S120, after the features of the picture are extracted by the feature extraction device, the obtained three sets of fixed-size feature maps are sent to the prediction device with an adaptive FocalLoss loss function for result prediction.

具有自适应FocalLoss损失函数的预测装置：让网络可以动态的，自动的进行超参数的调整，省略训练网络的超参数调整过程，使得装置使用者无需掌握深度学习的知识就可以训练出质量很高的模型。Prediction device with adaptive FocalLoss loss function: Allows the network to dynamically and automatically adjust hyperparameters, omitting the hyperparameter adjustment process of training the network, so that device users can train high-quality models without mastering deep learning knowledge. model.

本预测装置由三部分组成，分别为：小型的特征转换网络、预测检测目标分类及坐标的分支以及损失函数层。The prediction device consists of three parts, namely: a small feature conversion network, a branch for predicting and detecting object classification and coordinates, and a loss function layer.

小型的特征转换网络由五个卷积层组成，卷积核的大小分别为1×1、3×3、1×1、3×3、1×1，这样设计的好处是：瓶颈结构具有较小的计算量，并且可以充分的学习到特征空间的变换。The small feature conversion network consists of five convolutional layers, and the sizes of the convolution kernels are 1×1, 3×3, 1×1, 3×3, and 1×1. The advantage of this design is that the bottleneck structure has a relatively The amount of calculation is small, and the transformation of the feature space can be fully learned.

预测检测目标分类及坐标的分支，连接在小型的特征转换网络之后。可选的，可以在这两部分结构之间增加一个由3×3、1×1的卷积核组成的瓶颈结构，进一步加深预测装置的深度。此分支的特征图大小和经过特征提取网络输出的尺寸相同，即特征提取网络输出特征图尺寸为13×13时，接在其后面的预测检测目标分类及坐标的分支特征图尺寸也为13×13，The branch of predicting and detecting object classification and coordinates is connected after a small feature transformation network. Optionally, a bottleneck structure composed of 3×3 and 1×1 convolution kernels can be added between these two structures to further deepen the depth of the prediction device. The size of the feature map of this branch is the same as the size of the output feature extraction network, that is, when the output feature map size of the feature extraction network is 13×13, the branch feature map size of the prediction detection target classification and coordinates connected behind it is also 13× 13,

具体的通道数计算方法如下：The specific channel number calculation method is as follows:

确定待识别目标类别总数N，一般N小于等于5，根据公式C＝(N+5)×3计算出输出层特征图的通道数C。其中5的含义是：一个输出代表前景背景预测，另外四个输出代表坐标预测，3代表每个输出层，负责预测三种尺度的目标。Determine the total number N of target categories to be identified, generally N is less than or equal to 5, and calculate the channel number C of the output layer feature map according to the formula C=(N+5)×3. The meaning of 5 is: one output represents foreground background prediction, the other four outputs represent coordinate prediction, and 3 represents each output layer, which is responsible for predicting targets of three scales.

对于预测检测目标分类及坐标的分支特征图中的每一个点，包含三个尺度的预设锚点，在尺寸为13×13的特征图中，预设锚点尺寸分别为116×90、156×198、373×326，预设锚点数量为13×13×3，主要负责大尺寸目标的预测；在尺寸为26×26的特征图中，预设锚点尺寸分别为30×61、62×45、59×119，预设锚点数量为26×26×3，主要负责中等尺寸目标的预测；在尺寸为52×52的特征图中，预设锚点尺寸分别为10×13、16×30、33×23，预设锚点数量为52×52×3，主要负责小尺寸目标的预测。For each point in the branch feature map that predicts and detects the target classification and coordinates, it contains preset anchor points of three scales. In the feature map with a size of 13×13, the preset anchor point sizes are 116×90 and 156 respectively. ×198, 373×326, the number of preset anchor points is 13×13×3, which is mainly responsible for the prediction of large-scale targets; in the feature map with a size of 26×26, the preset anchor point sizes are 30×61, 62 ×45, 59×119, the number of preset anchor points is 26×26×3, which is mainly responsible for the prediction of medium-sized targets; in the feature map with a size of 52×52, the preset anchor point sizes are 10×13 and 16 respectively ×30, 33×23, the number of preset anchor points is 52×52×3, which is mainly responsible for the prediction of small-sized targets.

预测结果由三部分组成，根据预测检测目标分类及坐标的分支通道数的计算公式C＝(N+5)×3，对于每一个预测框，由N+5维向量构成，首先其中1维确定目标是否属于前景；如果是，说明该预测框包含待检测目标，后面的预测结果才有意义，后面的4维向量代表预测检测框的坐标，其后的N维向量中的最大值代表该目标的预测分类结果；如果否，说明该预测框为图片的背景区域。The prediction result is composed of three parts. According to the calculation formula C=(N+5)×3 of the number of branch channels for predicting the detection target classification and coordinates, for each prediction frame, it is composed of N+5 dimensional vectors. First, 1 dimension is determined Whether the target belongs to the foreground; if it is, it means that the prediction frame contains the target to be detected, and the subsequent prediction results are meaningful. The following 4-dimensional vector represents the coordinates of the predicted detection frame, and the maximum value in the subsequent N-dimensional vector represents the target The predicted classification result of ; if not, it means that the predicted frame is the background area of the picture.

损失函数层的作用是计算网络模型的预测值与真实值之间的距离，并根据反向传播算法对网络模型参数进行优化，因此损失函数层连接在预测检测目标分类及坐标的分支之后。损失函数层的输入由两部分组成：预测结果以及真实标注值。The function of the loss function layer is to calculate the distance between the predicted value of the network model and the real value, and optimize the parameters of the network model according to the back propagation algorithm. Therefore, the loss function layer is connected after the branch of predicting and detecting the target classification and coordinates. The input of the loss function layer consists of two parts: the prediction result and the real label value.

在计算损失函数时，共需要计算三部分的损失值；When calculating the loss function, a total of three parts of the loss value need to be calculated;

第一步根据公式(1)，如果预测目标为前景区域，则计算预测值与真实值坐标的损失，使用的是均方误差损失，在宽w和高h处计算根号的目的是减小宽和高的预测结果对整个损失函数的贡献，突出目标中心点坐标x，y预测正确的重要性。The first step is according to the formula (1), if the predicted target is the foreground area, then calculate the loss of the predicted value and the real value coordinates, using the mean square error loss, the purpose of calculating the root sign at the width w and height h is to reduce The contribution of the width and height prediction results to the entire loss function highlights the importance of the correct prediction of the target center point coordinates x, y.

第二步根据公式(2)、(3)计算是否包含有目标的损失，使用的是自适应FocalLoss损失函数，首先，当该预测点的真实值为前景即y＝1时，p_t＝p。由于图片中背景区域要远大于前景区域，因此前背景区域的比例十分不均衡，所以本装置的损失函数在传统的交叉熵基础上，需要增加可以调整前背景损失权重比例的系数α，α越大，前景区域损失所占权重越大；同时，在网络训练的过程中，简单目标预测的误差值会越来越小，为了使得网络可以学习到困难目标的预测，损失函数需要增加可以增加困难样本损失权重的系数γ，γ取值越大，困难样本损失所占权重越大。这样设计的好处是可以增加网络对于困难样本的学习能力，使得预测结果更多的包含真实值，提高检测结果的召回率。In the second step, according to the formulas (2) and (3), it is calculated whether there is a target loss, using the adaptive FocalLoss loss function. First, when the actual value of the predicted point is the foreground, that is, y=1, p _t =p . Since the background area in the picture is much larger than the foreground area, the proportion of the foreground and background areas is very unbalanced, so the loss function of this device needs to increase the coefficient α that can adjust the weight ratio of the foreground and background loss on the basis of the traditional cross entropy. Larger, the greater the weight of the foreground area loss; at the same time, in the process of network training, the error value of simple target prediction will become smaller and smaller. In order for the network to learn the prediction of difficult targets, the loss function needs to be increased to increase the difficulty The coefficient γ of the sample loss weight, the larger the value of γ, the greater the weight of the difficult sample loss. The advantage of this design is that it can increase the learning ability of the network for difficult samples, make the prediction results contain more real values, and improve the recall rate of the detection results.

由于α和γ的参数设置很依赖于装置使用者对网络的理解，为保证任何人都可以轻松的使用本装置而无需任何先验知识，本装置增加了自适应模块，具体方法如下：Since the parameter settings of α and γ are very dependent on the device user's understanding of the network, in order to ensure that anyone can easily use the device without any prior knowledge, this device adds an adaptive module, the specific method is as follows:

计算log(N/M)，记为S，如果S小于零，则S置为零，如果是第一次迭代，令S₀的值为S。本步骤的目的是计算网络训练的当前阶段，负样本损失与正样本损失的比例情况，一般地，网络训练初期更多的样本会被预测为负样本，S的值会比较大；Calculate log(N/M) and record it as S. If S is less than zero, set S to zero. If it is the first iteration, let the value of S ₀ be S. The purpose of this step is to calculate the ratio of negative sample loss to positive sample loss in the current stage of network training. Generally, more samples in the early stage of network training will be predicted as negative samples, and the value of S will be relatively large;

经过以上步骤，网络可以根据训练进行，自动的调整α和γ两个超参数的值，由于网络在初始训练阶段预测错误的情况比较多，为了使网络能够尽快收敛到一个比较不错的情况，需要首先增加负样本的权重，同时降低对难样本学习的权重，即增加α，减少γ；网路因此会很快达到第一个比较稳定的阶段，此时降低对负样本惩罚的权重，增加正样本的惩罚，可以让网络更好的去学习真正有意义的区域，而且在此阶段S将趋向于稳定，因此α也会稳定在0.5以上，同时也不会超过0.85；为了保证训练后期的稳定性，我们不希望γ具有一个很大的数值，因为γ＝2对于难样本的学习已经足够了。After the above steps, the network can automatically adjust the values of the two hyperparameters α and γ according to the training. Since the network has many prediction errors in the initial training stage, in order to enable the network to converge to a relatively good situation as soon as possible, it is necessary to First, increase the weight of negative samples, and at the same time reduce the weight of learning difficult samples, that is, increase α and reduce γ; therefore, the network will soon reach the first relatively stable stage. At this time, reduce the weight of negative sample punishment and increase positive samples. The penalty of the sample can make the network better learn the really meaningful area, and at this stage S will tend to be stable, so α will also be stable above 0.5, and will not exceed 0.85 at the same time; in order to ensure the stability in the later stage of training , we don't want γ to have a large value, because γ=2 is enough for hard sample learning.

S130，由于本检测装置是基于密集的锚点预测的方法，在得到网络的预测结果之后，不能直接将结果作为输出，因为结果中对于同一个待检测目标，往往会输出多个检测框，因此后处理装置主要作用是检测框的过滤，使得对于同一待检测物体，只有一个检测框输出。S130, since the detection device is based on a dense anchor point prediction method, after obtaining the prediction result of the network, it cannot directly output the result, because in the result, multiple detection frames are often output for the same target to be detected, so The main function of the post-processing device is to filter the detection frame, so that for the same object to be detected, only one detection frame is output.

可选的，可以针对不同的场景对于精确性和召回率的的不同要求，选择下面两种后处理方法，分别为更高精确性的Precise-NMS(非极大值抑制算法)和更高召回率的Better-NMS(非极大值抑制算法)。Optionally, the following two post-processing methods can be selected according to the different requirements of precision and recall in different scenarios, which are higher precision Precise-NMS (non-maximum suppression algorithm) and higher recall Rate of Better-NMS (Non-Maximum Suppression Algorithm).

Precise-NMS的目的是保证更高的精确性，同时允许一些漏检的情况出现，即保证检测出来的结果都尽可能的正确。具体步骤如下：The purpose of Precise-NMS is to ensure higher accuracy, while allowing some missed detections, that is, to ensure that the detected results are as correct as possible. Specific steps are as follows:

本策略的目的是如果有多个锚点同时偏向一个物体进行学习，同时在检测阶段都能给出这个物体一个比较高的分数，则可以认为该区域存在物体的可能性更大。The purpose of this strategy is that if there are multiple anchor points biased towards an object for learning at the same time, and at the same time they can give the object a relatively high score in the detection stage, it can be considered that there is a greater possibility of an object in this area.

上述所提到的交并比的含义是：两检测框交集的面积和并集的面积的比值。The meaning of the intersection ratio mentioned above is: the ratio of the area of the intersection of the two detection frames and the area of the union.

Better-NMS可以得到更加高质量的检测框，使得检测结果具有更高的召回率。具体步骤如下：Better-NMS can get a higher quality detection frame, making the detection result have a higher recall rate. Specific steps are as follows:

和/>

中，对于集合/>

和集合/>

然后从集合S中删除集合/>

中元素，从集合B中删除集合/>

and />

in, for the set />

and collection />

Then delete the set /> from the set S

middle element, remove the set from set B />

middle element;

步骤5：将

和s_i加入到结果集D和F中；Step 5: Put

and s _i are added to the result sets D and F;

上述所提到的加权按照如下公式进行计算：The weighting mentioned above is calculated according to the following formula:

本策略的目的是重新考虑了那些被过滤掉的高分检测框的作用，因为那些被过滤掉的检测框，同样也是网络通过训练学习之后得到的比较不错的结果，这其中也包含了物体的语义信息，这样加权的到的结果可以得到更加高质量的检测框。The purpose of this strategy is to reconsider the role of those high-scoring detection frames that are filtered out, because those filtered out detection frames are also relatively good results obtained by the network after training and learning, which also includes objects. Semantic information, so that the weighted result can get a higher quality detection frame.

Claims

1. A method for optimizing and accelerating a target detection model for a few-category scene, comprising the following steps:

Obtain and label the image to be detected;

The marked pictures are added to the feature extraction device for feature extraction, and three sets of fixed-size feature maps are obtained;

Send three sets of fixed-size feature maps to the prediction device with FocalLoss loss function for result prediction;

Filter the detection frame of the prediction result so that only one detection frame is output for the same object to be detected;

It is characterized in that an adaptive module is added to the FocalLoss loss function, specifically:

For each prediction result of each picture, calculate the cross entropy loss, accumulate all the losses that are actually marked as the foreground, denoted as M, and accumulate all the losses that are truly marked as the background, denoted as N;

Calculate log(N/M), record it as S, if S is less than zero, then S is set to zero, if it is the first iteration, let the value of S ₀ be S;

According to the value of S, calculate the value of α, α=1/(2 ^S +0.2);

The value of γ is calculated according to the values of S and S ₀ , and the calculation formula is γ=min(S ₀ −S,2).

2. the method for the object detection model optimization and acceleration of few category scenes as claimed in claim 1, it is characterized in that the filtering of described detection frame comprises the following steps:

Step 1: First obtain the network output detection frame coordinate information set B={b ₁ , b ₂ ...b _n }, and the confidence set is S={s ₁ , s ₂ ...s _n }, where b _t , s _t corresponds to the same detection frame coordinate information and confidence;

Step 2: Take out the element with the highest confidence from S, and mark it as i;

Step 3: Get s _i and b _i according to subscript i, and delete s _i and b _i from sets S and B respectively;

Step 4: Traverse the remaining detection frames in the set B, and add the subscripts of all detection frames whose intersection ratio with _bi is greater than 0.5 to the set _{T. If T is not an empty set, add bi and s i} _respectively To the result sets D and F, delete the elements whose subscripts are in the set T from the S and B sets at the same time;

Step 5: Repeat steps 2-4 until set B is an empty set;

Step 6: Return the final result sets D and F.

3. the method for optimizing and accelerating the target detection model for few category scenes as claimed in claim 1, it is characterized in that the filtering of described detection frame comprises the following steps:

Step 4: traverse the remaining detection frames in the set B, and add the coordinate information and confidence corresponding to all detection frames with an intersection ratio greater than 0.5 with the detection frame b _i to the set respectively