CN109784476B

CN109784476B - Method for improving DSOD network

Info

Publication number: CN109784476B
Application number: CN201910029814.0A
Authority: CN
Inventors: 程树英; 吴建耀; 郑茜颖; 林培杰; 陈志聪; 吴丽君
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2019-01-12
Filing date: 2019-01-12
Publication date: 2022-08-16
Anticipated expiration: 2039-01-12
Also published as: CN109784476A

Abstract

The invention relates to a method for improving DSOD network. First, the input image is preprocessed, the preprocessed image is input into the DSOD feature extraction sub-network, and the RFB_a network is added after the second transition layer of the feature extraction sub-network. The module extracts features with different receptive fields through Atrous convolution with different sampling steps in the RFB_a network. After the feature extraction sub-network, an Atrous convolution layer with a sampling step of 6 is added, and the features generated by the Atrous convolution layer are input to In the multi-scale prediction layer, the multi-scale prediction layer is input into the loss function, and the IOG penalty term is added to the loss function to prevent the overlapping of similar prediction boxes when predicting dense targets of the same type. At the same time, a warm-up strategy is used to set the learning rate in the training phase, and by setting an appropriate batch sample size, the hardware equipment requirements for training the network are reduced. Compared with the original DSOD algorithm, the present invention has higher detection accuracy, improves the detection ability of small targets, and reduces the hardware equipment requirements for training the network.

Description

A Method for Improving DSOD Network

技术领域technical field

本发明涉及计算机视觉领域，特别是一种改进DSOD网络的方法。The invention relates to the field of computer vision, in particular to a method for improving DSOD network.

背景技术Background technique

目标检测是计算机视觉中最重要的研究课题之一，主要任务是在给定图像中定位感兴趣的目标，并准确的判断每个目标的具体位置。基于卷积神经网络的目标检测算法可以分为两种：基于区域提取的目标检测算法和基于回归的目标检测算法。基于区域提取目标检测算法，虽然具有较高的检测精度，但是需要进行提取候选区域，检测速度难以达到实时。基于回归的目标检测算法，如SSD、DSOD,通过去除区域提取的步骤，使得检测达到了实时。但是DSOD算法存在着对小目标检测能力差，以及训练网络时对硬件设备要求高等问题。Object detection is one of the most important research topics in computer vision. The main task is to locate the objects of interest in a given image and accurately determine the specific location of each object. Target detection algorithms based on convolutional neural networks can be divided into two types: target detection algorithms based on region extraction and target detection algorithms based on regression. The target detection algorithm based on region extraction has high detection accuracy, but it needs to extract candidate regions, and the detection speed is difficult to achieve real-time. Regression-based target detection algorithms, such as SSD and DSOD, achieve real-time detection by removing the step of region extraction. However, the DSOD algorithm has the problems of poor detection ability of small targets and high requirements for hardware equipment when training the network.

发明内容SUMMARY OF THE INVENTION

有鉴于此，本发明的目的是提出一种改进DSOD网络的方法，能够提高对小目标的检测能力和提高目标的检测精度。In view of this, the purpose of the present invention is to propose a method for improving the DSOD network, which can improve the detection ability of small targets and the detection accuracy of the targets.

本发明采用以下方案实现：一种改进DSOD网络的方法，包括以下步骤：The present invention adopts the following scheme to realize: a kind of method for improving DSOD network, comprises the following steps:

步骤S1：获取数据集中的图像作为输入图像，并将输入图像输入到输入层；对输入图像进行裁剪、镜像和去均值化预处理，得到预处理后的图像，同时利用归一化方法将预处理后的图像中的绝对坐标转换成相对坐标；Step S1: Obtain an image in the dataset as an input image, and input the input image to the input layer; perform preprocessing on the input image by cropping, mirroring and de-averaging to obtain a preprocessed image, and at the same time use the normalization method to convert the preprocessed image. Convert absolute coordinates in the processed image to relative coordinates;

步骤S2：在DSOD网络中的特征提取子网络的第二个转接层后加入RFB_a网络模块；将步骤S1中预处理后的图像输入到DSOD网络中的特征提取子网络进行特征提取；将DSOD网络中的特征提取子网络中的第二个转接层的特征图输入到RFB_a网络模块中，经过RFB_a网络模块中不同采样步长的Atrous膨胀卷积，提取具有不同感受野的特征；所述提取的不同感受野特征输入到3×3卷积层中，形成DSOD网络的第一个尺度预测层；Step S2: adding the RFB_a network module after the second transition layer of the feature extraction sub-network in the DSOD network; inputting the preprocessed image in step S1 into the feature extraction sub-network in the DSOD network for feature extraction; The feature map of the second transition layer in the feature extraction sub-network in the network is input into the RFB_a network module, and features with different receptive fields are extracted through Atrous dilation convolution with different sampling steps in the RFB_a network module; the The extracted features of different receptive fields are input into the 3×3 convolutional layer to form the first scale prediction layer of the DSOD network;

步骤S3：将有预设采样步长的Atrous卷积层加入到所述DSOD网络中的特征提取子网络后，并将步骤S2中所述特征提取子网络的特征图输入到具有预设(采样步长为6)采样步长的Atrous卷积层中，用以增大特征图的感受野；同时，将Atrous卷积层产生的特征输入到DSOD网络中的多尺度预测层中，形成5个尺度预测层；Step S3: after adding the Atrous convolutional layer with preset sampling step size to the feature extraction sub-network in the DSOD network, and input the feature map of the feature extraction sub-network described in step S2 into the feature extraction sub-network with preset (sampling). The step size is 6) In the Atrous convolutional layer with the sampling step size, it is used to increase the receptive field of the feature map; at the same time, the features generated by the Atrous convolutional layer are input into the multi-scale prediction layer in the DSOD network to form 5 scale prediction layer;

步骤S4：将步骤S2中所述的DSOD网络的第一个尺度预测层和步骤S3所述的5个尺度预测层的特征输入到加入IOG惩罚项的多任务损失函数L中；Step S4: Input the features of the first scale prediction layer of the DSOD network described in the step S2 and the 5 scale prediction layers described in the step S3 into the multi-task loss function L adding the IOG penalty term;

步骤S5：通过预热策略设置学习率，利用梯度下降算法优化所述DSOD网络中的所有网络层的权值；设置合适的样本大小(本发明设置为16)，用以降低训练DSOD网络的硬件设备要求。Step S5: set the learning rate through the preheating strategy, and use the gradient descent algorithm to optimize the weights of all network layers in the DSOD network; set an appropriate sample size (the present invention is set to 16), in order to reduce the hardware for training the DSOD network equipment requirements.

进一步地，步骤S1中图像预处理具体为：Further, the image preprocessing in step S1 is specifically:

步骤S11：对输入图像进行裁剪：先随机选择输入图像即裁剪图像的长度和高度，然后在0.1、0.3、0.7、0.9中进行随机选择一个数作为Jaccard系数阈值，经过Jaccard系数计算原图像中所有的真实框与裁剪图像的相似度；Step S11: Crop the input image: first randomly select the input image, that is, the length and height of the cropped image, and then randomly select a number among 0.1, 0.3, 0.7, and 0.9 as the Jaccard coefficient threshold, and calculate all the values in the original image through the Jaccard coefficient. The similarity of the ground-truth box and the cropped image;

步骤S12：判断真实框或裁剪图像的Jaccard系数是否大于步骤S11中随机选取的Jaccard阈值；若至少有一个真实框和裁剪图像的Jaccard系数大于所述选取的Jaccard阈值，而且此真实框的中心坐标落在裁剪图像中，则裁剪图像符合要求，否则返回步骤S11；其中Jaccard系数计算如下：Step S12: Judging whether the Jaccard coefficient of the real frame or the cropped image is greater than the Jaccard threshold randomly selected in step S11; if at least one real frame and the Jaccard coefficient of the cropped image are greater than the selected Jaccard threshold, and the center coordinates of the real frame are If it falls in the cropped image, the cropped image meets the requirements, otherwise it returns to step S11; the Jaccard coefficient is calculated as follows:

其中，N表示图像中真实框的个数，box_i表示第i个真实框的面积，box_cut表示裁剪图像的面积，运算符∩表示计算重叠面积。Among them, N represents the number of real boxes in the image, box _i represents the area of the ith real box, box _cut represents the area of the cropped image, and operator ∩ represents the calculation of the overlap area.

步骤S13：对裁剪后的图像，按照预设的概率T(本发明T＝0.5)进行左右镜像处理，将镜像处理后的图像分辨率调整为300×300，得到镜像处理后的图像。Step S13: Perform left and right mirroring processing on the cropped image according to a preset probability T (T=0.5 in the present invention), and adjust the resolution of the mirrored image to 300×300 to obtain a mirrored image.

步骤S14:采用去均值化方法将镜像处理后的图像进行去均值化，得到去均值化后的图像。Step S14: De-average the mirrored image by using a de-average method to obtain a de-averaged image.

进一步地，所述步骤S2的具体内容为：首先在每一个RFB_a网络分支上使用1×1卷积层，用以来减少特征的通道数；在RFB_a网络第一个分支上采用步长为1且卷积核为3×3的卷积层，得到3×3的感受野特征；第二个分支上采用1×3卷积层和采样步长为3的Atrous卷积层，得到1×7的感受野特征；在第三个分支上采用3×1卷积层和采样步长为3的Atrous卷积层，得到7×1的感受野特征；在第四个分支上采用3×3卷积层和采样步长为5的Atrous卷积层，得到11×11的感受野特征；通过通道拼接和1×1卷积层将各分支提取的特征进行融合；最后将融合的特征和DSOD网络的第二个转接层产生的特征通过残差融合形成最终输出的特征。Further, the specific content of the step S2 is: first, a 1×1 convolution layer is used on each RFB_a network branch to reduce the number of channels of the feature; on the first branch of the RFB_a network, the step size is 1 and The convolution kernel is a 3×3 convolutional layer, and a 3×3 receptive field feature is obtained; the second branch adopts a 1×3 convolutional layer and an Atrous convolutional layer with a sampling step size of 3 to obtain a 1×7 convolutional layer. Receptive field features; use a 3×1 convolutional layer and an Atrous convolutional layer with a sampling stride of 3 on the third branch to obtain a 7×1 receptive field feature; use a 3×3 convolution on the fourth branch layer and the Atrous convolutional layer with sampling step size of 5 to obtain 11×11 receptive field features; the features extracted from each branch are fused through channel splicing and 1×1 convolutional layers; finally, the fused features are combined with the DSOD network The features produced by the second transition layer are fused by residuals to form the final output features.

进一步地，步骤S3中所述加入具有预设采样步长为Atrous卷积层的具体方法为：首先，增加所述DSOD网络中特征提取子网络的输出通道数C，用以提取更加丰富的特征信息，然后加入具有一定采样步长r的Atrous卷积层，Atrous卷积层的输出通道等于原DSOD特征提取子网络输出通道数，使得Atrous卷积嵌入到DSOD的网络中去；紧接着加入1×1卷积层进行特征融合。Further, the specific method of adding the Atrous convolutional layer with a preset sampling step size described in step S3 is: first, increase the number of output channels C of the feature extraction sub-network in the DSOD network, in order to extract more abundant features. information, and then add the Atrous convolution layer with a certain sampling step size r. The output channel of the Atrous convolution layer is equal to the number of output channels of the original DSOD feature extraction sub-network, so that the Atrous convolution is embedded in the DSOD network; then add 1 A ×1 convolutional layer performs feature fusion.

进一步地，步骤S4所述加入IOG惩罚项的多任务损失函数L具体为：Further, the multi-task loss function L added to the IOG penalty item described in step S4 is specifically:

步骤S41：计算DSOD网络输出的预测框p与预设样本所有真实框G面积交并比最大的预测框g_{iou_max}，公式如下：Step S41: Calculate the prediction frame g _{iou_max} with the largest area intersection ratio between the prediction frame p output by the DSOD network and all the real frames G of the preset sample, and the formula is as follows:

其中，g表示真实框，G表示所有真实框的集合，p表示预测框，P表示所有预测框集合，box_g表示真实框的面积，box_p表示预测框的面积；Among them, g represents the real box, G represents the set of all real boxes, p represents the prediction box, P represents the set of all prediction boxes, box _g represents the area of the real box, and box _p represents the area of the prediction box;

步骤S42：将步骤S41中交并比最大面积的真实框去除，再计算预测框与剩余真实框的最大的IOG惩罚项，将最大的IOG惩罚项作为L_iog损失函数，计算公式如下：Step S42: Remove the real frame with the largest area of the intersection and ratio in step S41, then calculate the largest IOG penalty item between the predicted frame and the remaining real frame, and use the largest IOG penalty item as the _Liog loss function. The calculation formula is as follows:

步骤S43：将L_iog损失函数和定位损失函数L_loc以及分类损失函数L_conf函数进行加权融合，形成最终的多任务损失函数L，公式如下：Step S43: Weighted fusion of the _Liog loss function, the localization loss function _Lloc and the classification loss function L _conf function to form the final multi-task loss function L, the formula is as follows:

其中，N表示检测的正样本的数量，α表示定位损失的L_loc的权重；定位损失函数L_loc采用的是smooth_L1损失；分类损失函数L_conf采用信息交叉熵进行计算；定位损失函数L_loc计算如下：Among them, N represents the number of detected positive samples, α represents the weight of L _loc of the localization loss; the localization loss function L _loc adopts the smooth _L1 loss; the classification loss function L _conf adopts the information cross entropy to calculate; the localization loss function L _loc The calculation is as follows:

其中，

指示函数，表示第i个默认框匹配第j个类别为k的真实框，l表示预测框的位置坐标，pos表示正样本的默认框，N表示的正样本的数量；smooth_L1计算如下：in,

Indication function, indicating that the ith default box matches the jth category of the real box, l represents the position coordinates of the predicted box, pos represents the default box of the positive sample, and N represents the number of positive samples; smooth _L1 is calculated as follows:

分类损失L_conf计算如下：The classification loss _Lconf is calculated as follows:

其中，c表示每一个类别的置信度，Neg表示负样本，p表示类别，0表示类别是背景。

表示第i个预测框为类别p的概率，计算如下：Among them, c represents the confidence of each category, Neg represents the negative sample, p represents the category, and 0 represents the category is the background.

Represents the probability that the i-th prediction box is of category p, calculated as follows:

进一步地，步骤S5中所述采用预热(warmup)策略设置学习率具体为：将初始学习率设置为10^-5，在前5个epoch使学习率线性增长到10^-2，在第75个epoch、第125个epoch和第175个epoch将学习率分别除以10，在第200个epoch完成训练；批归一化的权重初始值设置为0.5，偏置的值设置为0；所有的卷积使用xavier的方法进行初始化；通过改进训练策略将训练的批样本大小从128降到了16，用以降低训练网络对硬件设备的要求。Further, using the warmup strategy to set the learning rate in step S5 is specifically: setting the initial learning rate to 10 ^-5 , and linearly increasing the learning rate to 10 -2 in the first 5 epochs, and in the 75th epoch, the learning rate is linearly increased to 10 ^-2 . epoch, 125th epoch and 175th epoch divide the learning rate by 10, and complete the training at the 200th epoch; the initial value of batch normalization weights is set to 0.5, and the value of bias is set to 0; all volumes The product uses the xavier method for initialization; by improving the training strategy, the training batch sample size is reduced from 128 to 16 to reduce the hardware equipment requirements for training the network.

与现有技术相比，本发明有以下有益效果：Compared with the prior art, the present invention has the following beneficial effects:

本发明在低层网络中加入高效的网络结构，提取更加全局的特征信息，提高了对小目标的检测能力；在损失函数中加入惩罚项防止在目标密集时出现同类预测框重叠，而在非最大值抑制后处理时产生漏检，提高了目标的检测精度。此外，通过改进训练策略，降低了训练网络的硬件设备要求。The invention adds an efficient network structure to the low-level network, extracts more global feature information, and improves the detection ability of small targets; a penalty term is added to the loss function to prevent the overlapping of similar prediction frames when the targets are dense, and the non-maximum The value suppresses the missed detection during post-processing, which improves the detection accuracy of the target. In addition, by improving the training strategy, the hardware equipment requirements for training the network are reduced.

附图说明Description of drawings

图1为本发明实施例的结构图。FIG. 1 is a structural diagram of an embodiment of the present invention.

图2为本发明实施例的卷积层和Atrous卷积层一维特征提取图。FIG. 2 is a one-dimensional feature extraction diagram of a convolution layer and an Atrous convolution layer according to an embodiment of the present invention.

图3为本发明实施例的RFB_a网络结构图。FIG. 3 is a network structure diagram of RFB_a according to an embodiment of the present invention.

图4为本发明实施例的密集采样图。FIG. 4 is a dense sampling diagram of an embodiment of the present invention.

图5为本发明实施例的具体实施例1检测结果与原DSOD检测结果的对比图。5 is a comparison diagram of the detection result of the specific embodiment 1 of the embodiment of the present invention and the detection result of the original DSOD.

具体实施方式Detailed ways

下面结合附图及实施例对本发明做进一步说明。The present invention will be further described below with reference to the accompanying drawings and embodiments.

如图1所示，本实施例提供了一种改进DSOD网络的方法，包括以下步骤：As shown in Figure 1, the present embodiment provides a method for improving a DSOD network, including the following steps:

在本实施例中，步骤S1中图像预处理具体为：In this embodiment, the image preprocessing in step S1 is specifically:

在本实施例中，所述步骤S2的具体内容为：首先在每一个RFB_a网络分支上使用1×1卷积层，用以来减少特征的通道数；在RFB_a网络第一个分支上采用步长为1且卷积核为3×3的卷积层，得到3×3的感受野特征；第二个分支上采用1×3卷积层和采样步长为3的Atrous卷积层，得到1×7的感受野特征；在第三个分支上采用3×1卷积层和采样步长为3的Atrous卷积层，得到7×1的感受野特征；在第四个分支上采用3×3卷积层和采样步长为5的Atrous卷积层，得到11×11的感受野特征；通过通道拼接和1×1卷积层将各分支提取的特征进行融合；最后将融合的特征和DSOD网络的第二个转接层产生的特征通过残差融合形成最终输出的特征。In this embodiment, the specific content of step S2 is: first, a 1×1 convolution layer is used on each RFB_a network branch to reduce the number of feature channels; step size is used on the first branch of the RFB_a network is 1 and the convolution kernel is a 3×3 convolutional layer to obtain a 3×3 receptive field feature; the second branch uses a 1×3 convolutional layer and an Atrous convolutional layer with a sampling stride of 3 to obtain 1 ×7 receptive field features; use 3 × 1 convolutional layers and Atrous convolutional layers with sampling stride 3 on the third branch to obtain 7 × 1 receptive field features; use 3 × 1 on the fourth branch 3 convolutional layers and Atrous convolutional layers with a sampling step size of 5 to obtain 11×11 receptive field features; the features extracted from each branch are fused through channel splicing and 1×1 convolutional layers; finally, the fused features and The features generated by the second transition layer of the DSOD network are fused by residuals to form the final output features.

在本实施例中，步骤S3中所述加入具有预设采样步长为Atrous卷积层的具体方法为：首先，增加所述DSOD网络中特征提取子网络的输出通道数C，用以提取更加丰富的特征信息，然后加入具有一定采样步长r的Atrous卷积层，Atrous卷积层的输出通道等于原DSOD特征提取子网络输出通道数，使得Atrous卷积嵌入到DSOD的网络中去；紧接着加入1×1卷积层进行特征融合。In this embodiment, the specific method for adding the Atrous convolutional layer with a preset sampling step size in step S3 is: first, increase the number of output channels C of the feature extraction sub-network in the DSOD network, in order to extract more Rich feature information, and then add the Atrous convolution layer with a certain sampling step size r, the output channel of the Atrous convolution layer is equal to the number of output channels of the original DSOD feature extraction sub-network, so that the Atrous convolution is embedded in the DSOD network; Then a 1×1 convolutional layer is added for feature fusion.

在本实施例中，步骤S4所述加入IOG惩罚项的多任务损失函数L具体为：In this embodiment, the multi-task loss function L added with the IOG penalty item in step S4 is specifically:

其中，

在本实施例中，步骤S5中所述采用预热(warmup)策略设置学习率具体为：将初始学习率设置为10^-5，在前5个epoch使学习率线性增长到10^-2，在第75个epoch、第125个epoch和第175个epoch将学习率分别除以10，在第200个epoch完成训练；批归一化的权重初始值设置为0.5，偏置的值设置为0；所有的卷积使用xavier的方法进行初始化；通过改进训练策略将训练的批样本大小从128降到了16，用以降低训练网络对硬件设备的要求。In this embodiment, the use of the warmup strategy to set the learning rate in step S5 is specifically: setting the initial learning rate to 10 ⁻⁵ , and linearly increasing the learning rate to 10 ⁻² in the first 5 epochs. The 75th epoch, the 125th epoch and the 175th epoch divide the learning rate by 10 respectively, and the training is completed at the 200th epoch; the initial value of the batch normalization weight is set to 0.5, and the value of the bias is set to 0; All convolutions are initialized using the xavier method; the training batch size is reduced from 128 to 16 by improving the training strategy to reduce the hardware requirements for training the network.

较佳的，本实施例将输入的图像进行裁剪、镜像、去均值化预处理；将预处理后的图像输入到DSOD特征提取子网络中，在特征提取子网络的第二个转接层后加入RFB_a网络模块中，经过RFB_a网络中不同采样步长的Atrous卷积提取具有不同感受野的特征，为检测小目标步骤提供更具有全局信息的特征；在特征提取子网络后加入采样步长为6的Atrous卷积层，增大了特征图的感受野，为后续多尺度预测层提供更加丰富的语义信息；将Atrous卷积层产生的特征输入到多尺度预测层中，将多尺度预测层输入到损失函数中，在损失函数中加入IOG惩罚项，防止在预测密集的同类型目标时出现同类预测框重叠，从而避免被非最大值抑制处理后出现漏检；同时，在训练阶段采用预热(warmup)策略设置学习率，通过合适的批样本大小，降低了训练网络的硬件设备要求。实验结果表明，本发明相对于原DSOD算法具有更高的检测精度，提高了对小目标的检测能力，同时降低了训练网络的硬件设备要求。Preferably, in this embodiment, the input image is preprocessed by cropping, mirroring, and de-averaging; the preprocessed image is input into the DSOD feature extraction sub-network, after the second transition layer of the feature extraction sub-network. In the RFB_a network module, the features with different receptive fields are extracted through Atrous convolution with different sampling steps in the RFB_a network, which provides features with more global information for the detection of small targets. After the feature extraction sub-network, the sampling step is added as 6's Atrous convolutional layer increases the receptive field of the feature map and provides richer semantic information for the subsequent multi-scale prediction layer; the features generated by the Atrous convolutional layer are input into the multi-scale prediction layer, and the multi-scale prediction layer Input into the loss function, and add the IOG penalty term to the loss function to prevent overlapping of similar prediction boxes when predicting dense targets of the same type, so as to avoid missed detection after being processed by non-maximum suppression; The warmup strategy sets the learning rate, which reduces the hardware requirements for training the network through an appropriate batch sample size. The experimental results show that compared with the original DSOD algorithm, the present invention has higher detection accuracy, improves the detection ability of small targets, and reduces the hardware equipment requirements for training the network.

图2为标准卷积层和Atrous卷积层一维特征提取图。当采样的步长r为1时，Atrous卷积就是一个标准的卷积。当采样步长r为2，填充系数pading为2时，且在输入信号之间插入r-1个0，经过Atrous卷积运算之后，3个输入信号产生了5个信号激励。从图中可以看出，Atrous卷积层具有增大卷积核的感受野。对于本实施例的使用二维的Atrous卷积也具有核一维Atrous卷积相同的效果。Figure 2 is a one-dimensional feature extraction diagram of the standard convolutional layer and the Atrous convolutional layer. When the sampling stride r is 1, the Atrous convolution is a standard convolution. When the sampling step r is 2, the padding coefficient is 2, and r-1 0s are inserted between the input signals, after the Atrous convolution operation, the 3 input signals generate 5 signal excitations. As can be seen from the figure, the Atrous convolutional layer has the receptive field of increasing the convolution kernel. The use of the two-dimensional Atrous convolution in this embodiment also has the same effect as the one-dimensional Atrous convolution.

图3表示一个标准的RFB_a的网络结构。RFB_a网络模块是多分支卷积网络的结构。在不同分支中，通过不同采样步长的Atrous卷积提取的不同大小感受野特征，经过通道拼接进行特征融合，在原特征图上形成密集采样的效果，如图4所示。本实施例在标准的RFB_a网络结构中每一个分支的最后一个Atrous卷积加入ReLU激活函数，以提取更高层次的特征。同时，为了保证DSOD网络结构的一致性，本实施例将RFB_a网络中批归一化(BatchNormalization，BN)和ReLU激活函数调整到卷积层前。Figure 3 shows the network structure of a standard RFB_a. The RFB_a network module is the structure of a multi-branch convolutional network. In different branches, the receptive field features of different sizes extracted by Atrous convolution with different sampling steps are fused by channel splicing to form the effect of dense sampling on the original feature map, as shown in Figure 4. In this embodiment, a ReLU activation function is added to the last Atrous convolution of each branch in the standard RFB_a network structure to extract higher-level features. Meanwhile, in order to ensure the consistency of the DSOD network structure, this embodiment adjusts the batch normalization (BN) and ReLU activation functions in the RFB_a network before the convolution layer.

实施例1，如图5所示，利用目标检测分析工具分析DSOD和改进方法在最最小(XS)目标上的检测能力。从图5可以看出，除了桌子类别外，改进的DSOD目标检测方法在飞机、自行车、鸟等类别上的检测精度均有不同程度的提高。在桌子类别图像中，放在桌子上的杯子等物品，对桌子造成一定的遮挡，对改进的DSOD检测造成了较大影响，所以精度低于原DSOD算法。总体而言，本实施例改进的方法对小目标具有更好的检测精度。Example 1, as shown in FIG. 5 , uses the target detection analysis tool to analyze the detection capability of the DSOD and the improved method on the smallest (XS) target. It can be seen from Figure 5 that, except for the table category, the detection accuracy of the improved DSOD object detection method in categories such as airplanes, bicycles, and birds has been improved to varying degrees. In the table category image, the cups and other items placed on the table cause a certain occlusion on the table, which has a great impact on the improved DSOD detection, so the accuracy is lower than the original DSOD algorithm. In general, the improved method of this embodiment has better detection accuracy for small targets.

实施例2，在PASCAL V0C2007测试集中，将改进的DSOD和其他一些典型的基于回归的目标检测算法在检测精度和检测速度上进行对比，主要考量的指标是mAP(mean AveragePrecision)和FPS(Frames Per Second)。其中，*表示在本实施例实验环境中测试的结果。从表中的数据可以看出，改进的DSOD模型具有更高的精度，检测精度从77.4％提高到79.0％。和DSSD相比，改进的DSOD在检测精度和检测速度均优于DSSD。由于RFBNet300使用了多个RFB网络块，能提取更全局的特征，在精度上与本实施例的改进方法大致相同。但是RFBNet300的运算复杂度高于本实施例的改进方法，改进的方法具有更好的实时性。Embodiment 2, in the PASCAL VOC2007 test set, the improved DSOD and some other typical regression-based target detection algorithms are compared in terms of detection accuracy and detection speed. The main indicators considered are mAP (mean Average Precision) and FPS (Frames Per Second). Among them, * represents the result of testing in the experimental environment of this embodiment. From the data in the table, it can be seen that the improved DSOD model has higher accuracy, and the detection accuracy is improved from 77.4% to 79.0%. Compared with DSSD, the improved DSOD is superior to DSSD in both detection accuracy and detection speed. Since RFBNet300 uses multiple RFB network blocks, more global features can be extracted, and the accuracy is roughly the same as the improved method in this embodiment. However, the computational complexity of RFBNet300 is higher than that of the improved method in this embodiment, and the improved method has better real-time performance.

以上所述仅为本发明的较佳实施例，凡依本发明申请专利范围所做的均等变化与修饰，皆应属本发明的涵盖范围。The above descriptions are only preferred embodiments of the present invention, and all equivalent changes and modifications made according to the scope of the patent application of the present invention shall fall within the scope of the present invention.

Claims

1. a method for improving DSOD network, is characterized in that: comprise the following steps:

Step S1: Obtain an image in the dataset as an input image, and input the input image to the input layer; perform preprocessing on the input image by cropping, mirroring and de-averaging to obtain a preprocessed image, and at the same time use the normalization method to convert the preprocessed image. Convert absolute coordinates in the processed image to relative coordinates;

Step S2: adding the RFB_a network module after the second transition layer of the feature extraction sub-network in the DSOD network; inputting the preprocessed image in step S1 into the feature extraction sub-network in the DSOD network for feature extraction; The feature map of the second transition layer in the feature extraction sub-network in the network is input into the RFB_a network module, and features with different receptive fields are extracted through Atrous dilation convolution with different sampling steps in the RFB_a network module; the The extracted features of different receptive fields are input into the 3×3 convolutional layer to form the first scale prediction layer of the DSOD network;

Step S3: After adding the Atrous convolutional layer with a preset sampling step size to the feature extraction sub-network in the DSOD network, input the feature map of the feature extraction sub-network described in step S2 into the feature extraction sub-network with a preset sampling step. In the long Atrous convolutional layer, it is used to increase the receptive field of the feature map; at the same time, the features generated by the Atrous convolutional layer are input into the multi-scale prediction layer in the DSOD network to form 5 scale prediction layers;

Step S4: Input the features of the first scale prediction layer of the DSOD network described in the step S2 and the 5 scale prediction layers described in the step S3 into the multi-task loss function L adding the IOG penalty term;

Step S5: set the learning rate through the preheating strategy, and use the gradient descent algorithm to optimize the weights of all network layers in the DSOD network; set the sample size to reduce the hardware equipment requirements for training the DSOD network;

The multi-task loss function L added to the IOG penalty item described in step S4 is specifically:

Step S41: Calculate the prediction frame g _{iou_max} with the largest area intersection ratio between the prediction frame p output by the DSOD network and all the real frames G of the preset sample, and the formula is as follows:

Among them, g represents the real box, G represents the set of all real boxes, p represents the prediction box, P represents the set of all prediction boxes, box _g represents the area of the real box, and box _p represents the area of the prediction box;

Step S42: Remove the real frame with the largest area of the intersection and ratio in step S41, then calculate the largest IOG penalty item between the predicted frame and the remaining real frame, and use the largest IOG penalty item as the _Liog loss function. The calculation formula is as follows:

Step S43: Weighted fusion of the _Liog loss function, the localization loss function _Lloc and the classification loss function L _conf function to form the final multi-task loss function L, the formula is as follows:

Among them, N represents the number of detected positive samples, α represents the weight of L _loc of the localization loss; the localization loss function L _loc adopts the xmooth _L1 loss; the classification loss function L _conf adopts the information cross entropy to calculate; the localization loss function L _loc The calculation is as follows:

in,

The classification loss _Lconf is calculated as follows:

Among them, c represents the confidence of each category, Neg represents the negative sample, p represents the category, 0 represents the category is the background,

2. a kind of method of improving DSOD network according to claim 1, is characterized in that: in step S1, image preprocessing is specifically:

Step S11: Crop the input image: first randomly select the input image, that is, the length and height of the cropped image, and then randomly select a number among 0.1, 0.3, 0.7, and 0.9 as the Jaccard coefficient threshold, and calculate all the values in the original image through the Jaccard coefficient. The similarity of the ground-truth box and the cropped image;

Step S12: Judging whether the Jaccard coefficient of the real frame or the cropped image is greater than the Jaccard threshold randomly selected in step S11; if at least one real frame and the Jaccard coefficient of the cropped image are greater than the selected Jaccard threshold, and the center coordinates of the real frame are If it falls in the cropped image, the cropped image meets the requirements, otherwise it returns to step S11; the Jaccard coefficient is calculated as follows:

Among them, N represents the number of real boxes in the image, box _i represents the area of the ith real box, box _cut represents the area of the cropped image, and operator ∩ represents the calculation of the overlap area;

Step S13: Perform left and right mirroring processing on the cropped image according to a preset probability T, and adjust the resolution of the mirrored image to 300×300 to obtain a mirrored image;

Step S14: De-average the mirrored image by using a de-average method to obtain a de-averaged image.

3. a kind of method of improving DSOD network according to claim 1, it is characterized in that: the concrete content of described step S2 is: at first use 1 × 1 convolution layer on each RFB_a network branch, in order to reduce the characteristic The number of channels; the first branch of the RFB_a network adopts a convolutional layer with a stride of 1 and a convolution kernel of 3 × 3 to obtain a 3 × 3 receptive field feature; the second branch adopts a 1 × 3 convolution layer and an Atrous convolutional layer with a sampling stride of 3 to obtain a 1×7 receptive field feature; on the third branch, a 3×1 convolutional layer and an Atrous convolutional layer with a sampling stride of 3 are used to obtain a 7×1 convolutional layer. 1 receptive field feature; adopt 3×3 convolutional layer and Atrous convolutional layer with sampling stride 5 on the fourth branch to obtain 11×11 receptive field feature; through channel stitching and 1×1 convolutional layer The features extracted by each branch are fused; finally, the fused features and the features generated by the second transition layer of the DSOD network are fused to form the final output features through residual fusion.

4. a kind of method of improving DSOD network according to claim 1, is characterized in that, adding described in step S3 has preset sampling step size and is the concrete method of Atrous convolution layer is: First, increase described DSOD network The number of output channels C of the feature extraction sub-network is used to extract richer feature information, and then the Atrous convolution layer with sampling step size r is added. The output channel of the Atrous convolution layer is equal to the original DSOD feature extraction sub-network. The number of output channels , so that Atrous convolution is embedded into the DSOD network; then a 1×1 convolution layer is added for feature fusion.

5. a kind of method for improving DSOD network according to claim 1, is characterized in that, described in step S5, setting learning rate by preheating strategy is specifically: setting initial learning rate to ^10-5 , in the first 5 The epoch increases the learning rate linearly to 10 ^-2 , divides the learning rate by 10 at the 75th epoch, 125th epoch and 175th epoch, respectively, and completes the training at the 200th epoch; batch normalized weight initial value It is set to 0.5, and the bias value is set to 0; all convolutions are initialized using the xavier method; the training batch sample size is reduced from 128 to 16 by improving the training strategy, in order to reduce the hardware equipment requirements for training the network.