CN115147745A

CN115147745A - Small target detection method based on urban unmanned aerial vehicle image

Info

Publication number: CN115147745A
Application number: CN202210942377.3A
Authority: CN
Inventors: 谭励; 黄小凯; 连晓峰; 刘宇昭
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2022-08-08
Filing date: 2022-08-08
Publication date: 2022-10-04

Abstract

The invention discloses a small-scale target detection method based on an urban unmanned aerial vehicle image, and belongs to the technical field of computer vision technology and target detection. The invention carries out data enhancement; extracting features through a composite backbone network; combining a bidirectional characteristic pyramid network to perform multi-scale characteristic fusion; and according to the multi-scale characteristic information, combining a space pyramid attention mechanism to carry out target positioning and classification. The invention solves the problems of missing detection and false detection of small-scale targets in unmanned aerial vehicle target detection; improving the accuracy of model detection; and the detection performance of the small-scale target is ensured.

Description

A small target detection method based on urban UAV images

技术领域technical field

本发明属于计算机视觉技术和目标检测技术邻域，尤其涉及一种基于城市无人机图像的小尺度目标检测实现方法。The invention belongs to the neighborhood of computer vision technology and target detection technology, and in particular relates to a small-scale target detection implementation method based on urban drone images.

背景技术Background technique

近年来，配备摄像头的无人机受到了广泛关注。配备了嵌入式设备的无人机能够分析捕获的数据并催生出各种新的应用场景，例如：无人机可以提供有价值的见解，帮助农民或牧场主优化农业操作，监测作物生长，保持畜群安全等等。用无人机来提取航拍图像，来代替昂贵的起重机和直升机。无人机可以有效地将医疗用品、食品或其他货物等包裹送到指定地点。无人机可以通过监测大面积区域，为安全威胁和紧急情况提供实时性可视化结果。无人机在帮助搜索失踪人员、逃犯或救援幸存者以及在困难的地形和恶劣的条件下投放物资方面非常有用。因此，自动理解从无人机上收集的视觉数据变得非常重要，这使计算机视觉与无人机的关系越来越密切。作为计算机视觉的两个基本问题，针对无人机视频的目标检测和目标跟踪也正在被广泛研究。Camera-equipped drones have received a lot of attention in recent years. Drones equipped with embedded devices can analyze the captured data and give rise to a variety of new application scenarios, such as: drones can provide valuable insights to help farmers or ranchers optimize agricultural operations, monitor crop growth, maintain Herd safety and more. Use drones to extract aerial imagery instead of expensive cranes and helicopters. Drones can effectively deliver packages such as medical supplies, food or other goods to designated locations. Drones can provide real-time visualizations of security threats and emergencies by monitoring large areas. Drones are very useful in helping search for missing persons, fugitives or rescue survivors, as well as drop supplies in difficult terrain and harsh conditions. Therefore, it becomes very important to automatically understand the visual data collected from drones, which makes computer vision more and more closely related to drones. As two fundamental problems of computer vision, object detection and object tracking for UAV videos are also being extensively studied.

在无人机支持的城市空中监视和视觉目标识别场景中，无人机通常需要通过处理安装在无人机上的摄像头捕捉到的视频图像来近实时地检测目标。无人机能够跟随目标，并根据视觉反馈结果主动改变其方向和检测区域以优化检测性能。由于这些小型无人机受其自身的体型和供电性能限制能力有限，因此以高推理精度处理大量视频流变得不可行。在这种情况下，机器学习，尤其是深度学习，在许多基于计算机视觉的无人机视觉应用中变得越来越流行。由于传统的特征工程并不总是很适合在复杂环境中进行无人机空中跟踪，因此深度神经网络，尤其是卷积神经网络在图像分类和识别中用于从捕获的图像中提取关键特征。所以，采用深度学习技术在无人机航拍图像中进行目标检测具有重要的研究价值和意义。In UAV-supported urban aerial surveillance and visual object recognition scenarios, UAVs usually need to detect objects in near real-time by processing video images captured by cameras mounted on the UAV. The drone is able to follow the target and actively change its direction and detection area based on the visual feedback to optimize detection performance. Since these small UAVs are limited by their own size and power capabilities, it becomes infeasible to process massive video streams with high inference accuracy. In this context, machine learning, especially deep learning, is becoming more and more popular in many computer vision-based drone vision applications. Since traditional feature engineering is not always well suited for UAV aerial tracking in complex environments, deep neural networks, especially convolutional neural networks, are used in image classification and recognition to extract key features from captured images. Therefore, the use of deep learning technology for target detection in UAV aerial images has important research value and significance.

目前，基于深度学习的目标检测方法主要可以分为两类。一类是基于目标候选区域的two-stage类方法，此方法通过提取候选区域，并对相应区域进行深度学习来实现检测结果的分类，因此此类方法准确率较高。另一类是基于深度学习回归的one-stage类方法，可以直接计算出物体的类别概率和位置坐标值，经过单次检测即可直接得到最终的检测结果，大幅度提高检测速度。2012年深度卷积网络CNN的兴起，将目标检测技术划分为传统的目标检测时期和基于深度学习的检测时期两个阶段，并引领目标检测领域开始了飞速发展。其中，相对于two-stage系列的候选框提取与分类,YOLO将目标检测问题统一为一个回归问题，通过一次检测直接实现目标的分类和目标位置回归；YOLOv2在继续保持处理速度的基础上，同时在检测数据集和分类数据集上训练目标检测器，用检测数据集的数据学习物体的准确位置，用分类数据集的数据来增加分类的类别量，提升模型健壮性；YOLOv3将一个单神经网络应用于图像的多个位置和尺度，将图像划分为不同的区域，预测每一块区域的边界框和概率，将评分较高的边界框和分类概率视为检测结果；YOLOv4结合了大量研究技术，加以组合并进行适当创新，实现了速度和精度的完美平衡；YOLOv5算法在训练阶段，引入了Mosaic数据增强、自适应锚框计算、自适应图片缩放等技巧，在基准网络中融合了Focus结构与CSP结构，在Neck网络中添加了FPN+PAN结构，在Head输出层的使用GIOU_Loss损失函数和DIOU_nms，使其速度与精度都得到了极大的性能提升。YOLO系列算法模型虽然提高了检测的速度，但是通过回归直接预测目标位置和分类的方式，对小目标的检测效果欠佳。At present, the target detection methods based on deep learning can be mainly divided into two categories. One is the two-stage method based on the target candidate area. This method achieves the classification of the detection result by extracting the candidate area and performing deep learning on the corresponding area, so the accuracy of this method is high. The other is the one-stage method based on deep learning regression, which can directly calculate the category probability and position coordinate value of the object. After a single detection, the final detection result can be directly obtained, which greatly improves the detection speed. The rise of the deep convolutional network CNN in 2012 divided the target detection technology into two stages: the traditional target detection period and the detection period based on deep learning, and led the rapid development of the target detection field. Among them, compared with the two-stage series of candidate frame extraction and classification, YOLO unifies the target detection problem into a regression problem, and directly achieves target classification and target position regression through one detection; YOLOv2 continues to maintain processing speed. Train the target detector on the detection data set and the classification data set, use the data of the detection data set to learn the exact position of the object, and use the data of the classification data set to increase the amount of classification categories and improve the robustness of the model; YOLOv3 combines a single neural network It is applied to multiple positions and scales of the image, divides the image into different regions, predicts the bounding box and probability of each region, and regards the bounding box and classification probability with a higher score as the detection result; YOLOv4 combines a large number of research technologies, Combining them and making appropriate innovations achieves a perfect balance between speed and accuracy; in the training phase of the YOLOv5 algorithm, techniques such as Mosaic data enhancement, adaptive anchor box calculation, and adaptive image scaling are introduced, and the Focus structure and The CSP structure, the FPN+PAN structure is added to the Neck network, and the GIOU_Loss loss function and DIOU_nms are used in the Head output layer, which greatly improves the speed and accuracy. Although the YOLO series algorithm model improves the detection speed, the detection effect of small targets is not good by directly predicting the target position and classification through regression.

发明内容SUMMARY OF THE INVENTION

为了克服上述现有技术的不足，本发明提供一种基于城市无人机图像的小尺度目标检测方法，基于复合骨干网络、双向特征金字塔网络和空间金字塔注意力机制构建检测模型，将YOLOv5目标检测模型改进为适用于城市无人机图像中小目标的检测方法；由此解决无人机图片检测中小目标的漏检和误检问题，提高小目标检测精度。In order to overcome the above-mentioned deficiencies of the prior art, the present invention provides a small-scale target detection method based on urban UAV images. The model is improved to a detection method suitable for small targets in urban UAV images; thus, the problem of missed detection and false detection of small targets in UAV image detection is solved, and the detection accuracy of small targets is improved.

以下先对本发明的技术术语进行解释。The technical terms of the present invention are first explained below.

IoU(交并比，Intersection over Union)：IoU是先验框和预测框的交集与并集之比。IoU的取值范围为[0,1]，当真实框和先验目标框之间没有相交区域时IoU＝0，当两框完全重合时IoU＝1。IoU值越大说明真实框和先验目标框的形状和位置越相似。使度量值与相似度负相关，即度量值越小相似度越大，得到：d_iou＝1-IOU,即先验框和预测框的距离度量。IoU (Intersection over Union): IoU is the ratio of the intersection and union of the prior box and the predicted box. The value range of IoU is [0, 1], IoU=0 when there is no intersecting area between the real frame and the prior target frame, and IoU=1 when the two frames are completely coincident. The larger the IoU value, the more similar the shape and position of the ground-truth box and the prior target box. The metric value is negatively correlated with the similarity, that is, the smaller the metric value is, the greater the similarity is, and d _iou =1-IOU is obtained, that is, the distance metric between the prior frame and the prediction frame.

K-means聚类：是一种迭代求解的聚类分析算法。在本发明中的步骤是，根据初始给定的边界框的高宽，首先随机选取9个初始的聚类中心，然后计算每个初始给定边界框的高宽距离9个聚类中心的距离情况。把边界框宽高组成的样本点分配给距离最短的聚类中心。完成一次9个样本点的分配，聚类中心会根据中心点附近现有的样本点重新进行计算，得出新的聚类中心点。不断重复上述过程，直到满足终止条件，即没有样本点被重新分配给不同的聚类中心或者聚类中心不再发生变化。K-means clustering: It is an iterative solution clustering analysis algorithm. The steps in the present invention are: according to the height and width of the initially given bounding box, first randomly select 9 initial cluster centers, and then calculate the distance between the height and width of each initial given bounding box from the 9 cluster centers Happening. The sample points composed of the width and height of the bounding box are assigned to the cluster centers with the shortest distance. After completing the allocation of 9 sample points, the cluster center will be recalculated according to the existing sample points near the center point to obtain a new cluster center point. The above process is repeated continuously until the termination condition is satisfied, that is, no sample points are reassigned to different cluster centers or the cluster centers no longer change.

非极大抑制(Non-Maxium Suppression,NMS)：搜索局部最大值，抑制极大值。按照阈值，通过遍历、排序等过滤掉重复的检测框。在计算机视觉中得到了广泛的应用，例如边缘检测、目标检测等。Non-Maxium Suppression (NMS): Search for local maxima and suppress maxima. According to the threshold, duplicate detection frames are filtered out by traversing, sorting, etc. It has been widely used in computer vision, such as edge detection, object detection, etc.

本发明提供的技术方案是：The technical scheme provided by the present invention is:

一种基于城市无人机图像的小尺度目标检测方法，包括如下步骤：A small-scale target detection method based on urban UAV images, comprising the following steps:

1)初始化原始图像：1) Initialize the original image:

获取无人机图像的存储路径、检测目标种类、检测目标在图像中的位置信息，并根据目标边界框的位置和长宽得到的原始图片的目标位置和分类结果；Obtain the storage path of the UAV image, the type of detection target, the position information of the detection target in the image, and obtain the target position and classification result of the original image according to the position and length and width of the target bounding box;

2)图像数据增强：2) Image data enhancement:

通过Mosaic数据增强的方式，将原始图片利用随机缩放、随机裁剪、随机排布的方式进行拼接，之后获得一张包含目标边界框的新图片；By means of Mosaic data enhancement, the original images are spliced by random scaling, random cropping, and random arrangement, and then a new image containing the target bounding box is obtained;

3)图像特征提取：3) Image feature extraction:

3-1)针对不同的无人机图像尺寸大小，提前设置好固定的初始锚框大小，之后对边框大小进行聚类，将得到的结果和真实边界框大小进行比对，通过计算两者之间的差距，选取合适的边框大小来反向更新设定的初始边框大小，根据目标大小自适应调整边框大小；3-1) For different UAV image sizes, set a fixed initial anchor frame size in advance, then cluster the frame size, compare the obtained result with the real bounding box size, and calculate the difference between the two. The gap between the two, select the appropriate border size to reversely update the set initial border size, and adjust the border size adaptively according to the target size;

3-2)利用双层CSPDarknet53骨干网络相叠加的复合骨干网络，对同一个尺度的特征图进行多次重复采样，增强骨干网络对小目标的特征提取能力，其处理流程如下式所示：3-2) Using the composite backbone network superimposed by the double-layer CSPDarknet53 backbone network, the feature maps of the same scale are repeatedly sampled for many times to enhance the feature extraction capability of the backbone network for small targets. The processing flow is shown in the following formula:

其中，假设每个骨干网络有L个阶段，

表示在第l阶段主骨干网络的操作，

表示在第l阶段辅助骨干网络输出的特征图，

表示在第l阶段主骨干网络输出的特征图，

表示在第l-1阶段主骨干网络输出的特征图，G(·)表示辅助骨干网络和主骨干网络之间的1×1的卷积和归一化操作。Among them, it is assumed that each backbone network has L stages,

represents the operation of the backbone network in phase l,

represents the feature map output by the auxiliary backbone network in stage l,

represents the feature map output by the main backbone network in the lth stage,

represents the feature map output by the main backbone network at stage l-1, G( ) represents the 1×1 convolution and normalization operations between the auxiliary backbone network and the main backbone network.

4)多尺度特征融合：4) Multi-scale feature fusion:

使用自顶向下和自底向上双向路径的双向特征金字塔网络，多次双向融合多尺度的图像特征信息；Use a bidirectional feature pyramid network with top-down and bottom-up bidirectional paths to fuse multi-scale image feature information bidirectionally multiple times;

5)目标分类和位置预测：5) Object classification and location prediction:

利用空间金字塔注意力机制实现目标定位和分类。Using spatial pyramid attention mechanism to achieve object localization and classification.

进一步，在原始图像上用不同的颜色绘出每个目标的边界框以及其分类名称和置信度。Further, the bounding box of each object along with its classification name and confidence is drawn with different colors on the original image.

与现有技术相比，本发明的有益效果是：Compared with the prior art, the beneficial effects of the present invention are:

本发明提供一种基于无人机图像的小目标检测方法，在原本yolov5的基础上，利用多重复合骨干网络来加强对小尺度目标的特征提取能力，提升目标检测网络的特征提取能力；通过双向特征金字塔网络来加强网络对多尺度特征的融合能力，用于在训练过程中进行小尺度目标特征的多尺度学习；利用空间金字塔注意力机制来加强网络模型对小尺度目标的检测能力，进一步加强模型对小目标的检测性能。The invention provides a small target detection method based on unmanned aerial vehicle images. On the basis of the original yolov5, multiple composite backbone networks are used to strengthen the feature extraction capability of small-scale targets and improve the feature extraction capability of the target detection network; The feature pyramid network is used to strengthen the fusion ability of the network to multi-scale features, which is used for multi-scale learning of small-scale target features in the training process; the spatial pyramid attention mechanism is used to strengthen the network model's detection ability of small-scale targets, and further strengthen The detection performance of the model for small objects.

附图说明Description of drawings

图1是本发明实施例基于城市无人机图像的小目标检测方法的流程框图。FIG. 1 is a flowchart of a small target detection method based on an urban UAV image according to an embodiment of the present invention.

图2是本发明实施例目标检测方法的各阶段的关系示意图。FIG. 2 is a schematic diagram showing the relationship of each stage of the target detection method according to the embodiment of the present invention.

图3是本发明实施例提供的双向特征金字塔网络的流程框图。FIG. 3 is a flowchart of a bidirectional feature pyramid network provided by an embodiment of the present invention.

具体实施方式Detailed ways

为了使本发明的目的、技术方案以及实施方式更加容易理解，以下结合附图以及实施例对本发明作进一步说明。本实施例仅用以解释本发明，并不限定本发明。In order to make it easier to understand the purpose, technical solutions and embodiments of the present invention, the present invention will be further described below with reference to the accompanying drawings and embodiments. This embodiment is only used to explain the present invention, and does not limit the present invention.

本发明基于城市无人机图像的小尺度目标检测方法，通过对目标检测方法YOLOv5进行改进,基于复合骨干网络、双向特征金字塔网络和空间金字塔注意力机制构建检测模型，根据模型得到多尺度特征层的预测结果来实现对目标的分类和位置预测。该方法具体包括6个执行步骤，如图1所示，具体包括：The invention is based on a small-scale target detection method based on urban UAV images. By improving the target detection method YOLOv5, a detection model is constructed based on a composite backbone network, a bidirectional feature pyramid network and a spatial pyramid attention mechanism, and a multi-scale feature layer is obtained according to the model. The prediction results are used to realize the classification and location prediction of the target. The method specifically includes 6 execution steps, as shown in Figure 1, including:

1)对原始图像进行初始化：本发明实施例所使用的数据集是通过无人机拍摄的8629张图片，其中每张图片都有对应的标准文件，其中包含了图片中目标的种类，边界框的左上角坐标等信息。通过标准文件，获取无人机图像的存储路径、检测目标种类、检测目标在图像中的位置信息，根据标注好的目标边界框的左上角坐标和长宽获取原始图片中的目标位置和分类结果，打乱图像数据的存储顺序，随机生成训练集和测试集；1) Initialize the original image: The data set used in the embodiment of the present invention is 8629 pictures taken by drones, and each picture has a corresponding standard file, which contains the type of the target in the picture, and the bounding box. The coordinates of the upper left corner and other information. Obtain the storage path of the UAV image, the type of detection target, and the position information of the detection target in the image through the standard file, and obtain the target position and classification result in the original image according to the upper left corner coordinates and length and width of the marked target bounding box. , shuffle the storage order of image data, and randomly generate training set and test set;

2)对图像进行数据增强：通过Mosaic数据增强的方式，将四张原始图片利用随机缩放、随机裁剪、随机排布的方式进行拼接，之后获得一张包含目标边界框的新图片，然后将得到的新图片输入到模型中进行学习，提升了对小目标的检测效果。从数据图像中每次随机读取四张图片，分别对四张图片进行左右的翻转，大小的缩放，明亮度、饱和度、色调进行改变等操作，操作完成之后再将处理后的原始图片按照第一张图片摆放在左上，第二张图片摆放在左下，第三张图片摆放在右下，第四张图片摆放在右上四个方向位置摆好。之后进行图片的组合和框的组合，在完成四张图片的摆放之后，利用矩阵的方式将四张图片固定的区域截取下来，然后将它们拼接起来，拼接成一张新的图片，新的图片上含有框等一系列的内容。拼接完成之后得到的新的一张图片，当图片的框或者图片本身超出两张图片之间的边缘分割线的时候，把这个超出分割线的部分框或者图片的部分处理掉，进行边缘处理。2) Data enhancement of the image: Through the Mosaic data enhancement method, the four original images are spliced by random scaling, random cropping, and random arrangement, and then a new image containing the target bounding box is obtained. The new images are input into the model for learning, which improves the detection effect of small targets. Four pictures are randomly read from the data image each time, and the four pictures are flipped left and right, the size is scaled, and the brightness, saturation, and hue are changed. After the operation is completed, the processed original pictures are processed according to The first picture is placed in the upper left, the second picture is placed in the lower left, the third picture is placed in the lower right, and the fourth picture is placed in the upper right four directions. After that, the combination of pictures and the combination of boxes are carried out. After the placement of the four pictures is completed, the fixed area of the four pictures is intercepted by means of a matrix, and then they are spliced together to form a new picture, a new picture It contains a series of content such as boxes. For a new picture obtained after the stitching is completed, when the frame of the picture or the picture itself exceeds the edge dividing line between the two pictures, the part of the frame or the picture that exceeds the dividing line is processed, and edge processing is performed.

3)对图像中的目标进行特征提取，如图2所示：3) Perform feature extraction on the target in the image, as shown in Figure 2:

3-1)首先，针对不同的无人机图像尺寸大小，提前分别在80×80、40×40和20×20三个尺度上针对大中小三种类型目标设置好初始锚框大小。之后在这个取值的基础上使用K-means聚类算法聚类训练集中所有目标边界框的宽高，结果得到9个宽高组合，也即得到9个目标锚框。在此过程中，为了避免误差的出现，使用IoU距离度量K-means算法的样本距离。其中，IoU的取值范围为[0,1]，当真实边界框和目标锚框之间没有相交区域时IoU＝0，当两框完全重合时IoU＝1。IoU值越大说明真实边界框和目标锚框的形状和位置越相似。为使度量值与相似度负相关，即度量值越小相似度越大，对IoU值取个负值并加1，得到：d_iou＝1-IOU,即使用K-means算法聚类锚框宽高时使用的距离度量。通过利用K-means聚类算法得到的目标锚框结果和真实边界框大小进行比对，计算两者之间的差距来反向更新网络，迭代网络参数，不断根据目标大小自适应的调整锚框大小。3-1) First, according to different UAV image sizes, the initial anchor frame sizes are set in advance for three types of targets, large, medium and small, on three scales of 80×80, 40×40 and 20×20. Then, on the basis of this value, the K-means clustering algorithm is used to cluster the width and height of all target bounding boxes in the training set, and as a result, 9 width and height combinations are obtained, that is, 9 target anchor boxes are obtained. In this process, in order to avoid the appearance of errors, the IoU distance is used to measure the sample distance of the K-means algorithm. Among them, the value range of IoU is [0, 1], IoU=0 when there is no intersecting area between the real bounding box and the target anchor box, and IoU=1 when the two boxes are completely coincident. The larger the IoU value, the more similar the shape and position of the ground-truth bounding box and the target anchor box. In order to make the metric value negatively correlated with the similarity, that is, the smaller the metric value, the greater the similarity, take a negative value for the IoU value and add 1 to get: d _iou = 1-IOU, that is, use the K-means algorithm to cluster anchor boxes The distance metric to use for width and height. By comparing the target anchor box results obtained by the K-means clustering algorithm with the real bounding box size, calculating the gap between the two to update the network in reverse, iterating the network parameters, and continuously adjusting the anchor box adaptively according to the target size size.

3-2)通过双层CSPDarknet53骨干网络相叠加的方式，对同一个尺度的特征图进行多重采样，增强骨干网络对小目标的特征提取能力。3-2) Multi-sampling the feature maps of the same scale by superimposing the two-layer CSPDarknet53 backbone network to enhance the feature extraction capability of the backbone network for small targets.

基于CSPDarknet53的复合骨干网络：在yolov5中的CSPDarknet53骨干网络的基础上，与复合骨干网络相结合所提出的加强小目标特征提取的网络结构。本发明实施例中使用640×640大小的图像，将在辅助骨干网络中经过卷积处理后得到的特征图像输入到同一尺度的主骨干网络后，将其和主骨干网络的特征图像进行叠加融合，再将其作为输入内容输入到主骨干网络模型当中。在不同尺度的骨干网络阶段采取相同的操作，加强骨干网络对不同尺度目标的特征提取。具体步骤包括：Composite backbone network based on CSPDarknet53: Based on the CSPDarknet53 backbone network in yolov5, combined with the composite backbone network, the proposed network structure to enhance feature extraction of small objects. In the embodiment of the present invention, an image with a size of 640×640 is used, and the feature image obtained after convolution processing in the auxiliary backbone network is input into the main backbone network of the same scale, and then superimposed and fused with the feature image of the main backbone network. , and then input it as input into the backbone network model. The same operations are taken in the backbone network stages of different scales to strengthen the feature extraction of the backbone network for objects of different scales. Specific steps include:

首先，将大小为640×640×3的原始图像经过Focus切片，得到320×320×12的特征图，再经过3×3的卷积操作，得到的320×320×32特征图定义为x0。在辅助骨干网络中，将特征图x⁰通过3×3的卷积、归一化、激活函数操作之后，扩张通道得到160×160×64的图像特征，之后再通过残差特征学习模块将使用3个1×1的卷积、归一化、激活函数操作后的特征图和输入该模块的特征图融合之后的得到包含更多特征信息的160×160×64特征图

之后，进行相似的操作，将特征图

通过3×3的卷积、归一化、激活函数操作以及残差特征学习模块之后，得到80×80×128的特征图

再将特征图

进行3×3的卷积操作和残差学习之后，得到40×40×256的特征图

最后，将特征图

通过3×3的卷积操作得到的20×20×512的特征图，通过空间金字塔池化模块，将5×5，9×9，13×13的最大池化操作得到的结果和模块输入结果相融合，再进行残差特征学习后得到辅助骨干网络的最终输出的20×20×512特征图

First, the original image with a size of 640×640×3 is sliced by Focus to obtain a feature map of 320×320×12, and then through a 3×3 convolution operation, the obtained 320×320×32 feature map is defined as x0. In the auxiliary backbone network, after the feature map x ⁰ is operated by 3×3 convolution, normalization, and activation function, the expansion channel obtains 160×160×64 image features, and then the residual feature learning module will use 3 1×1 convolution, normalization, and activation function operation feature maps and the feature maps input to the module are merged to obtain a 160×160×64 feature map containing more feature information.

After that, a similar operation is performed to convert the feature map

After 3×3 convolution, normalization, activation function operation and residual feature learning module, a feature map of 80×80×128 is obtained

Then the feature map

After 3×3 convolution operation and residual learning, a 40×40×256 feature map is obtained

Finally, the feature map

The feature map of 20 × 20 × 512 obtained by the 3 × 3 convolution operation, through the spatial pyramid pooling module, the results obtained by the 5 × 5, 9 × 9, and 13 × 13 max pooling operations and the module input results After fusion, the residual feature learning is performed to obtain a 20×20×512 feature map of the final output of the auxiliary backbone network.

在主骨干网络中，首先将经过辅助骨干网络得到的特征图

通过1×1的卷积、归一化操作和经过切片处理后得到的特征图x⁰相叠加得到320×320×32的特征图定义为

之后，将特征图

通过3×3的卷积、归一化、激活函数操作和残差特征学习模块之后的得到160×160×64特征图

将特征图

通过卷积和归一化操作后和特征图

叠加之后的结果，再进行相似的操作，通过卷积归一化操作以及残差特征学习模块之后，得到80×80×128的特征图

之后，同样将经过特征图

通过卷积归一化操作后和特征图

叠加，将其作为输入特征输入到主骨干网络中，再次通过卷积和残差学习得到40×40×256的特征图

最后，将特征图

通过卷积归一化后和特征图

叠加后，输入到网络中，进行卷积操作、空间金字塔池化和残差学习之后，得到主骨干网络的最终输出的20×20×512特征图

其中通过主骨干网络得到的特征图

和

将作为骨干网络在80×80、40×40和20×20三个尺度上提取到的特征输出到模型的下一阶段进行处理。In the main backbone network, the feature map obtained through the auxiliary backbone network is firstly

The feature map of 320 × 320 × 32 is obtained by superimposing the feature map x ⁰ obtained after 1 × 1 convolution, normalization and slicing, which is defined as

After that, the feature map

The 160×160×64 feature map is obtained after 3×3 convolution, normalization, activation function operation and residual feature learning module

feature map

After operation and feature map through convolution and normalization

After superimposing the results, similar operations are performed. After the convolution normalization operation and the residual feature learning module, a feature map of 80×80×128 is obtained.

After that, the same will go through the feature map

After normalization operation and feature map by convolution

Superimpose, input it into the main backbone network as an input feature, and obtain a feature map of 40×40×256 through convolution and residual learning again

Finally, the feature map

After normalization and feature map by convolution

After superposition, input into the network, after convolution operation, spatial pyramid pooling and residual learning, the 20×20×512 feature map of the final output of the main backbone network is obtained

The feature map obtained through the main backbone network

and

The features extracted as the backbone network at three scales of 80×80, 40×40 and 20×20 are output to the next stage of the model for processing.

综上所述，在本发明特征提取阶段的处理流程如下式所示：To sum up, the processing flow in the feature extraction stage of the present invention is shown in the following formula:

其中，假设每个骨干网络有L个阶段，

表示在第l阶段主骨干网络的操作，

表示在第l阶段辅助骨干网络输出的特征图，

表示在第l阶段主骨干网络输出的特征图，

represents the operation of the backbone network in phase l,

represents the feature map output by the auxiliary backbone network in stage l,

4)多尺度特征融合：在多尺度特征融合阶段，通过使用自顶向下和自底向上双向路径的双向特征金字塔网络，多次双向融合多尺度的特征信息，强化对小目标的特征融合能力。通过特征提取得到的特征图，直接通过跳跃连接将输入到同一尺度的特征节点进行特征融合，融合了更多的特征。同时，四次调用自顶向下和自底向上双向路径看双向特征金字塔网络，实现了更高层次的特征融合。4) Multi-scale feature fusion: In the multi-scale feature fusion stage, by using a bidirectional feature pyramid network with top-down and bottom-up bidirectional paths, multi-scale feature information is fused in multiple directions in both directions to strengthen the feature fusion capability for small targets. . The feature map obtained by feature extraction directly fuses the feature nodes input to the same scale through skip connection, and fuses more features. At the same time, the top-down and bottom-up bidirectional paths are called four times to see the bidirectional feature pyramid network, which achieves higher-level feature fusion.

双向特征金字塔网络：是本发明中的加强特征多尺度融合结构，如图3所示。yolov5的PANet结构能够融合多尺度的目标特征信息，但是小尺度目标与大尺度目标的融合过程中容易造成特征缺失，而且在训练过程中需要消耗大量的时间，从而影响模型对小尺度目标的检测能力和效率。为此，本发明通过引入的双向特征金字塔网络，移除了80×80和20×20两个尺度的中间特征图中间结点，只保留40×40尺度的中间特征图结点，从而产生一个简单的双向网络；此外在同一尺度的输入特征图节点和输出特征图节点之间添加跳跃连接，在不增加计算成本的同时，融合了更多的特征；最后，将能够实现自顶向下和自底向上双向路径的该网络结构看作一个特征网络层,在模型中四次重复调用该特征网络层，在80×80、40×40、20×20三个尺度上多次双向融合不同尺度的目标特征信息，实现更高层次的小目标多尺度的更好的融合。具体步骤包括：Bidirectional Feature Pyramid Network: It is an enhanced feature multi-scale fusion structure in the present invention, as shown in FIG. 3 . The PANet structure of yolov5 can fuse multi-scale target feature information, but the fusion process of small-scale targets and large-scale targets is easy to cause missing features, and it takes a lot of time in the training process, which affects the model's detection of small-scale targets capability and efficiency. To this end, the present invention removes the intermediate feature map nodes of 80 × 80 and 20 × 20 scales by introducing a bidirectional feature pyramid network, and only retains the intermediate feature map nodes of 40 × 40 scale, thereby generating a A simple bidirectional network; in addition, adding skip connections between input feature map nodes and output feature map nodes of the same scale can integrate more features without increasing computational cost; finally, it will be able to achieve top-down and The network structure of the bottom-up bidirectional path is regarded as a feature network layer. The feature network layer is repeatedly called four times in the model, and different scales are fused in two directions on three scales of 80×80, 40×40, and 20×20. The target feature information can achieve better fusion of higher-level small targets at multiple scales. Specific steps include:

首先，将特征提取阶段提取到的20×20×512特征图

经过1×1的卷积处理后得到的20×20×256特征图定义为

之后将

通过上采样得到40×40×256特征图，和特征提取到的40×40×256的特征图

叠加后，通过双向特征金字塔网络双向多尺度融合得到40×40×512的特征图，之后通过残差特征学习得到40×40×256的特征图，再经过1×1的卷积处理后得到40×40×128的特征图

将

继续上采样得到80×80×128的特征图，将其和提取到的80×80×128特征图

叠加融合，通过双向特征金字塔网络得到80×80×256的特征图，再经过残差特征学习得到80×80×128的特征图

在这一过程中，网络融合了80×80、40×40和20×20三个尺度上的图像特征，实现了自顶向下的特征融合。First, extract the 20×20×512 feature map from the feature extraction stage

The 20×20×256 feature map obtained after 1×1 convolution processing is defined as

will later

The 40×40×256 feature map is obtained by upsampling, and the 40×40×256 feature map extracted from the feature

After superposition, the feature map of 40×40×512 is obtained by bidirectional multi-scale fusion of the bidirectional feature pyramid network, and then the feature map of 40×40×256 is obtained by residual feature learning, and then 40×40 is obtained after 1×1 convolution processing. ×40×128 feature map

Will

Continue upsampling to obtain a feature map of 80×80×128, and sum it with the extracted 80×80×128 feature map

Stacking and fusion, the feature map of 80×80×256 is obtained through the bidirectional feature pyramid network, and the feature map of 80×80×128 is obtained through residual feature learning.

In this process, the network fuses image features at three scales of 80×80, 40×40 and 20×20 to achieve top-down feature fusion.

之后，

经过3×3的卷积处理来进行下采样之后得到40×40×128的特征图，之后再和40×40×128的特征图

叠加，接着通过双向多尺度特征融合得到40×40×256的特征图，再进行残差特征学习后得到40×40×256的特征图

经过相似的操作，通过3×3卷积来进行下采样得到20×20×256的特征图，之后再和20×20×256的特征图

叠加，通过双向特征融合得到20×20×512的特征图，进行残差特征学习后得到20×20×512的特征图

在这一阶段，网络则实现了自底向上的融合三个尺度上的图像特征。其中，特征图

和

作为模型在80×80、40×40和20×20三个尺度上的特征结果输出到模型的下一阶段进行结果预测。after,

After 3×3 convolution processing for downsampling, the feature map of 40×40×128 is obtained, and then the feature map of 40×40×128 is added.

Overlay, and then obtain a feature map of 40 × 40 × 256 through bidirectional multi-scale feature fusion, and then perform residual feature learning to obtain a feature map of 40 × 40 × 256

After a similar operation, the 20×20×256 feature map is obtained by downsampling through 3×3 convolution, and then the 20×20×256 feature map is added.

Overlay, a feature map of 20 × 20 × 512 is obtained through bidirectional feature fusion, and a feature map of 20 × 20 × 512 is obtained after residual feature learning

At this stage, the network achieves bottom-up fusion of image features at three scales. Among them, the feature map

and

As the feature results of the model at three scales of 80×80, 40×40 and 20×20, it is output to the next stage of the model for result prediction.

5)进行目标分类和位置预测：5) Perform target classification and location prediction:

本发明利用空间金字塔注意力机制实现目标定位和分类。空间金字塔注意力机制：是本发明中使用的注意力机制来强化预测结果，如图2所示。在三个模型输出分支上使用空间金字塔注意力机制，输入的特征图通过在1×1、2×2、4×4三个尺度上自适应平均池化的空间金字塔结构生成注意力图。之后，将生成的注意力图通过一个全连接层和sigmoid激活层组合的权重来生成对应特征图中的注意力权重，将原始图像中的小目标更准确的标注出来。具体包括如下步骤：The invention utilizes the spatial pyramid attention mechanism to realize target positioning and classification. Spatial pyramid attention mechanism: is the attention mechanism used in the present invention to strengthen the prediction result, as shown in Figure 2. The spatial pyramid attention mechanism is used on the three model output branches, and the input feature maps generate attention maps through the spatial pyramid structure of adaptive average pooling at three scales of 1×1, 2×2, and 4×4. After that, the generated attention map is combined with a fully connected layer and a sigmoid activation layer to generate the attention weight in the corresponding feature map, and the small objects in the original image are more accurately marked. Specifically include the following steps:

5-1)利用注意力机制将模型的注意力集中在图像中的小目标部分，提取图像中的关键信息的同时忽略背景等无关信息的干扰，提高小目标定位和分类性能。首先，经过特征提取和多尺度融合的特征图

和

通过在80×80、40×40和20×20三个尺度上自适应平均池化的空间金字塔结构生成注意力图。其中的1×1的自适应平均池化层的目的是获取特征图中的类别关键信息，2×2池化层用来保存图像中的次重要的关键特征信息，4×4平均池化可以有效的获取特征图中的关键位置信息。5-1) Use the attention mechanism to focus the model's attention on the small target part in the image, extract the key information in the image while ignoring the interference of irrelevant information such as background, and improve the small target localization and classification performance. First, the feature map after feature extraction and multi-scale fusion

and

The attention maps are generated through a spatial pyramid structure with adaptive average pooling at three scales of 80×80, 40×40 and 20×20. The purpose of the 1×1 adaptive average pooling layer is to obtain the key category information in the feature map, the 2×2 pooling layer is used to save the less important key feature information in the image, and the 4×4 average pooling can Effectively obtain key location information in the feature map.

5-2)将生成的注意力图通过由一个全连接层和sigmoid激活层组合成的权重来生成对应特征图中的注意力权重。从而通过该注意力输出的注意力权重，将原始图像中的小目标更准确的标注出来。5-2) Pass the generated attention map through a weight composed of a fully connected layer and a sigmoid activation layer to generate the attention weights in the corresponding feature map. Thus, through the attention weight of the attention output, the small objects in the original image are more accurately marked.

本发明可以利用预测边界框的位置和得分进行非极大抑制，选择效果最好预测边界框作为目标边界框，避免一个目标有多个预测边界框。对预测模型得出的图像目标预测边界框，根据其预测得分由高到低进行排序；计算预测得分最大的预测边界框，与其他所有预测框的IoU值；当IoU值超过设定的阈值时，去除该预测边界框，然后在剩下预测边界框中继续寻找IoU最高的，再去除与之IoU超过阈值的预测边界框，直到最后会保留几乎没有重叠的预测边界框，做到每个目标只剩下一个预测边界框。之后再对该预测边界框区域内的目标种类进行排序和筛选，同一目标种类取其置信度的最大值，并筛选出在不同目标类别的置信度最大值排序中排名最高的种类，将其作为目标预测边界框区域的目标种类。The present invention can use the position and score of the predicted bounding box for non-maximum suppression, select the predicted bounding box with the best effect as the target bounding box, and avoid multiple predicted bounding boxes for one target. Sort the predicted bounding box of the image target obtained by the prediction model according to its prediction score from high to low; calculate the predicted bounding box with the largest predicted score, and the IoU value of all other predicted boxes; when the IoU value exceeds the set threshold , remove the predicted bounding box, and then continue to find the highest IoU in the remaining predicted bounding boxes, and then remove the predicted bounding box whose IoU exceeds the threshold, until the predicted bounding box with almost no overlap will be retained until each target. Only one prediction bounding box remains. Then, sort and filter the target types in the predicted bounding box area. The same target type takes the maximum value of its confidence, and selects the type with the highest ranking in the ranking of the maximum confidence value of different target categories, and takes it as Object The kind of object that predicts the bounding box region.

6)可视化：在原始图像上用不同的颜色绘出每个目标的预测边界框，以及其类别种类和置信度。6) Visualization: Draw the predicted bounding box of each object, its category category and confidence level on the original image with different colors.

需要注意的是，公布实施例的目的在于帮助进一步理解本发明，但是本领域的技术人员可以理解：在不脱离本发明及所附权利要求的精神和范围内，各种替换和修改都是可能的。因此，本发明不应局限于实施例所公开的内容，本发明要求保护的范围以权利要求书界定的范围为准。It should be noted that the purpose of publishing the embodiments is to help further understanding of the present invention, but those skilled in the art can understand that various replacements and modifications are possible without departing from the spirit and scope of the present invention and the appended claims of. Therefore, the present invention should not be limited to the contents disclosed in the embodiments, and the scope of protection of the present invention shall be subject to the scope defined by the claims.

Claims

1. a small-scale target detection method based on urban UAV image, is characterized in that, comprises the steps:

1) Initialize the original image:

Obtain the storage path of the UAV image, the type of detection target, the position information of the detection target in the image, and obtain the target position and classification result of the original image according to the position and length and width of the target bounding box;

2) Image data enhancement:

By means of Mosaic data enhancement, the original images are spliced by random scaling, random cropping, and random arrangement, and then a new image containing the target bounding box is obtained;

3) Image feature extraction:

3-1) Set a fixed initial anchor box size, then cluster the target bounding box size, compare the obtained target anchor box results with the real bounding box size, and select the appropriate one by calculating the gap between the two. The size of the target anchor frame is reversely updated to set the initial anchor frame size, and the anchor frame size is adaptively adjusted according to the target size;

3-2) Using the composite backbone network with the double-layer CSPDarknet53 backbone network superimposed, the feature maps of the same scale are repeatedly sampled for many times, and the processing flow is shown in the following formula:

Among them, it is assumed that each backbone network has L stages,

represents the operation of the backbone network in phase l,

represents the feature map output by the auxiliary backbone network in stage l,

represents the feature map output by the main backbone network in stage l-1, G( ) represents the 1×1 convolution and normalization operations between the auxiliary backbone network and the main backbone network;

4) Multi-scale feature fusion:

Use a bidirectional feature pyramid network with top-down and bottom-up bidirectional paths to fuse multi-scale image feature information bidirectionally multiple times;

5) Object classification and location prediction:

Using spatial pyramid attention mechanism to achieve object localization and classification.

2. the small-scale target detection method based on urban UAV image according to claim 1, is characterized in that, after setting initial anchor frame size in step 3-1), use K-means on the basis of this value The clustering algorithm clusters the width and height of all target bounding boxes in the training set, and the result is the target anchor box. The IoU distance is used to measure the sample distance of the K-means algorithm. The value range of IoU is [0, 1]. IoU=0 when there is no intersection area between the frame and the target anchor frame, and IoU=1 when the two frames are completely coincident.

3. the small-scale target detection method based on urban UAV image according to claim 1, is characterized in that, step 5) specifically comprises:

5-1) Extract the feature map of multi-scale fusion, and generate the attention map in the spatial pyramid structure of adaptive average pooling;

5-2) Use the generated attention map to generate the attention weight in the corresponding feature map through a weight composed of a fully connected layer and a sigmoid activation layer. The target is accurately marked.

4 . The small-scale target detection method based on urban UAV images according to claim 3 , wherein the key information in the multi-scale fusion feature map is extracted while ignoring the interference of background irrelevant information. 5 .

5. the small-scale target detection method based on urban UAV image according to claim 1, is characterized in that, step 5-2) utilizes the position and the score of predicted bounding box to carry out non-maximum suppression, namely calculates the maximum predicted score. The predicted bounding box, and the IoU value of all other predicted boxes; when the IoU value exceeds the set threshold, remove the predicted bounding box, and then continue to find the highest IoU in the remaining predicted bounding boxes, and then remove the one with the IoU exceeding the threshold The prediction bounding box of , until the end will keep almost no overlapping boxes, so that only one prediction bounding box is left for each target.

6. The small-scale target detection method based on urban UAV images according to claim 5, characterized in that, sorting and screening the target types in the screened predicted bounding box area, and the same target type is taken as its confidence level. The maximum value is selected, and the category with the highest ranking in the order of the maximum confidence value of different target categories is selected as the target category of the target prediction bounding box area.

7. The small-scale target detection method based on urban UAV images according to claim 1, wherein the predicted bounding box of each target, as well as its category type and confidence level, are drawn with different colors on the original image. .