Object detection method with adaptive fusion of multi-level local and global features
Technical Field
The invention relates to object detection methods for intelligent systems (such as intelligent driving, intelligent monitoring, and intelligent interaction), and in particular to an object detection method based on deep learning.
Background
Object detection refers to locating the objects present in an image or video and assigning each a specific class. In recent years, object detection based on deep convolutional neural networks has achieved great success and is widely applied in fields such as intelligent driving, intelligent traffic, intelligent search, and intelligent authentication. For example, a smart car needs to detect obstacles ahead before decision and control, while an intelligent interaction system needs to detect the people to interact with before recognizing their gestures and instructions.
Owing to their strong feature expression capability, deep convolutional neural networks have achieved great success in tasks such as image classification, object detection, and semantic segmentation. Deep learning-based object detection methods mainly fall into two types: two-stage methods and single-stage methods. Two-stage methods generally achieve higher detection performance than single-stage methods, and this patent is primarily concerned with two-stage methods. A two-stage method mainly comprises two parts: a candidate detection window extraction network and a candidate detection window classification and regression network. To roughly extract the objects that may be present in the image, the candidate window extraction network generates a number of candidate detection windows. Based on these, the classification and regression network further classifies and regresses the candidate detection windows to obtain the final positions and classification scores of the detection windows.
Among two-stage methods, a representative one is Faster R-CNN [1]. It extracts and classifies candidate windows through a shared underlying network, and adopts an RoI pooling layer to extract global features of each candidate detection window's region of interest for classification and regression. It therefore ignores the local features of the object; in fact, local features are more advantageous for improving detection performance on occluded objects. Meanwhile, the RoI pooling layer scales the feature map of the original detection window region to a fixed size and is thus not robust to object deformation. To encode local information into the features, Dai et al. [2] propose the position-sensitive region-of-interest pooling layer (PSRoI). Specifically, PSRoI divides each region of interest into k×k sub-regions; the response value of each sub-region is the average response of the corresponding region on the corresponding channel of a position-sensitive feature map. The PSRoI-based detector R-FCN achieves detection performance similar to Faster R-CNN with a faster detection speed. Zhu et al. [3] integrate the RoI layer and the PSRoI layer to exploit both global and local features; however, that method lacks the ability to mine multi-scale features and to adaptively fuse these local and global features. To encode multi-scale features, He et al. [4] propose fusing multi-scale features with a spatial pyramid structure, which is more robust to object deformation. Zhao et al. [5] employ a similar structure to enhance semantic segmentation performance. Liu et al. [6] apply related ideas of the spatial pyramid structure to single-stage object detection. Wang et al. [7] fuse multi-scale features using a three-dimensional convolution operation.
References:
[1] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 2017.
[2] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object Detection via Region-based Fully Convolutional Networks," Proc. Advances in Neural Information Processing Systems, 2015.
[3] Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, and H. Lu, "CoupleNet: Coupling Global Structure with Local Parts for Object Detection," Proc. IEEE Computer Vision and Pattern Recognition, 2017.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, 2015.
[5] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid Scene Parsing Network," Proc. IEEE Computer Vision and Pattern Recognition, 2017.
[6] S. Liu, D. Huang, and Y. Wang, "Receptive Field Block Net for Accurate and Fast Object Detection," Proc. European Conference on Computer Vision, 2018.
[7] X. Wang, S. Zhang, Z. Yu, L. Feng, and W. Zhang, "Scale-Equalizing Pyramid Convolution for Object Detection," Proc. IEEE Computer Vision and Pattern Recognition, 2020.
Disclosure of Invention
The invention provides an object detection method with adaptive fusion of multi-level local and global features, which can fully exploit multi-scale features together with local and global features to improve object detection performance. The proposed method adopts multi-level PSRoI layers to extract multi-level local features and multi-level RoI layers to extract multi-level global features. Based on these multi-level features, the method predicts corresponding weight coefficients to generate the final feature map. The method can therefore learn local and global features at multiple scales, which benefits object detection performance. The technical solution is as follows:
An object detection method with adaptive fusion of multi-level local and global features comprises the following steps:
Step 1: prepare a training dataset for object detection, comprising training images and the corresponding object labels, where the labels are the coordinates of the detection box containing each object and the object's class;
Step 2: select a backbone network for object detection, build a candidate detection window extraction network, build the multi-level local and global feature adaptive fusion network MLGNet on top of the backbone network, and set the training loss functions of the candidate window extraction network and MLGNet;
the multi-level local and global feature adaptive fusion network MLGNet is constructed on the backbone network as follows:
For each candidate detection window, six branches are used to extract multi-level local and global features. Three branches first pass through three different convolution layers to generate three position-sensitive feature maps; based on these, three PSRoI layers extract position-sensitive feature maps of the candidate window's region of interest at 3×3, 5×5, and 7×7 sizes, which are then upsampled to a common 7×7 size by bilinear interpolation. The other three branches first generate a feature map through a convolution layer; based on this feature map, three different RoI layers extract global feature maps of the candidate window's region of interest at 3×3, 5×5, and 7×7 sizes, which are likewise upsampled to a common 7×7 size by bilinear interpolation;
A feature adaptive fusion unit is constructed as follows: the six 7×7 feature maps produced by the six branches above are concatenated along the channel dimension to obtain an original serial feature map; a global average pooling operation first generates a feature vector whose length equals the number of channels, a fully connected layer then generates a feature vector of length 6, and a Sigmoid layer normalizes this length-6 vector; the six values of the normalized vector are multiplied with the features of the corresponding six branches, the results are concatenated along the channel dimension to obtain an enhanced serial feature map, and the enhanced serial feature map is added to the original serial feature map to obtain the final output feature map;
These components together form the multi-level local and global feature adaptive fusion network MLGNet;
Step 3: initialize the network parameters of each part of the detector and the hyperparameters required by the training process;
Step 4: update the detector weights using the back-propagation algorithm; after the set number of training iterations, the final detector is obtained.
Drawings
FIG. 1 The multi-level local and global feature adaptive fusion network (MLGNet)
FIG. 2 The feature adaptive fusion unit
FIG. 3 Detailed implementation flow of the proposed method
Detailed Description
The invention mainly solves the technical problem of how to fully exploit multi-scale features and local and global features to improve object detection performance. To this end, this patent provides an object detection method with adaptive fusion of multi-level local and global features. Specifically, the proposed method employs multi-level PSRoI layers to extract multi-level local features and multi-level RoI layers to extract multi-level global features. Based on these multi-level features, the method predicts corresponding weight coefficients to generate the final feature map. The method can therefore learn local and global features at multiple scales, which benefits object detection performance.
We first present the proposed multi-level local and global feature adaptive fusion network, and then describe how to apply it to object detection.
(1) Multi-level local and global feature adaptive fusion network
Fig. 1 shows the basic structure of the multi-level local and global feature adaptive fusion network (MLGNet for short). For a given input image, a backbone network (e.g., VGG16 or ResNet) first generates a deep feature map. For each candidate detection window, MLGNet uses six branches to extract multi-level local and global features. Three branches first pass through three different convolution layers to generate three different position-sensitive feature maps. Based on these, three different PSRoI layers extract position-sensitive feature maps of the candidate window's region of interest at 3×3, 5×5, and 7×7 sizes, which are then upsampled to a common 7×7 size by bilinear interpolation. The other three branches first generate a feature map using a convolution layer. Based on this feature map, three different RoI layers extract 3×3, 5×5, and 7×7 global feature maps of the candidate window's region of interest, which are likewise upsampled to a common 7×7 size by bilinear interpolation.
Finally, the six 7×7 feature maps generated by the six branches are concatenated, and a new output feature map is generated by the adaptive fusion module; the new output features then pass through a convolution layer for the subsequent classification and regression tasks. Fig. 2 shows the feature adaptive fusion module. For the six concatenated feature maps, a global average pooling operation generates a feature vector whose length equals the number of channels; a fully connected layer then generates a feature vector of length 6, which is normalized by a sigmoid layer. The six normalized values are multiplied with the features of the corresponding six branches to obtain the enhanced feature map, and the enhanced feature map is added to the input feature map of the fusion module to obtain the new output features.
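The fusion module just described can be sketched in PyTorch as follows; the class and variable names are our own, and six branch maps of equal shape are assumed:

```python
# Illustrative PyTorch sketch of the feature adaptive fusion module;
# class and variable names are our own, and six branch maps of equal
# shape (N, C, 7, 7) are assumed.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, channels, branches=6):
        super().__init__()
        self.branches = branches
        # predicts one weight per branch from the concatenated features
        self.fc = nn.Linear(channels * branches, branches)

    def forward(self, feats):                 # list of (N, C, 7, 7) maps
        cat = torch.cat(feats, dim=1)         # original serial feature map
        v = cat.mean(dim=(2, 3))              # global average pooling
        w = torch.sigmoid(self.fc(v))         # (N, 6), normalized weights
        scaled = [f * w[:, i].view(-1, 1, 1, 1)
                  for i, f in enumerate(feats)]
        enhanced = torch.cat(scaled, dim=1)   # enhanced serial feature map
        return cat + enhanced                 # residual addition

fuse = AdaptiveFusion(channels=256)
feats = [torch.randn(2, 256, 7, 7) for _ in range(6)]
out = fuse(feats)                             # (2, 1536, 7, 7)
```

The sigmoid keeps each branch weight in (0, 1), and the residual addition of the original serial map preserves the unweighted features alongside the reweighted ones.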
The proposed MLGNet can extract multi-level local and global features and fuse them adaptively, which improves the expressive power and robustness of the features used for subsequent classification and regression.
(2) Object detection based on the multi-level local and global feature adaptive fusion network
This section describes how the proposed multi-level local and global feature adaptive fusion network MLGNet is applied to object detection. Object detection mainly comprises two different phases: a training phase and a testing phase. The training phase learns the MLGNet network parameters, and the testing phase uses the trained MLGNet to detect objects in images. We first introduce the training process:
Step 1: prepare a training dataset for object detection (e.g., the generic object detection dataset MS COCO). The data must contain a certain number of training images and the corresponding object labels; the labels are the coordinates of the detection box containing each object and the object's class.
Step 2: select a backbone network for the object detector, and build the candidate detection window extraction network and the proposed multi-level local and global feature adaptive fusion network MLGNet. Set the training loss functions of the candidate window extraction network and MLGNet.
Step 3: initialize the network parameters of each part of the detector and the hyperparameters required by the training process. The network parameters can be randomly initialized; the training hyperparameters include the number of iterations, the learning rate, the batch size, and so on.
Step 4: update the detector weights using the back-propagation algorithm. The final detector is obtained after the set number of training iterations.
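Steps 3 and 4 can be illustrated with a minimal, hypothetical training loop; a small linear layer stands in for the detector and random tensors stand in for the data, since the actual detector and loss are those defined in Steps 1 and 2:

```python
# Minimal, hypothetical illustration of Steps 3-4: random initialization,
# hyperparameter setup, and weight updates by back-propagation. A small
# linear layer stands in for the detector; the data are random tensors.
import torch

model = torch.nn.Linear(16, 4)       # stand-in for the detector
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
num_iters, batch_size = 100, 8       # training hyperparameters

for it in range(num_iters):
    x = torch.randn(batch_size, 16)          # stand-in training batch
    target = torch.randn(batch_size, 4)      # stand-in labels
    loss = torch.nn.functional.mse_loss(model(x), target)
    opt.zero_grad()
    loss.backward()                  # back-propagation
    opt.step()                       # weight update
```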
Next, we describe the testing process:
Step 1: prepare the images to be tested. Extract a number of candidate detection windows using the candidate detection window network, and precisely classify and regress them using the multi-level local and global feature adaptive fusion network MLGNet.
Step 2: post-process the MLGNet outputs with the non-maximum suppression (NMS) algorithm to generate the final detection results.
FIG. 3 shows a specific implementation of the proposed method, which mainly comprises the following steps:
Step 1: select an object detection dataset according to the application scenario; it comprises a number of images and the corresponding annotations, such as the position and class of each object.
Step 2: select a backbone network for the object detector, and build the candidate detection window extraction network and the proposed multi-level local and global feature adaptive fusion network MLGNet on top of it.
Step 3: initialize the network parameters of each part of the detector and the hyperparameters required by the training process.
Step 4: update the detector weights using the back-propagation algorithm; after the set number of training iterations, the final detector is obtained.
Step 5: prepare the images to be tested; extract a number of candidate windows using the candidate detection window network, and precisely classify and regress them using MLGNet.
Step 6: post-process the network outputs with the non-maximum suppression algorithm to generate the final detection results.