CN109101975B

CN109101975B - Image semantic segmentation method based on full convolution neural network

Info

Publication number: CN109101975B
Application number: CN201810947884.XA
Authority: CN
Inventors: 程建; 苏炎洲; 林莉; 高银星; 李恩泽
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2018-08-20
Filing date: 2018-08-20
Publication date: 2022-01-25
Anticipated expiration: 2038-08-20
Also published as: CN109101975A

Abstract

The invention discloses an image semantic segmentation method based on a fully convolutional neural network, relating to the fields of image semantic segmentation and deep learning. The method includes the following steps: selecting a training data set; constructing and training a classification model from images to category labels, which serves as the front-end network of the semantic segmentation model; down-sampling the feature map output by each block of the front-end network to a uniform size through a detail-preserving pooling layer, concatenating the four resulting feature maps, recalibrating the concatenated feature map through a feature recalibration module, and passing the result to the back-end network. The back-end network is mainly responsible for image upsampling; after upsampling, the features pass through a variable-weight global pooling layer, and finally the cross-entropy with the semantically labeled images of the training data set is computed and the error is back-propagated. The invention addresses the low image segmentation accuracy of the prior art.

Description

Image Semantic Segmentation Method Based on Fully Convolutional Neural Network

Technical Field

The present invention relates to the fields of image semantic segmentation and deep learning, and in particular to an image semantic segmentation method based on a fully convolutional neural network.

Background

Semantic segmentation is an important problem in computer vision. Image semantic segmentation assigns a label (category) to each pixel, so it can be regarded as a dense classification problem.

In recent years, the vast majority of state-of-the-art image semantic segmentation methods have been based on fully convolutional neural networks. A typical semantic segmentation network has an encoder-decoder structure. The encoder is an image downsampling process responsible for extracting coarse semantic features from the image. It is followed by a decoder, an image upsampling process responsible for upsampling the downsampled image features back to the original dimensions of the input image.

Pooling is a key component of the downsampling process in convolutional neural networks: it reduces the number of parameters, increases invariance to certain distortions, and enlarges the receptive field. However, pooling is inherently lossy, so during the image downsampling stage of semantic segmentation it causes a loss of semantic information and lowers the accuracy of the segmentation result.

In deep convolutional neural networks, strided convolutions are often used in place of pooling layers for downsampling. A strided convolution considers only one node at a fixed position in each local neighborhood, regardless of the importance of its activation. From the perspective of image downsampling, this also distorts the features. Fully convolutional neural networks underlie state-of-the-art image semantic segmentation algorithms for a large number of applications, and innovations in their network structure have mainly focused on improving spatial encoding or network connections to facilitate gradient flow.
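The point about fixed sampling positions can be illustrated with a small one-dimensional sketch. It is only an illustration, not part of the claimed method: the strided case is reduced to a stride-2 convolution with a single tap of weight 1 (an assumption made for clarity), and the function names and sample values are hypothetical.

```python
def strided_subsample(values, stride=2):
    """Stride-2 'convolution' with a single tap of weight 1:
    keeps the element at a fixed offset of every window."""
    return values[::stride]


def max_pool(values, size=2):
    """Keeps the largest activation of every non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


# Hypothetical activations: the strong responses sit at odd indices.
activations = [0.1, 0.9, 0.2, 0.1, 0.0, 0.7]

subsampled = strided_subsample(activations)  # fixed positions only
pooled = max_pool(activations)               # strongest response per window
```

In this toy input the fixed-position sampling discards both strong activations (0.9 and 0.7), while max pooling keeps them but discards everything else in the window. Both are lossy in different ways.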

Summary of the Invention

The purpose of the present invention is to provide an image semantic segmentation method based on a fully convolutional neural network, so as to solve the problem of low image segmentation accuracy in the prior art.

The technical solution of the present invention is as follows:

An image semantic segmentation method based on a fully convolutional neural network includes the following steps:

Step 1: Select a training data set.

Step 2: Build and train a classification model from images to category labels, and use it as the front-end network of the semantic segmentation model.

The front-end network of the semantic segmentation model comprises Conv1, Conv2_x, Conv3_x and Conv4_x. Each of Conv1, Conv2_x, Conv3_x and Conv4_x contains multiple convolutional layers, and each is followed by a detail-preserving pooling layer.

Step 3: Based on the trained front-end network of the semantic segmentation model, construct the back-end network of the semantic segmentation model.

The back-end network comprises detail-preserving pooling layers, a feature recalibration module, a 1×1 convolutional layer, Conv5_x, Conv6_x, Conv7_x, a variable-weight global pooling layer and upsampling layers. The outputs of Conv1, Conv2_x and Conv3_x each pass through one of three detail-preserving pooling layers, are concatenated with the output of Conv4_x, and are input together into the feature recalibration module. Conv5_x, Conv6_x and Conv7_x are each preceded by an upsampling layer, and each comprises convolutional layers, a batch normalization layer and a rectified linear unit. Through skip connections, Conv5_x, Conv6_x and Conv7_x are concatenated with the output feature maps of Conv3_x, Conv2_x and Conv1, respectively.

The output of the feature recalibration module passes through one 1×1 convolutional layer to obtain a feature map, which is upsampled and then connected to Conv5_x.

The variable-weight global pooling layer adds one weight vector to the 1×1 convolution in global average pooling. The parameters are initialized from a standard Gaussian distribution, and the pixel weights are updated continuously during error backpropagation.

Step 4: Train the entire image semantic segmentation model.

Step 5: Input a new image, perform one forward pass through the trained deep neural network model, and output the predicted semantic segmentation result end to end.

Specifically, the front-end network of the semantic segmentation model includes 33 residual structures, each containing one 1×1 convolution, one 3×3 convolution, one 1×1 convolution and one shortcut connection.

Specifically, the detail-preserving pooling layer after Conv1 downsamples the output feature map of Conv1 by a factor of 8, the one after Conv2_x downsamples the output feature map of Conv2_x by a factor of 4, and the one after Conv3_x downsamples the output feature map of Conv3_x by a factor of 2.

Specifically, the detail-preserving pooling layer operates as follows:

The output at each position is computed from the input feature map I:

[formula not reproduced in source]

where the left-hand side denotes the value at output position p after the input feature map has passed through the detail-preserving pooling layer.

The spatial averaging weights ω_{α,β}[p,q] of the input nodes are:

[formula not reproduced in source]

where α is the bias index and β is the reward index. ρ_β(·) is an inverse bilateral filter function used to compute the weights of the input points in the neighborhood Ω_p; β reduces the dynamic range of the reward function, and β→0 yields a simple neighborhood average.

The linear scale reduction factor is given by:

[formula not reproduced in source]

where F is a learnable, non-normalized 2D filter over a neighborhood of size 3×3.
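The formulas for this layer do not survive in the text. As a hedged reconstruction, assuming the layer follows the published detail-preserving pooling formulation with the reward exponent written as β to match the symbols used here, the three expressions would read roughly as below. The output symbol O, the downscaled map Ĩ, the filter neighborhood Ω̃_p, the symmetric form of ρ_β and the small constant ε are assumptions, not text from this document.

```latex
% Output: normalized weighted average over the neighborhood \Omega_p
O(I)[p] = \frac{1}{\sum_{q' \in \Omega_p} \omega_{\alpha,\beta}[p,q']}
          \sum_{q \in \Omega_p} \omega_{\alpha,\beta}[p,q]\, I[q]

% Weights: bias plus a reward on the deviation from the downscaled map
\omega_{\alpha,\beta}[p,q] = \alpha + \rho_\beta\bigl(I[q] - \tilde{I}[p]\bigr),
\qquad
\rho_\beta(x) = \bigl(\sqrt{x^2 + \varepsilon^2}\bigr)^{\beta}

% Linear downscaling with the learnable, non-normalized filter F
\tilde{I}[p] = \frac{1}{\sum_{q' \in \tilde{\Omega}_p} F[q']}
               \sum_{q \in \tilde{\Omega}_p} F[q]\, I[q]
```

Under this reading, ρ_β(x) → 1 as β → 0, so every weight becomes the constant α + 1 and the output reduces to the plain average over Ω_p, which agrees with the statement that β→0 is a simple neighborhood average.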

Specifically, the feature recalibration module is a network module that combines spatial feature recalibration and channel feature recalibration.

Specifically, the entire image semantic segmentation model is trained as follows:

Step 4.1: Preprocess the images in the training data set by cropping them to a fixed size.

Step 4.2: Initialize the entire image semantic segmentation model.

Step 4.3: Augment the data in the training data set by flipping, scaling and rotating.

Step 4.4: Use the sum of the per-pixel cross-entropy losses as the loss function, perform error backpropagation with the stochastic gradient descent algorithm, and update the model parameters to obtain the trained semantic segmentation model.
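The loss in Step 4.4 can be sketched in a few lines. This is a minimal illustration, assuming the network emits one vector of raw class scores per pixel and that a softmax precedes the cross-entropy (the text does not state the normalization); the function names and the toy inputs are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summed_pixel_cross_entropy(logit_map, label_map):
    """Sum (not mean) of the cross-entropy over every pixel.

    logit_map[y][x] is a list of class scores; label_map[y][x] is the
    integer index of the annotated class for that pixel.
    """
    loss = 0.0
    for logit_row, label_row in zip(logit_map, label_map):
        for logits, label in zip(logit_row, label_row):
            loss += -math.log(softmax(logits)[label])
    return loss


# Toy 2x2 image with two classes and uninformative (all-zero) scores.
toy_logits = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
toy_labels = [[0, 1], [1, 0]]
toy_loss = summed_pixel_cross_entropy(toy_logits, toy_labels)
```

With uniform scores each pixel contributes ln 2, so the toy loss is 4·ln 2. Because the loss is a sum rather than a mean, its magnitude grows with the crop size, which interacts with the choice of learning rate.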

With the above solution, the beneficial effects of the present invention are as follows:

(1) The image semantic segmentation model of the present invention introduces detail-preserving pooling layers, which retain more image detail during downsampling. Detail-preserving pooling is an adaptive pooling method that magnifies spatial changes and preserves important structural details; equally important, its parameters can be learned jointly with the rest of the network.

(2) The present invention introduces a feature recalibration module to recalibrate the features. Spatial feature recalibration recalibrates the importance of all pixels at the same spatial position and assigns corresponding weights, improving the accuracy of semantic segmentation. Channel feature recalibration assigns high weights to important channels to emphasize them. In short, the feature recalibration module effectively addresses the low accuracy of image semantic segmentation and the loss of detail during pooling, and ultimately yields better segmentation results.

(3) The conventional global average pooling operation performs the same operation, namely a 1×1 convolution, at the same position of all feature channels, and therefore cannot emphasize the correct class of each pixel in semantic segmentation. The variable-weight global pooling layer adds one weight vector to the 1×1 convolution in global average pooling, initializes the parameters from a standard Gaussian distribution, and continuously updates the pixel weights during error backpropagation. This gives better pixel-wise classification and also speeds up convergence.

Brief Description of the Drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is a structural diagram of the image semantic segmentation model of the present invention;

Fig. 3 is a diagram of the residual structure of the present invention;

Fig. 4 is a structural diagram of the feature recalibration module of the present invention;

Fig. 5 is a structural diagram of the channel feature recalibration module of the present invention;

Fig. 6 is a structural diagram of the spatial feature recalibration module of the present invention.

Detailed Description

To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

To solve the problem of low image segmentation accuracy in the prior art, the present invention proposes an image semantic segmentation method based on a fully convolutional neural network, which is broadly applicable to general two-dimensional image semantic segmentation.

As shown in Fig. 1, the image semantic segmentation method based on a fully convolutional neural network includes the following steps:

Step 1: Select a training data set. In this embodiment, the 21 scene categories of the VOC 2012 data set (one of which is background) are used as the baseline, and images from the COCO data set that contain objects of the 20 non-background categories are collected and added, yielding the final training and test data sets.

Step 2: Build and train a classification model from images to category labels, and use it as the front-end network of the semantic segmentation model.

As shown in Fig. 2, the front-end network of the semantic segmentation model comprises Conv1, Conv2_x, Conv3_x and Conv4_x. Each of Conv1, Conv2_x, Conv3_x and Conv4_x contains multiple convolutional layers, and each is followed by a detail-preserving pooling layer. A detail-preserving pooling layer is added after each of the blocks Conv2_x, Conv3_x and Conv4_x; it is an adaptive pooling method that magnifies spatial changes and preserves important structural details.

As shown in Fig. 3, the front-end network of the semantic segmentation model includes 33 residual structures, each containing one 1×1 convolution, one 3×3 convolution, one 1×1 convolution and one shortcut connection.

For ease of description, the feature map output by Conv1 (size 112×112), the feature map output by Conv2_x (size 56×56), the feature map output by Conv3_x (size 28×28) and the feature map output by Conv4_x (size 14×14) are denoted feature maps Res_1, Res_2, Res_3 and Res_4, respectively.

The detail-preserving pooling layer after Conv1 downsamples feature map Res_1 by a factor of 8, the one after Conv2_x downsamples feature map Res_2 by a factor of 4, and the one after Conv3_x downsamples feature map Res_3 by a factor of 2.
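A quick arithmetic check shows why these particular factors are chosen: they bring every front-end feature map to the 14×14 resolution of Res_4 so the four maps can be concatenated. The sketch below only restates the sizes and factors given above; the factor of 1 for Res_4 is an assumption reflecting that it is used at its native resolution, and the dictionary names are hypothetical.

```python
# Side lengths of the square feature maps, as stated in the description.
feature_sizes = {"Res_1": 112, "Res_2": 56, "Res_3": 28, "Res_4": 14}

# Downsampling factors of the detail-preserving pooling layers.
# Res_4 is assumed to be concatenated at its native size (factor 1).
downsample_factors = {"Res_1": 8, "Res_2": 4, "Res_3": 2, "Res_4": 1}


def pooled_size(name):
    """Side length of a feature map after its pooling layer."""
    return feature_sizes[name] // downsample_factors[name]


pooled = {name: pooled_size(name) for name in feature_sizes}
# Every entry of `pooled` is 14, so the maps can be stacked channel-wise.
```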

As shown in Fig. 2, the back-end network comprises detail-preserving pooling layers, a feature recalibration module, a convolutional layer, Conv5_x, Conv6_x, Conv7_x, a convolutional layer, a variable-weight global pooling layer and upsampling layers. Feature maps Res_1, Res_2 and Res_3 each pass through one of three detail-preserving pooling layers, are concatenated with the output of Conv4_x, and are input together into the feature recalibration module. Conv5_x, Conv6_x and Conv7_x are each preceded by an upsampling layer, and each comprises convolutional layers, a batch normalization layer and a rectified linear unit. Through skip connections, Conv5_x, Conv6_x and Conv7_x are concatenated with the output feature maps of Conv3_x, Conv2_x and Conv1, respectively.

For steps 2 and 3, the detail-preserving pooling layer operates as follows:

The output at each position p is computed from the input feature map I:

[formula not reproduced in source]

where the left-hand side denotes the value at output position p after the input feature map has passed through the detail-preserving pooling layer, and Ω_p is the neighborhood space.

The spatial averaging weights ω_{α,β}[p,q] of the input nodes are:

[formula not reproduced in source]

The linear scale reduction factor is given by:

[formula not reproduced in source]

where F is a learnable, non-normalized 2D filter over a neighborhood of size 3×3.
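To make the behaviour of this layer concrete, here is a minimal pure-Python sketch of a 2×2, stride-2 detail-preserving pooling step. It is an illustration under stated assumptions, not the patented implementation: the reward function is assumed to be the symmetric form ρ_β(x) = (√(x² + ε²))^β, the constant ε is assumed, and a uniform average over the pooling window stands in for the learnable 3×3 filter F. All function names are hypothetical.

```python
import math

EPSILON = 1e-3  # assumed small constant; keeps the reward smooth at x = 0


def reward(x, beta):
    """Assumed symmetric reward: grows with |x| and equals 1 when beta == 0."""
    return math.sqrt(x * x + EPSILON * EPSILON) ** beta


def detail_preserving_pool(image, alpha=0.0, beta=1.0):
    """2x2, stride-2 pooling of a 2D list with even height and width."""
    pooled = []
    for top in range(0, len(image), 2):
        row = []
        for left in range(0, len(image[0]), 2):
            window = [image[top + dy][left + dx] for dy in (0, 1) for dx in (0, 1)]
            # Simplification: a uniform average over the window stands in
            # for the downscaled value produced by the learnable filter F.
            reference = sum(window) / len(window)
            # Inputs that deviate more from the reference get larger weights.
            weights = [alpha + reward(v - reference, beta) for v in window]
            row.append(sum(w * v for w, v in zip(weights, window)) / sum(weights))
        pooled.append(row)
    return pooled


flat = detail_preserving_pool([[1.0, 3.0], [5.0, 7.0]], alpha=0.0, beta=0.0)
sharp = detail_preserving_pool([[0.0, 0.0], [0.0, 8.0]], alpha=0.0, beta=2.0)
```

With β = 0 every weight is the same and `flat` is the ordinary window average. With β = 2 the isolated bright pixel in the second input dominates the weighted average, so `sharp` lies well above the plain average of 2, which is the sense in which the layer preserves detail.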

Specifically, the feature recalibration module (shown in Fig. 4) is a network module that combines spatial feature recalibration with channel feature recalibration.

The two are described separately below:

As shown in Fig. 5, the spatial feature recalibration module proceeds as follows:

(1) The original feature map M^c is passed through a convolution with a 1×1 kernel and c channels (the weights of each channel are not shared and are learned), yielding a feature map M′.

(2) M′ is then passed through a sigmoid layer, which recalibrates the importance of each spatial position M′(i,j), i∈{1,2,…,H}, j∈{1,2,…,W}, of M^c and assigns each spatial position a weight p(i,j); the resulting p(i,j) is multiplied element-wise with the original feature map M^c.

Finally, the feature map obtained from M^c after spatial feature recalibration is:

[formula not reproduced in source]

Spatial feature recalibration recalibrates the importance of all pixels at the same spatial position and assigns corresponding weights, improving the accuracy of semantic segmentation.
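The spatial recalibration steps can be sketched as follows. This is one possible reading, offered tentatively: the 1×1 convolution is assumed to collapse the c input channels into a single spatial map M′ (the text leaves the output channel count ambiguous), the bias term is omitted, and the function and parameter names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_recalibration(feature_map, channel_weights):
    """Scale every spatial position of every channel by a gate p(i, j).

    feature_map[k][i][j] holds channel k at position (i, j);
    channel_weights[k] are the learned weights of the assumed 1x1 convolution.
    """
    height, width = len(feature_map[0]), len(feature_map[0][0])
    # 1x1 convolution across channels followed by a sigmoid gives p(i, j).
    gate = [
        [
            sigmoid(sum(w * ch[i][j] for w, ch in zip(channel_weights, feature_map)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # The same gate value multiplies all channels at a given position.
    return [
        [[gate[i][j] * ch[i][j] for j in range(width)] for i in range(height)]
        for ch in feature_map
    ]


# Two channels of size 1x2; zero weights give a gate of 0.5 everywhere.
toy_map = [[[2.0, 4.0]], [[6.0, 8.0]]]
halved = spatial_recalibration(toy_map, [0.0, 0.0])
```

With non-zero learned weights the gate differs from position to position, so locations the network considers important are passed through almost unchanged while others are attenuated.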

As shown in Fig. 6, the channel feature recalibration module proceeds as follows:

(1) The original feature map M^c is passed through a global average pooling, yielding a feature map M′. M′ is then fully connected with the original feature map M^c to integrate the feature maps.

(2) The integrated feature map is passed through a rectified linear unit to rectify the features.

(3) The rectified feature map is then passed through a convolution with an H×W kernel and c channels, yielding a feature vector z.

(4) The feature vector z is passed through a sigmoid layer, which limits its activation range to [0, 1], yielding a channel weight vector.

The feature map obtained from M^c after channel feature recalibration is:

[formula not reproduced in source]

Through channel feature recalibration, important channels are assigned high weights, emphasizing their importance.
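The channel branch resembles a squeeze-and-excitation block, and the sketch below uses that standard pattern (global average pooling, a fully connected layer, ReLU, a second linear map, sigmoid, channel-wise scaling) as a stand-in. It is an approximation rather than the exact layer order described above: in particular the second linear map `fc2` replaces the H×W convolution that produces the vector z, biases are omitted, and all names are hypothetical.

```python
import math


def channel_recalibration(feature_map, fc1, fc2):
    """Scale each channel by a learned weight in (0, 1).

    feature_map[k][i][j] holds channel k; fc1 and fc2 are weight matrices
    given as lists of rows (assumed shapes c x c for this sketch).
    """
    # Squeeze: global average pooling gives one value per channel.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map
    ]
    # Fully connected layer followed by a rectified linear unit.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in fc1]
    # Second linear map produces the feature vector z.
    z = [sum(w * h for w, h in zip(row, hidden)) for row in fc2]
    # Sigmoid limits every channel weight to the range (0, 1).
    gates = [1.0 / (1.0 + math.exp(-v)) for v in z]
    # Each channel is multiplied by its own weight.
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feature_map)]


toy_map = [[[2.0, 4.0]], [[6.0, 8.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
halved = channel_recalibration(toy_map, identity, zeros)
favoured = channel_recalibration(toy_map, identity, [[-1.0, 0.0], [0.0, 1.0]])
```

In the second call the first channel receives a weight near 0.05 and the second a weight near 1, which is the "high weight for important channels" behaviour the text describes.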

Step 4: Train the entire image semantic segmentation model. The training process is as follows:

Step 4.1: Preprocess the images in the training data set by cropping them to a fixed size of 513×513.

Step 4.2: Initialize the entire image semantic segmentation model, that is, use the parameter values of the pre-trained image semantic segmentation model as the initial values.

Step 4.3: Augment the data in the training data set by flipping, scaling and rotating. Specifically, images are flipped at random, scaled by a random factor between 0.5 and 2 relative to the original image, and rotated by a random angle between -10 and 10 degrees.
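The augmentation ranges in Step 4.3 can be captured by a small parameter sampler. This is a sketch only: the flip probability of 0.5 and the uniform sampling of scale and angle are assumptions (the text says only "random"), the function name is hypothetical, and the actual image transforms, which must be applied identically to the image and its label map, are left to whatever image library is in use.

```python
import random


def sample_augmentation(rng):
    """Draw one set of augmentation parameters for an image/label pair."""
    return {
        "flip": rng.random() < 0.5,                   # assumed 50% flip chance
        "scale": rng.uniform(0.5, 2.0),               # 0.5x to 2x of the original
        "angle_degrees": rng.uniform(-10.0, 10.0),    # rotation in degrees
    }


rng = random.Random(0)  # seeded generator so runs are repeatable
samples = [sample_augmentation(rng) for _ in range(100)]
```

Sampling the parameters once per training example and applying them to both the image and its annotation keeps the pixel labels aligned with the transformed image.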

Step 4.4: Use the sum of the per-pixel cross-entropy losses as the loss function, perform error backpropagation with the stochastic gradient descent algorithm under a polynomial learning rate policy, and update the model parameters to obtain the trained semantic segmentation model. Under the polynomial policy, the learning rate lr is set as:

[formula not reproduced in source]

where baselr is the initial learning rate, set here to 0.001, and power is set to 0.9.
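The learning-rate formula itself does not survive in the text. Assuming it is the commonly used "poly" schedule, lr = baselr × (1 − iter / max_iter)^power, it can be sketched as below. The schedule form and the `max_iterations` parameter are assumptions; only baselr = 0.001 and power = 0.9 come from the description.

```python
def poly_learning_rate(iteration, max_iterations, base_lr=0.001, power=0.9):
    """Assumed 'poly' schedule: decays from base_lr at iteration 0
    to zero at max_iterations."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


# Hypothetical run length of 100 iterations, sampled every 25 iterations.
schedule = [poly_learning_rate(step, 100) for step in range(0, 101, 25)]
```

With power below 1 the rate stays close to its initial value for most of training and falls off sharply near the end.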

The principle and process of the present invention are as follows. The image semantic segmentation model uses the feature map Res_1 output by Conv1, the feature map Res_2 output by Conv2_x, the feature map Res_3 output by Conv3_x and the feature map Res_4 output by Conv4_x, which correspond to the first, second, third and fourth layers of the front-end network (that is, the feature extraction network). Res_1 is downsampled by a factor of 8, Res_2 by a factor of 4 and Res_3 by a factor of 2, each through a detail-preserving pooling layer; the results are concatenated with Res_4 and input into the feature recalibration module. Spatial feature recalibration recalibrates the importance of all pixels at the same spatial position and assigns corresponding weights, improving the accuracy of semantic segmentation, while channel feature recalibration assigns high weights to important channels to emphasize them.

The feature map output by the feature recalibration module then passes through one 1×1 convolutional layer to give feature map De_1. De_1 is upsampled to 28×28, passed through two 3×3 convolutions, a batch normalization layer and a rectified linear unit, and concatenated with feature map Res_3 to give feature map De_2. De_2 is upsampled to 56×56, passed through two 3×3 convolutions, a batch normalization layer and a rectified linear unit, and concatenated with feature map Res_2 to give feature map De_3. De_3 is upsampled to 112×112 and passed through two 3×3 convolutions, a batch normalization layer and a rectified linear unit to give feature map De_4. Finally, De_4 passes through one variable-weight global pooling layer and is upsampled to the size of the original image; the cross-entropy with the semantic segmentation annotation is computed, and error backpropagation yields the semantic segmentation network model.

Regarding variable-weight global pooling: the conventional global average pooling operation performs the same operation, namely a 1×1 convolution, at the same position of all feature channels, and therefore cannot emphasize the correct class of each pixel in semantic segmentation. One weight vector is therefore added to the 1×1 convolution in global average pooling, with its parameters initialized from a standard Gaussian distribution. During training, backpropagation assigns high weights to pixels belonging to the target class, which gives better pixel-wise classification and also speeds up convergence. The present invention achieves an mIoU of 76.33% on the VOC2012 semantic segmentation data set.

All technical variations made according to the technical solution of the present invention fall within the scope of protection of the present invention.

Claims

1. An image semantic segmentation method based on a fully convolutional neural network, characterized by comprising the following steps:

step 1: selecting a training data set;

step 2: constructing and training a semantic segmentation model front-end network from images to category labels;

the front-end network of the semantic segmentation model comprises Conv1, Conv2_x, Conv3_x and Conv4_x, wherein each of Conv1, Conv2_x, Conv3_x and Conv4_x comprises a plurality of convolutional layers, and a detail-preserving pooling layer is connected after each of Conv1, Conv2_x, Conv3_x and Conv4_x;

step 3: constructing a semantic segmentation model back-end network on the basis of the trained semantic segmentation model front-end network;

the back-end network comprises a detail-preserving pooling layer, a feature recalibration module, a 1×1 convolutional layer, Conv5_x, Conv6_x, Conv7_x, a variable-weight global pooling layer and an upsampling layer; the outputs of Conv1, Conv2_x and Conv3_x respectively pass through three detail-preserving pooling layers, are concatenated with Conv4_x, and are input together into the feature recalibration module; Conv5_x, Conv6_x and Conv7_x are each preceded by an upsampling layer, Conv5_x, Conv6_x and Conv7_x each comprise convolutional layers, a batch normalization layer and a rectified linear unit, and Conv5_x, Conv6_x and Conv7_x are concatenated, through skip connections, with the output feature maps of Conv3_x, Conv2_x and Conv1, respectively; the output of the feature recalibration module passes through one 1×1 convolutional layer to obtain a feature map, and the feature map is upsampled and then connected to Conv5_x;

wherein the variable-weight global pooling layer adds one weight vector to the 1×1 convolution in global average pooling, the parameters are initialized from a standard Gaussian distribution, and the pixel weights are continuously updated during error backpropagation;

step 4: training the entire image semantic segmentation model;

step 5: inputting a new image, performing one forward pass in the trained deep neural network model, and outputting the predicted semantic segmentation result end to end.

2. The image semantic segmentation method based on a fully convolutional neural network of claim 1, wherein the semantic segmentation model front-end network comprises 33 residual structures, and each residual structure comprises one 1×1 convolution, one 3×3 convolution, one 1×1 convolution and one shortcut connection.

3. The image semantic segmentation method based on a fully convolutional neural network of claim 1, wherein the detail-preserving pooling layer after Conv1 downsamples the output feature map of Conv1 by a factor of 8, the detail-preserving pooling layer after Conv2_x downsamples the output feature map of Conv2_x by a factor of 4, and the detail-preserving pooling layer after Conv3_x downsamples the output feature map of Conv3_x by a factor of 2.

4. The image semantic segmentation method based on a fully convolutional neural network of claim 1, wherein the specific process of the detail-preserving pooling layer is as follows:

calculating the output of each position according to the input feature map I:

[formula not reproduced in source]

wherein the left-hand side represents the value at output position p after the input feature map passes through the detail-preserving pooling layer;

the spatial averaging weights ω_{α,β}[p,q] of the input nodes are:

[formula not reproduced in source]

wherein α is a bias index and β is a reward index; ρ_β(·) is an inverse bilateral filter function used to calculate the weights of the input points in the neighborhood space Ω_p, β reduces the dynamic range of the reward function, and β→0 gives a simple neighborhood average;

the linear scale reduction factor is specifically:

[formula not reproduced in source]

wherein F is a learnable, non-normalized 2D filter over a neighborhood of size 3×3.

5. The image semantic segmentation method based on a fully convolutional neural network of claim 1, wherein the feature recalibration module is a network module combining spatial feature recalibration and channel feature recalibration.

6. The image semantic segmentation method based on a fully convolutional neural network of claim 1, wherein the process of training the entire image semantic segmentation model is as follows:

step 4.1: preprocessing the images in the training data set by cropping them to a fixed size;

step 4.2: initializing the entire image semantic segmentation model;

step 4.3: augmenting the data in the training data set by flipping, scaling and rotating;

step 4.4: taking the sum of the per-pixel cross-entropy losses as the loss function, performing error backpropagation with the stochastic gradient descent algorithm, and updating the model parameters to obtain the trained semantic segmentation model.