
CN113221969A - Semantic segmentation system and method based on Internet of Things perception and dual-feature fusion - Google Patents

Semantic segmentation system and method based on Internet of Things perception and dual-feature fusion

Info

Publication number
CN113221969A
CN113221969A
Authority
CN
China
Prior art keywords
feature
fusion
features
attention vector
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110446945.6A
Other languages
Chinese (zh)
Inventor
朱信忠
徐慧英
涂文轩
刘新旺
赵建民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Normal University CJNU
Original Assignee
Zhejiang Normal University CJNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Normal University CJNU filed Critical Zhejiang Normal University CJNU
Priority to CN202110446945.6A priority Critical patent/CN113221969A/en
Publication of CN113221969A publication Critical patent/CN113221969A/en
Priority to PCT/CN2022/081427 priority patent/WO2022227913A1/en
Priority to LU503090A priority patent/LU503090B1/en
Priority to ZA2022/07731A priority patent/ZA202207731B/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic segmentation system and method based on Internet of Things perception and dual-feature fusion. The method comprises the following steps: S1, performing feature encoding on an original image to obtain features of different scales; S2, learning the features of different scales through two attention refinement blocks to obtain multi-level fusion features; S3, reducing the dimension of the multi-level fusion features to obtain dimension reduction features; S4, performing context encoding on the dimension reduction features with depthwise separable convolutions of different convolution scales to obtain local features of different scales; S5, performing global pooling on the dimension reduction features with a global mean pooling layer to obtain global features; S6, performing channel splicing and fusion on the global features and the local features to obtain multi-scale context fusion features; S7, performing channel splicing and fusion on the dimension reduction features and the multi-scale context fusion features to obtain spliced features; and S8, obtaining the output from the spliced features. The semantic difference among the multi-level features is alleviated, the information representation is enriched, and the recognition accuracy is improved.

Description

Semantic segmentation system and method based on Internet of Things perception and dual-feature fusion
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a semantic segmentation system and method based on Internet of Things perception and dual-feature fusion.
Background
Semantic segmentation, which aims to densely assign each pixel to a predefined class, is an increasingly attractive research direction in the field of computer vision. Owing to the strong representation learning capability of deep learning methods, research on semantic segmentation has achieved good performance in many Internet of Things applications such as autonomous driving and diabetic retinopathy image analysis. Two important factors, the feature fusion scheme and the complexity of the network, largely determine the performance of a semantic segmentation method. In particular, to accurately parse complex scenes in resource-constrained Internet of Things (IoT) environments, it is both important and challenging to encode robust multi-level features and diverse context information in an efficient and effective manner so as to achieve accurate, fast and lightweight performance.
Existing semantic segmentation methods can be roughly divided into two categories: accuracy-oriented and efficiency-oriented methods. Early work mostly focused on a single perspective: either the recognition accuracy of the algorithm or its execution speed. In the first category, the design of the segmentation model mainly focuses on how to integrate diversified features, and complex frameworks are designed to achieve high-accuracy segmentation. For example, researchers have proposed pyramid structures such as the atrous spatial pyramid pooling module (ASPP) or the Context Pyramid Module (CPM), which encode multi-scale context information at the end of the ResNet101 backbone (2048 feature maps) to handle multi-scale variation of targets. In addition, U-shaped networks directly fuse hierarchical features through long skip connections and extract spatial information of different levels as far as possible, thereby achieving accurate pixel-level segmentation. On the other hand, typical asymmetric encoder-decoder structures have also been studied extensively. The ENet and ESPNet networks greatly compress the network size through pruning operations and process large-scale images online at very high speed. To improve the overall performance of semantic segmentation methods, the recent literature shows a trend of jointly considering the efficiency and effectiveness of a segmentation network when encoding multi-level features and multi-scale context information. In particular, ERFNet employs a large number of factorized convolutions with different dilation rates in the decoder, reducing parameter redundancy while enlarging the receptive field. In addition, researchers have proposed BiSeNet, CANet and ICNet, which process the input image through several lightweight sub-networks and then fuse multi-level features or deep context information. In recent work, CIFReNet encodes multi-level and multi-scale information by introducing a feature refinement module and a context integration module to achieve accurate and efficient scene segmentation.
Although existing research achieves good segmentation performance in terms of either high accuracy or high speed, the existing methods have at least the following problems: 1) in the multi-level information fusion process, feature extraction relies on considerable time and computational complexity, so that model learning efficiency is low and computational cost is high; 2) methods that directly fuse multi-source information through element-wise addition or concatenation rarely consider how to narrow the semantic gap between multi-level features. Interaction between the various information sources is therefore hindered, resulting in unsatisfactory segmentation accuracy.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a semantic segmentation system and method based on Internet of Things perception and dual-feature fusion, which achieve a balance of overall performance in terms of accuracy, speed, memory and computational complexity.
A semantic segmentation method based on Internet of Things perception and dual-feature fusion comprises the following steps:
S1, inputting an original image, and performing feature encoding on the original image with a backbone network to obtain features of different scales;
S2, learning the features of different scales through two attention refinement blocks to obtain multi-level fusion features;
S3, reducing the dimension of the multi-level fusion features to obtain dimension reduction features;
S4, performing context encoding on the dimension reduction features with depthwise separable decomposed convolutions of different convolution scales to obtain local features of different scales;
S5, performing global pooling on the dimension reduction features with a global mean pooling layer to obtain global features;
S6, performing channel splicing and fusion on the global features and the local features of different scales to obtain multi-scale context fusion features;
S7, performing channel splicing and fusion on the dimension reduction features and the multi-scale context fusion features to obtain spliced features;
and S8, reducing the dimension of the spliced features and up-sampling them to obtain the final output.
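The composition of steps S1-S8 can be illustrated with the following PyTorch-style sketch. It is only a minimal structural outline under assumed module interfaces (a backbone returning features at 1/4, 1/8 and 1/16 scale, two attention refinement blocks and a lightweight semantic pyramid module); none of the names or signatures are prescribed by the invention.

```python
# Illustrative composition of steps S1-S8 (module names and interfaces are assumptions).
import torch.nn as nn
import torch.nn.functional as F

class DualFeatureFusionNet(nn.Module):
    def __init__(self, backbone, arb1, arb2, lspm):
        super().__init__()
        self.backbone = backbone    # S1: features at 1/4, 1/8 and 1/16 of the input scale
        self.arb1 = arb1            # S2.1: fuse the 1/4 and 1/8 features
        self.arb2 = arb2            # S2.2: fuse the result with the 1/16 features
        self.lspm = lspm            # S3-S8 (everything except the final up-sampling)

    def forward(self, x):
        f4, f8, f16 = self.backbone(x)
        semantic = self.arb1(f4, f8)
        fused = self.arb2(f16, semantic)
        logits = self.lspm(fused)
        # S8: up-sample the reduced, concatenated features to the input resolution
        return F.interpolate(logits, size=x.shape[2:], mode='bilinear', align_corners=False)
```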
Preferably, step S1 specifically includes:
feature encoding is performed on the original image by using a backbone network to obtain a first feature, a second feature and a third feature, wherein the scale of the first feature is 1/4 of the original image scale, the scale of the second feature is 1/8 of the original image scale, and the scale of the third feature is 1/16 of the original image scale.
Preferably, step S2 includes the following steps:
S2.1, fusing the first feature and the second feature through a first attention refinement block to output semantic features;
and S2.2, fusing the semantic features and the third feature through a second attention refinement block to obtain the multi-level fusion features.
Preferably, step S2.1 specifically includes the following steps:
S2.1.1, mapping the first feature to a scale consistent with the second feature through a down-sampling layer to obtain a first scale feature;
S2.1.2, mapping the channel dimension of the first scale feature to be consistent with the channel dimension of the second feature through a first 1×1 convolution layer to obtain a first channel feature;
S2.1.3, performing channel splicing and fusion on the first scale feature and the second feature to obtain a first fusion feature;
S2.1.4, inputting the first fusion feature into a first adaptive mean pooling layer and a first adaptive max pooling layer respectively to output a first attention vector and a second attention vector respectively;
S2.1.5, performing nonlinear mapping on the first attention vector and the second attention vector through a first multi-layer perceptron to output a first mixed attention vector and a second mixed attention vector, and fusing the first mixed attention vector and the second mixed attention vector to output a first fused mixed attention vector;
S2.1.6, normalizing the first fused mixed attention vector to obtain a first normalized mixed attention vector;
S2.1.7, weighting and mapping the first channel feature with the first normalized mixed attention vector;
S2.1.8, fusing the second feature and the weighted first channel feature to output the semantic features.
Preferably, step S2.2 specifically includes the following steps:
S2.2.1, mapping the third feature to a scale consistent with the second feature through an up-sampling layer to obtain a second scale feature;
S2.2.2, mapping the channel dimension of the second scale feature to be consistent with the channel dimension of the second feature through a second 1×1 convolution layer to obtain a second channel feature;
S2.2.3, performing channel splicing and fusion on the second scale feature and the semantic features to obtain a second fusion feature;
S2.2.4, inputting the second fusion feature into a second adaptive mean pooling layer and a second adaptive max pooling layer respectively to output a third attention vector and a fourth attention vector respectively;
S2.2.5, performing nonlinear mapping on the third attention vector and the fourth attention vector through a second multi-layer perceptron to output a third mixed attention vector and a fourth mixed attention vector, and fusing the third mixed attention vector and the fourth mixed attention vector to output a second fused mixed attention vector;
S2.2.6, normalizing the second fused mixed attention vector to obtain a second normalized mixed attention vector;
S2.2.7, weighting and mapping the second channel feature with the second normalized mixed attention vector;
S2.2.8, fusing the semantic features and the weighted second channel feature to obtain the multi-level fusion features.
Preferably, the first fusion feature is input into the first adaptive mean pooling layer and the first adaptive maximum pooling layer respectively in step S2.1.4 to output the first attention vector and the second attention vector respectively, specifically using the following formulas:
V1=AAP1(C[F1,F2]),
V2=AMP1(C[F1,F2]),
wherein V1 is the first attention vector, V2 is the second attention vector, F1 is the first scale feature, F2 is the second feature, C[] denotes channel splicing and fusion, AAP1() denotes the first adaptive mean pooling layer, and AMP1() denotes the first adaptive max pooling layer.
Preferably, in step S2.2.4, the second fusion feature is respectively input into the second adaptive mean pooling layer and the second adaptive maximum pooling layer to respectively output the third attention vector and the fourth attention vector, specifically using the following formulas:
V3=AAP2(C[L1,L2]),
V4=AMP2(C[L1,L2]),
wherein V3 is the third attention vector, V4 is the fourth attention vector, L1 is the second scale feature, L2 is the semantic feature, AAP2() denotes the second adaptive mean pooling layer, and AMP2() denotes the second adaptive max pooling layer.
Preferably, in step S2.1.5, the first attention vector and the second attention vector are nonlinearly mapped by the first multi-layer perceptron to output the first mixed attention vector and the second mixed attention vector, and the first mixed attention vector and the second mixed attention vector are fused by channel splicing to output the first fused mixed attention vector, specifically using the following formula:
Va1=MLP1(C[V1,V2]),
and in step S2.2.5, the third attention vector and the fourth attention vector are nonlinearly mapped by the second multi-layer perceptron to output the third mixed attention vector and the fourth mixed attention vector, and the third mixed attention vector and the fourth mixed attention vector are fused to output the second fused mixed attention vector, specifically using the following formula:
Va2=MLP2(C[V3,V4]),
wherein Va1 is the first fused mixed attention vector, Va2 is the second fused mixed attention vector, MLP1() is the first multi-layer perceptron, and MLP2() is the second multi-layer perceptron.
Preferably, steps S2.1.6 (normalizing the first fused mixed attention vector to obtain a first normalized mixed attention vector), S2.1.7 (weighting and mapping the first channel feature with the first normalized mixed attention vector) and S2.1.8 (fusing the second feature and the weighted first channel feature to output the semantic features) specifically adopt the following formula:
L2=(Sig1(Va1)⊙F'1)⊕F2,
and steps S2.2.6 (normalizing the second fused mixed attention vector to obtain a second normalized mixed attention vector), S2.2.7 (weighting and mapping the second channel feature with the second normalized mixed attention vector) and S2.2.8 (fusing the semantic features and the weighted second channel feature to obtain the multi-level fusion features) specifically adopt the following formula:
L'2=(Sig2(Va2)⊙L'1)⊕L2,
wherein L2 is the semantic feature, L'2 is the multi-level fusion feature, Sig1() denotes the first activation function, Sig2() denotes the second activation function, F'1 is the first channel feature, L'1 is the second channel feature, H denotes the height of the feature map, W denotes the width of the feature map (the normalized mixed attention vector being broadcast over the H×W spatial positions), ⊙ denotes a pixel-level dot-product operation, and ⊕ denotes a pixel-level dot-addition operation.
Correspondingly, a semantic segmentation system based on Internet of Things perception and dual-feature fusion is further provided, which comprises a multi-layer feature fusion module and a lightweight semantic pyramid module connected with each other;
the multi-layer feature fusion module comprises a backbone network unit and a proofreading unit;
the lightweight semantic pyramid module comprises a first dimension reduction unit, a second dimension reduction unit, a third dimension reduction unit, a context coding unit, a global pooling unit, a first channel splicing and fusing unit, a second channel splicing and fusing unit and an upsampling unit;
the backbone network unit is connected with a proofreading unit, the proofreading unit is respectively connected with a first dimension reduction unit and a second dimension reduction unit, the first dimension reduction unit is respectively connected with a context coding unit and a global pooling unit, the context coding unit and the global pooling unit are both connected with a first channel splicing and fusing unit, the second dimension reduction unit and the first channel splicing and fusing unit are both connected with a second channel splicing and fusing unit, the second channel splicing and fusing unit is also connected with a third dimension reduction unit, and an up-sampling unit is connected with the third dimension reduction unit;
the backbone network unit is used for performing feature coding on the original image by using a backbone network to obtain features of different scales;
the proofreading unit is used for learning the features of different scales through the two attention refinement blocks to obtain multi-level fusion features;
the first dimension reduction unit and the second dimension reduction unit are used for reducing the dimension of the multi-level fusion feature so as to output a first dimension reduction feature and a second dimension reduction feature respectively, and the first dimension reduction feature and the second dimension reduction feature are the same;
the context coding unit is used for respectively carrying out context coding on the first dimension reduction characteristics through depth separable convolutions with different convolution scales so as to obtain local characteristics with different scales;
the global pooling unit is used for performing global pooling on the first dimension reduction feature through a global mean pooling layer to obtain a global feature;
the first channel splicing and fusing unit is used for carrying out channel splicing and fusing on the global features and the local features of different scales so as to obtain multi-scale context fusion features;
the second channel splicing and fusing unit is used for carrying out channel splicing and fusing on the second dimension reduction feature and the multi-scale context fusion feature to obtain a splicing feature;
the third dimension reduction unit is used for reducing the dimension of the splicing feature;
and the up-sampling unit is used for up-sampling the splicing characteristics subjected to the dimensionality reduction so as to obtain final output.
The invention has the beneficial effects that:
(1) A multi-level feature fusion module (MFFM) is proposed, which employs two recursive attention refinement blocks (ARBs) to effectively improve the effectiveness of multi-level feature fusion. With a controllable additional computational cost, the proposed ARB uses the abstract semantic information of high-level features to calibrate the spatial detail information in low-level features, thereby alleviating the semantic difference among the multi-level features.
(2) A lightweight semantic pyramid module (LSPM) is provided, which factorizes the convolution operators to reduce the computational overhead of context information encoding. In addition, the module fuses the multi-level fusion features with the multi-scale context information, enriching the information representation and thereby improving the recognition accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of the semantic segmentation method based on Internet of Things perception and dual-feature fusion according to the invention;
FIG. 2 is a schematic structural diagram of the semantic segmentation system based on Internet of Things perception and dual-feature fusion according to the invention;
FIG. 3 is a schematic structural diagram of the attention refinement block according to the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided by way of specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure herein. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
Embodiment one:
Referring to fig. 1, fig. 2 and fig. 3, this embodiment provides a semantic segmentation method based on Internet of Things perception and dual-feature fusion, which includes the steps of:
S1, inputting an original image, and performing feature encoding on the original image with a backbone network to obtain features of different scales;
S2, learning the features of different scales through two attention refinement blocks to obtain multi-level fusion features;
S3, reducing the dimension of the multi-level fusion features to obtain dimension reduction features;
S4, performing context encoding on the dimension reduction features with depthwise separable decomposed convolutions of different convolution scales to obtain local features of different scales;
S5, performing global pooling on the dimension reduction features with a global mean pooling layer to obtain global features;
S6, performing channel splicing and fusion on the global features and the local features of different scales to obtain multi-scale context fusion features;
S7, performing channel splicing and fusion on the dimension reduction features and the multi-scale context fusion features to obtain spliced features;
and S8, reducing the dimension of the spliced features and up-sampling them to obtain the final output.
Wherein, step S1 specifically includes:
feature encoding is performed on the original image by using a backbone network to obtain a first feature, a second feature and a third feature, wherein the scale of the first feature is 1/4 of the original image scale, the scale of the second feature is 1/8 of the original image scale, and the scale of the third feature is 1/16 of the original image scale.
Each layer of the backbone network has a different feature expression capability. Shallower layers contain more spatial detail but lack semantic information, while deeper layers retain rich semantic information but lose a large amount of spatial information. Intuitively, fusing multi-level information together is beneficial for learning discriminative and comprehensive feature representations.
Based on the above observation, we obtain features of different scales from the backbone network, denoted sequentially as I1/4, I1/8 and I1/16, and then unify the scales of all feature maps to the 1/8 size to reduce information loss and resource consumption. Specifically, I1/4 is down-sampled by a pooling layer with kernel size 2 and stride 2 to obtain T'1/8, and the high-level feature map I1/16 is up-sampled by a bilinear layer to obtain T''1/8. Finally, the three are fused to obtain the multi-level fusion feature O. The above process is expressed by the following formulas:
T'1/8=T(GAPk=2,s=2(I1/4))
T''1/8=Upsample(I1/16)
O=T'1/8⊕I1/8⊕T''1/8
wherein GAPk=2,s=2() denotes a mean pooling layer with kernel size 2 and stride 2, T() is defined as a channel transformation operation that changes the number of feature maps, Upsample() denotes an up-sampling layer, and ⊕ denotes a pixel-level dot-addition operation.
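A minimal PyTorch sketch of this alignment and fusion is given below, assuming T() is realized as a 1×1 convolution; the 1×1 channel transform applied to the up-sampled 1/16 branch is an additional assumption made only so that the channel dimensions match.

```python
# Sketch of aligning I_1/4, I_1/8, I_1/16 to the 1/8 scale and fusing them (channel widths assumed).
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAlignFuse(nn.Module):
    def __init__(self, c4, c8, c16):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)   # GAP_{k=2,s=2}: down-sample 1/4 -> 1/8
        self.t4 = nn.Conv2d(c4, c8, kernel_size=1)          # T(): channel transform for the 1/4 branch
        self.t16 = nn.Conv2d(c16, c8, kernel_size=1)        # assumed channel transform for the 1/16 branch

    def forward(self, i4, i8, i16):
        t4 = self.t4(self.pool(i4))                                       # T'_{1/8}
        t16 = F.interpolate(self.t16(i16), size=i8.shape[2:],             # T''_{1/8}: bilinear up-sampling
                            mode='bilinear', align_corners=False)
        return t4 + i8 + t16                                              # O: pixel-level addition
```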
Although the above feature fusion operation facilitates the mutual use of complementary information between multi-level features, directly integrating low-level features with high-level features may be neither efficient nor comprehensive because of the semantic differences between the levels. To address this problem, the invention designs a feature refinement strategy, called the Attention Refinement Block (ARB). Both ARBs focus on modeling the inter-channel relationships of the multi-level fusion features. In this way, when the current channel contains valuable information, the model can emphasize the weights of the neurons that are highly related to the target object.
Namely, step S2 includes the following steps:
S2.1, fusing the first feature and the second feature through a first attention refinement block to output semantic features;
and S2.2, fusing the semantic features and the third feature through a second attention refinement block to obtain the multi-level fusion features.
Further, step S2.1 specifically includes the following steps:
S2.1.1, mapping the first feature to a scale consistent with the second feature through a down-sampling layer to obtain a first scale feature;
S2.1.2, mapping the channel dimension of the first scale feature to be consistent with the channel dimension of the second feature through a first 1×1 convolution layer to obtain a first channel feature;
S2.1.3, performing channel splicing and fusion on the first scale feature and the second feature to obtain a first fusion feature;
S2.1.4, inputting the first fusion feature into a first adaptive mean pooling layer (AAP) and a first adaptive max pooling layer (AMP) respectively to output a first attention vector and a second attention vector respectively; both the adaptive mean pooling layer (AAP) and the adaptive max pooling layer (AMP) model the importance of individual feature channels by weighting all channels of the multi-level fused feature, and the higher the importance of the current feature channel, the larger the corresponding weight;
S2.1.5, performing nonlinear mapping on the first attention vector and the second attention vector through a first multi-layer perceptron to output a first mixed attention vector and a second mixed attention vector, which improves the nonlinearity and robustness of the features, and fusing the first mixed attention vector and the second mixed attention vector to output a first fused mixed attention vector;
S2.1.6, normalizing the first fused mixed attention vector to obtain a first normalized mixed attention vector;
S2.1.7, weighting and mapping the first channel feature with the first normalized mixed attention vector;
S2.1.8, fusing the second feature and the weighted first channel feature to output the semantic features.
Further, step S2.2 specifically includes the following steps:
S2.2.1, mapping the third feature to a scale consistent with the second feature through an up-sampling layer to obtain a second scale feature;
S2.2.2, mapping the channel dimension of the second scale feature to be consistent with the channel dimension of the second feature through a second 1×1 convolution layer to obtain a second channel feature;
S2.2.3, performing channel splicing and fusion on the second scale feature and the semantic features to obtain a second fusion feature;
S2.2.4, inputting the second fusion feature into a second adaptive mean pooling layer and a second adaptive max pooling layer respectively to output a third attention vector and a fourth attention vector respectively;
S2.2.5, performing nonlinear mapping on the third attention vector and the fourth attention vector through a second multi-layer perceptron to output a third mixed attention vector and a fourth mixed attention vector, and fusing the third mixed attention vector and the fourth mixed attention vector to output a second fused mixed attention vector;
S2.2.6, normalizing the second fused mixed attention vector to obtain a second normalized mixed attention vector;
S2.2.7, weighting and mapping the second channel feature with the second normalized mixed attention vector;
S2.2.8, fusing the semantic features and the weighted second channel feature to obtain the multi-level fusion features.
Further, in step S2.1.4, the first fusion feature is respectively input into the first adaptive mean pooling layer and the first adaptive maximum pooling layer to respectively output the first attention vector and the second attention vector, specifically using the following formulas:
V1=AAP1(C[F1,F2]),
V2=AMP1(C[F1,F2]),
wherein V1 is the first attention vector, V2 is the second attention vector, F1 is the first scale feature, F2 is the second feature, C[] denotes channel splicing and fusion, AAP1() denotes the first adaptive mean pooling layer, and AMP1() denotes the first adaptive max pooling layer.
In step S2.2.4, the second fusion feature is input into the second adaptive mean pooling layer and the second adaptive max pooling layer respectively to output the third attention vector and the fourth attention vector, specifically using the following formulas:
V3=AAP2(C[L1,L2]),
V4=AMP2(C[L1,L2]),
wherein V3 is the third attention vector, V4 is the fourth attention vector, L1 is the second scale feature, L2 is the semantic feature, AAP2() denotes the second adaptive mean pooling layer, and AMP2() denotes the second adaptive max pooling layer.
Further, in step S2.1.5, the first attention vector and the second attention vector are nonlinearly mapped by the first multi-layer perceptron to output the first mixed attention vector and the second mixed attention vector, and the first mixed attention vector and the second mixed attention vector are fused by channel splicing to output the first fused mixed attention vector, specifically using the following formula:
Va1=MLP1(C[V1,V2]),
and in step S2.2.5, the third attention vector and the fourth attention vector are nonlinearly mapped by the second multi-layer perceptron to output the third mixed attention vector and the fourth mixed attention vector, and the third mixed attention vector and the fourth mixed attention vector are fused to output the second fused mixed attention vector, specifically using the following formula:
Va2=MLP2(C[V3,V4]),
wherein Va1 is the first fused mixed attention vector, Va2 is the second fused mixed attention vector, MLP1() is the first multi-layer perceptron, and MLP2() is the second multi-layer perceptron.
Further, steps S2.1.6 (normalizing the first fused mixed attention vector to obtain a first normalized mixed attention vector), S2.1.7 (weighting and mapping the first channel feature with the first normalized mixed attention vector) and S2.1.8 (fusing the second feature and the weighted first channel feature to output the semantic features) specifically adopt the following formula:
L2=(Sig1(Va1)⊙F'1)⊕F2,
and steps S2.2.6 (normalizing the second fused mixed attention vector to obtain a second normalized mixed attention vector), S2.2.7 (weighting and mapping the second channel feature with the second normalized mixed attention vector) and S2.2.8 (fusing the semantic features and the weighted second channel feature to obtain the multi-level fusion features) specifically adopt the following formula:
L'2=(Sig2(Va2)⊙L'1)⊕L2,
wherein L2 is the semantic feature, L'2 is the multi-level fusion feature, Sig1() denotes the first activation function, Sig2() denotes the second activation function, F'1 is the first channel feature, L'1 is the second channel feature, H denotes the height of the feature map, W denotes the width of the feature map (the normalized mixed attention vector being broadcast over the H×W spatial positions), ⊙ denotes a pixel-level dot-product operation, and ⊕ denotes a pixel-level dot-addition operation.
Technically, the design of the ARB can be regarded as an information calibration strategy: two attention-based paths predict the importance of each channel in a complementary manner, so that more semantic information is transferred to the low-level features, alleviating the semantic difference among features of different levels and achieving effective feature fusion. The experimental results in the following section verify the effectiveness of this design. It is worth noting that the ARBs have only 0.03M parameters in total, so the entire multi-level feature fusion remains computationally lightweight.
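A minimal PyTorch-style sketch of one ARB is given below for illustration. The use of bilinear interpolation for both the down- and up-sampling directions of the scale alignment, the two-layer 1×1-convolution form of the multi-layer perceptron and the channel-reduction ratio are all assumptions, not details fixed by the invention.

```python
# Sketch of an Attention Refinement Block (ARB); pooling/MLP layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionRefinementBlock(nn.Module):
    def __init__(self, c_src, c_ref, reduction=4):
        super().__init__()
        self.align = nn.Conv2d(c_src, c_ref, kernel_size=1)   # 1x1 conv: match the channel dimension
        self.aap = nn.AdaptiveAvgPool2d(1)                     # adaptive mean pooling path
        self.amp = nn.AdaptiveMaxPool2d(1)                     # adaptive max pooling path
        in_ch = c_src + c_ref
        self.mlp = nn.Sequential(                              # multi-layer perceptron (two 1x1 convs assumed)
            nn.Conv2d(2 * in_ch, in_ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch // reduction, c_ref, 1))

    def forward(self, src, ref):
        # Map the source feature to the scale of the reference feature (down- or up-sampling).
        src = F.interpolate(src, size=ref.shape[2:], mode='bilinear', align_corners=False)
        ch = self.align(src)                                         # channel feature F'
        fused = torch.cat([src, ref], dim=1)                         # channel splicing fusion C[., .]
        vec = torch.cat([self.aap(fused), self.amp(fused)], dim=1)   # both attention vectors
        attn = torch.sigmoid(self.mlp(vec))                          # Sig(MLP(C[V, V])): normalized mixed attention
        return ref + ch * attn                                       # ref ⊕ (F' ⊙ attention)
```

An ARB built this way can serve both roles described above: fusing the 1/4 feature with the 1/8 feature (first ARB) and fusing the 1/16 feature with the intermediate semantic feature (second ARB).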
Further, to enhance the computational efficiency of the context extraction module, the invention proposes a depthwise separable decomposed convolution (DFC) operation instead of a standard convolutional layer. Inspired by depthwise separable convolution and factorized convolution, a main idea of lightweight feature extraction is to integrate the ideas of the two techniques. First, a normalization layer and an activation function are used as two preprocessing steps to improve the regularity of the convolutional layer; next, the 3×3 depthwise convolution is decomposed into two depthwise separable convolution layers with kernel sizes of 3×1 and 1×3, respectively. In this way, the sparsity of the dense convolution kernels is kept uniform across all channels, so that the computational complexity and resource overhead of the convolution are reduced. Finally, the local features of all scales and the global features are fused to obtain the multi-scale context fusion feature.
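One possible realization of such a DFC block is sketched below, again as an assumption-laden illustration: the normalization/activation preprocessing, the 3×1 and 1×3 depthwise layers, a 1×1 pointwise convolution (implied by "depthwise separable" but not spelled out above), and a configurable dilation rate standing in for the "different convolution scales".

```python
# Sketch of a depthwise separable decomposed convolution (DFC); structure partly assumed.
import torch.nn as nn

class DFCBlock(nn.Module):
    def __init__(self, channels, out_channels, dilation=1):
        super().__init__()
        self.pre = nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True))    # preprocessing
        self.dw3x1 = nn.Conv2d(channels, channels, (3, 1), padding=(dilation, 0),
                               dilation=(dilation, 1), groups=channels, bias=False)  # 3x1 depthwise
        self.dw1x3 = nn.Conv2d(channels, channels, (1, 3), padding=(0, dilation),
                               dilation=(1, dilation), groups=channels, bias=False)  # 1x3 depthwise
        self.pw = nn.Conv2d(channels, out_channels, 1, bias=False)                   # 1x1 pointwise (assumed)

    def forward(self, x):
        x = self.pre(x)
        x = self.dw1x3(self.dw3x1(x))
        return self.pw(x)
```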
After the multi-scale context is encoded, the dimension-reduced multi-level fusion features are further combined with the global features and the local features of different scales to predict the final segmentation result. This design has two advantages: on the one hand, multi-level information and multi-scale context information are integrated in a unified framework to achieve a more effective feature representation; on the other hand, the skip connection encourages information transfer and gradient propagation from the earlier multi-level features, thereby improving recognition efficiency.
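Putting the pieces together, the lightweight semantic pyramid module might be assembled as in the sketch below, which reuses the DFCBlock above; the number of branches, the reduced channel width and the dilation rates are illustrative assumptions.

```python
# Sketch of the Lightweight Semantic Pyramid Module (LSPM); hyper-parameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSPMSketch(nn.Module):
    def __init__(self, in_channels, mid_channels=64, dilations=(1, 2, 4), num_classes=19):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, 1)               # S3: dimension reduction
        self.branches = nn.ModuleList(
            [DFCBlock(mid_channels, mid_channels, d) for d in dilations])   # S4: multi-scale context
        self.gap = nn.AdaptiveAvgPool2d(1)                                  # S5: global mean pooling
        fused = mid_channels * (len(dilations) + 1)
        self.out = nn.Conv2d(fused + mid_channels, num_classes, 1)          # S7/S8: fuse skip path and reduce

    def forward(self, x):
        r = self.reduce(x)
        local = [b(r) for b in self.branches]                               # local features (S4)
        g = F.interpolate(self.gap(r), size=r.shape[2:], mode='nearest')    # global feature (S5)
        ctx = torch.cat(local + [g], dim=1)                                 # S6: multi-scale context fusion
        spliced = torch.cat([r, ctx], dim=1)                                # S7: skip connection with r
        return self.out(spliced)                                            # final up-sampling done by the caller
```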
The key points of the technology of the invention are as follows:
(1) The invention discloses a novel IoT-oriented dual-feature fusion real-time semantic segmentation network (DFFNet). Compared with advanced methods, DFFNet reduces FLOPs by about 2.5 times and increases model execution speed by 1.8 times while achieving better accuracy.
(2) A multi-level feature fusion module (MFFM) is proposed, which employs two recursive attention refinement blocks (ARBs) to effectively improve the effectiveness of multi-level feature fusion. With a controllable additional computational cost, the proposed ARB uses the abstract semantic information of high-level features to calibrate the spatial detail information in low-level features, thereby alleviating the semantic difference among the multi-level features.
(3) A lightweight semantic pyramid module (LSPM) is provided, which factorizes the convolution operators to reduce the computational overhead of context information encoding. In addition, the module fuses the multi-level fusion features with the multi-scale context information, enriching the information representation and thereby improving the recognition accuracy.
Further, this embodiment also compares the present invention across multiple data sets against existing methods to verify the effectiveness of the present invention.
Data set: the data set used in the present invention is a recognized standard scene perception data set, cityscaps, consisting of 25000 annotated 2048 × 1024 resolution images. The annotations set contains 30 classes, 19 of which are used for training and evaluation. In the experiments of the present invention, only 5000 images with fine annotations were used. There were 2975 images for training, 500 images for verification, and 1525 images for testing.
Setting parameters: all experiments were in NVIDIA1080 tivpu card. We perform a 0.5 to 1.5 times random scaling of the image and apply a random left-right flip operation randomly on all training sets. Further, the initial learning rate was set to 0.005, and the learning rate was attenuated using a poly strategy. The network was trained using a stochastic gradient descent optimization algorithm by minimizing the pixel cross entropy loss, with a momentum of 0.9 and a weight decay of 5 e-4. Finally, a batch normalization layer is applied before all conventional or expanded convolutional layers to achieve fast convergence.
Evaluation metrics: the invention adopts four evaluation metrics recognized in the semantic segmentation field: segmentation accuracy, inference speed, number of network parameters, and computational complexity.
Ablation experiment on the multi-level feature fusion module:
as shown in table 1, the present invention compares four multi-level feature fusion models with a reference model: elemental Additive Fusion (EAF), average pool attention refinement (AAR), Maximum Attention Refinement (MAR), and the use of AAR and MAR in combination. As shown in the table, the EAF performance is only 1.12% higher than the baseline network, which indicates that directly fusing the multi-level feature is a sub-optimal solution. Compared with a reference network, the AAR and the MAR realize the performance improvement of 2.61% of mIoU and 2.54% of mIoU, which shows that the interdependence relationship between modeling channels can reduce the semantic difference between multi-level features. The bilateral pooling attention strategy provided by the invention mutually compensates the significance information and the global information. Thus, MFFM achieves a further boost of 0.55% and 0.62% mlou compared to AAR and MAR. In addition, the proposed MFFM adds negligible additional calculations (only 0.06M parameter and 0.11GFLOPs), which verifies the efficiency and effectiveness of the proposed module.
Model    Speed (ms)    Params (M)    FLOPs (G)    mIoU (%)
Baseline 15.40 1.82 2.79 67.83
EAF 15.81 1.85 2.90 68.95
AAR 15.80 1.86 2.90 70.44
MAR 15.81 1.86 2.90 70.37
AAR+MAR 16.03 1.88 2.90 70.99
Table 1: Ablation study of the multi-level feature fusion module
Ablation experiment on the lightweight semantic pyramid module:
the test evaluates the performance of the lightweight semantic pyramid module. SC-SPM, FC-SPM, DC-SPM and DFC-SPM respectively represent methods with four semantic pyramid modules, which are built on conventional convolution, decomposed convolution, deep convolution and deep separable deconvolution, respectively. As shown in table 2, 1) compared with the reference model EAF, the semantic segmentation method with the semantic pyramid module can improve the mIOU segmentation accuracy by about 1.11% to 2.70%, which indicates that extracting the local and global context information can significantly improve the learning ability of the model. 2) Although SC-SPM, FC-SPM, DC-SPM, and DFC-SPM achieve similar accuracy performance, building a semantic pyramid module based on efficient convolution achieves better efficiency (faster speed and less computational complexity) than building a module based on conventional convolution. DFC-SPM achieved 71.02% IU, with only 0.05M additional parameters and 0.20G FLOPs. 3) The LSPM integrates context information and multi-level feature information by designing a short-distance feature learning operation, and is used for encouraging information transfer and gradient conduction of front-level multi-level information. Therefore, the accuracy performance of the DFC-SPM method is improved from 71.02% mIoU to 71.65% mIoU. The above results demonstrate the high efficiency and effectiveness of the proposed LSPM.
Model    Speed (ms)    Params (M)    FLOPs (G)    mIoU (%)
EAF 15.81 1.85 2.90 68.95
SC-SPM 16.22 2.11 4.43 70.81
FC-SPM 16.10 2.03 3.72 70.06
DC-SPM 15.76 1.90 3.11 71.00
DFC-SPM 15.72 1.90 3.10 71.02
LSPM 15.65 1.89 3.06 71.65
Table 2: Ablation study of the lightweight semantic pyramid module
Evaluation on the benchmark data set:
on the cityscaps dataset, DFFNet was compared to other existing semantic segmentation methods. "-" indicates that the method does not publish the corresponding performance value.
Table 3: Comprehensive performance of the proposed method and the comparison methods on the Cityscapes dataset
As shown in Table 3, SegNet and ENet improve speed by greatly compressing the model scale at the expense of segmentation accuracy. LW-RefineNet and ERFNet design asymmetric encoder-decoder structures to maintain a balance between accuracy and efficiency. BiSeNet, CANet and ICNet adopt multi-branch structures and achieve a good balance between accuracy and speed, but introduce more additional learnable parameters. In contrast, DFFNet achieves better accuracy and efficiency, particularly in terms of reduced network parameters (1.9M parameters) and computational complexity (3.1 GFLOPs). In addition, FCN and Dilation10 use computationally expensive VGG backbone networks (e.g. VGG16 and VGG19) as feature extractors and require 2 seconds or more to process an image. DRN, DeepLab v2, RefineNet and PSPNet employ deep ResNet backbone networks (e.g. ResNet50 and ResNet101) to enhance the multi-scale feature representation, at the cost of significant computation and memory usage. Compared with these accuracy-oriented methods, the proposed method needs only 12 ms to process an image at 640×360 resolution while achieving a segmentation accuracy of 71.0% mIoU.
In conclusion, the method achieves well-rounded segmentation performance in terms of accuracy and efficiency (inference speed, network parameters and computational complexity), and therefore has great deployment potential on resource-constrained Internet of Things devices.
Embodiment two:
referring to fig. 3, the embodiment provides a semantic segmentation system based on dual-feature fusion of internet of things perception, which includes a multilayer feature fusion module and a lightweight semantic pyramid module connected to each other;
the multi-layer feature fusion module comprises a backbone network unit and a proofreading unit;
the lightweight semantic pyramid module comprises a first dimension reduction unit, a second dimension reduction unit, a third dimension reduction unit, a context coding unit, a global pooling unit, a first channel splicing and fusing unit, a second channel splicing and fusing unit and an upsampling unit;
the backbone network unit is connected with a proofreading unit, the proofreading unit is respectively connected with a first dimension reduction unit and a second dimension reduction unit, the first dimension reduction unit is respectively connected with a context coding unit and a global pooling unit, the context coding unit and the global pooling unit are both connected with a first channel splicing and fusing unit, the second dimension reduction unit and the first channel splicing and fusing unit are both connected with a second channel splicing and fusing unit, the second channel splicing and fusing unit is also connected with a third dimension reduction unit, and an up-sampling unit is connected with the third dimension reduction unit;
the backbone network unit is used for performing feature coding on the original image by using a backbone network to obtain features of different scales;
the proofreading unit is used for learning the features of different scales through the two attention refinement blocks to obtain multi-level fusion features;
the first dimension reduction unit and the second dimension reduction unit are used for reducing the dimension of the multi-level fusion feature so as to output a first dimension reduction feature and a second dimension reduction feature respectively, and the first dimension reduction feature and the second dimension reduction feature are the same;
the context coding unit is used for respectively carrying out context coding on the first dimension reduction characteristics through depth separable convolutions with different convolution scales so as to obtain local characteristics with different scales;
the global pooling unit is used for performing global pooling on the first dimension reduction feature through a global mean pooling layer to obtain a global feature;
the first channel splicing and fusing unit is used for carrying out channel splicing and fusing on the global features and the local features of different scales so as to obtain multi-scale context fusion features;
the second channel splicing and fusing unit is used for carrying out channel splicing and fusing on the second dimension reduction feature and the multi-scale context fusion feature to obtain a splicing feature;
the third dimension reduction unit is used for reducing the dimension of the splicing feature;
and the up-sampling unit is used for up-sampling the splicing characteristics subjected to the dimensionality reduction so as to obtain final output.
It should be noted that the semantic segmentation system based on Internet of Things perception and dual-feature fusion provided in this embodiment corresponds to the method of embodiment one, and is therefore not described herein in detail.
The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention by those skilled in the art should fall within the protection scope of the present invention without departing from the design spirit of the present invention.

Claims (10)

1. A dual-feature fusion semantic segmentation method based on Internet of Things perception, comprising the following steps:
S1. inputting an original image and encoding it with a backbone network to obtain features of different scales;
S2. learning the features of different scales with two attention refinement blocks to obtain multi-level fusion features;
S3. reducing the dimensionality of the multi-level fusion features to obtain dimension-reduced features;
S4. performing context encoding on the dimension-reduced features with depthwise separable convolutions of different kernel scales to obtain local features of different scales;
S5. performing global pooling on the dimension-reduced features with a global mean pooling layer to obtain global features;
S6. performing channel concatenation and fusion of the global features and the local features of different scales to obtain multi-scale context fusion features;
S7. performing channel concatenation and fusion of the dimension-reduced features and the multi-scale context fusion features to obtain concatenated features;
S8. reducing the dimensionality of the concatenated features and up-sampling them to obtain the final output.
2. The dual-feature fusion semantic segmentation method based on Internet of Things perception according to claim 1, wherein step S1 specifically comprises: encoding the original image with the backbone network to obtain a first feature, a second feature and a third feature, wherein the scale of the first feature is 1/4 of the original image scale, the scale of the second feature is 1/8 of the original image scale, and the scale of the third feature is 1/16 of the original image scale.
3. The dual-feature fusion semantic segmentation method based on Internet of Things perception according to claim 2, wherein step S2 comprises the following steps:
S2.1. fusing the first feature and the second feature through a first attention refinement block to output a semantic feature;
S2.2. fusing the semantic feature and the third feature through a second attention refinement block to obtain the multi-level fusion features.
4. The dual-feature fusion semantic segmentation method based on Internet of Things perception according to claim 3, wherein step S2.1 specifically comprises the following steps:
S2.1.1. mapping the first feature to the scale of the second feature through a down-sampling layer to obtain a first scale feature;
S2.1.2. mapping the channel dimension of the first scale feature to the channel dimension of the second feature through a first 1×1 convolution layer to obtain a first channel feature;
S2.1.3. performing channel concatenation and fusion of the first scale feature and the second feature to obtain a first fusion feature;
S2.1.4. inputting the first fusion feature into a first adaptive mean pooling layer and a first adaptive max pooling layer to output a first attention vector and a second attention vector, respectively;
S2.1.5. nonlinearly mapping the first attention vector and the second attention vector through a first multi-layer perceptron layer to output a first mixed attention vector and a second mixed attention vector, and fusing the first mixed attention vector and the second mixed attention vector to output a first fused mixed attention vector;
S2.1.6. normalizing the first fused mixed attention vector to obtain a first normalized mixed attention vector;
S2.1.7. weighting the first channel feature with the first normalized mixed attention vector;
S2.1.8. fusing the second feature and the weighted first channel feature to output the semantic feature.
5. The dual-feature fusion semantic segmentation method based on Internet of Things perception according to claim 4, wherein step S2.2 specifically comprises the following steps:
S2.2.1. mapping the third feature to the scale of the second feature through an up-sampling layer to obtain a second scale feature;
S2.2.2. mapping the channel dimension of the second scale feature to the channel dimension of the second feature through a second 1×1 convolution layer to obtain a second channel feature;
S2.2.3. performing channel concatenation and fusion of the second scale feature and the semantic feature to obtain a second fusion feature;
S2.2.4. inputting the second fusion feature into a second adaptive mean pooling layer and a second adaptive max pooling layer to output a third attention vector and a fourth attention vector, respectively;
S2.2.5. nonlinearly mapping the third attention vector and the fourth attention vector through a second multi-layer perceptron layer to output a third mixed attention vector and a fourth mixed attention vector, and fusing the third mixed attention vector and the fourth mixed attention vector to output a second fused mixed attention vector;
S2.2.6. normalizing the second fused mixed attention vector to obtain a second normalized mixed attention vector;
S2.2.7. weighting the second channel feature with the second normalized mixed attention vector;
S2.2.8. fusing the semantic feature and the weighted second channel feature to obtain the multi-level fusion features.
6. The dual-feature fusion semantic segmentation method based on Internet of Things perception according to claim 5, wherein step S2.1.4 computes the first attention vector and the second attention vector with the following formulas:
V1 = AAP1(C[F1, F2]),
V2 = AMP1(C[F1, F2]),
wherein V1 is the first attention vector, V2 is the second attention vector, F1 is the first scale feature, F2 is the second feature, C[·] denotes channel concatenation and fusion, AAP1(·) denotes the first adaptive mean pooling layer, and AMP1(·) denotes the first adaptive max pooling layer.
7. The dual-feature fusion semantic segmentation method based on Internet of Things perception according to claim 6, wherein step S2.2.4 computes the third attention vector and the fourth attention vector with the following formulas:
V3 = AAP2(C[L1, L2]),
V4 = AMP2(C[L1, L2]),
wherein V3 is the third attention vector, V4 is the fourth attention vector, L1 is the second scale feature, L2 is the semantic feature, AAP2(·) denotes the second adaptive mean pooling layer, and AMP2(·) denotes the second adaptive max pooling layer.
8. The dual-feature fusion semantic segmentation method based on Internet of Things perception according to claim 7, wherein the nonlinear mapping and fusion of step S2.1.5 uses the formula
Va1 = MLP1(C[V1, V2]),
and the nonlinear mapping and fusion of step S2.2.5 uses the formula
Va2 = MLP2(C[V3, V4]),
wherein Va1 is the first fused mixed attention vector, Va2 is the second fused mixed attention vector, MLP1(·) denotes the first multi-layer perceptron layer, and MLP2(·) denotes the second multi-layer perceptron layer.
9. The dual-feature fusion semantic segmentation method based on Internet of Things perception according to claim 8, wherein steps S2.1.6 to S2.1.8 (normalizing the first fused mixed attention vector, weighting the first channel feature with the first normalized mixed attention vector, and fusing the second feature with the weighted first channel feature to output the semantic feature) use the following formula:
L2 = F2 ⊕ (Sig1(Va1) ⊗ F'1),
and steps S2.2.6 to S2.2.8 (normalizing the second fused mixed attention vector, weighting the second channel feature with the second normalized mixed attention vector, and fusing the semantic feature with the weighted second channel feature to obtain the multi-level fusion features) use the following formula:
L'2 = L2 ⊕ (Sig2(Va2) ⊗ L'1),
wherein L2 is the semantic feature, L'2 is the multi-level fusion feature, Sig1(·) denotes the first activation function, Sig2(·) denotes the second activation function, F'1 is the first channel feature, L'1 is the second channel feature, the normalized attention vectors are broadcast over the H×W spatial positions of the feature maps (H denotes the height and W the width of the feature map), ⊗ denotes the pixel-level dot-product operation, and ⊕ denotes the pixel-level addition operation.
10. A dual-feature fusion semantic segmentation system based on Internet of Things perception, comprising a multi-layer feature fusion module and a lightweight semantic pyramid module which are connected;
the multi-layer feature fusion module comprises a backbone network unit and a proofreading unit;
the lightweight semantic pyramid module comprises a first dimension-reduction unit, a second dimension-reduction unit, a third dimension-reduction unit, a context encoding unit, a global pooling unit, a first channel concatenation-fusion unit, a second channel concatenation-fusion unit, and an up-sampling unit;
wherein the backbone network unit is connected to the proofreading unit; the proofreading unit is connected to the first dimension-reduction unit and the second dimension-reduction unit respectively; the first dimension-reduction unit is connected to the context encoding unit and the global pooling unit respectively; the context encoding unit and the global pooling unit are both connected to the first channel concatenation-fusion unit; the second dimension-reduction unit and the first channel concatenation-fusion unit are both connected to the second channel concatenation-fusion unit; the second channel concatenation-fusion unit is further connected to the third dimension-reduction unit; and the up-sampling unit is connected to the third dimension-reduction unit;
the backbone network unit is configured to encode the original image with a backbone network to obtain features of different scales;
the proofreading unit is configured to learn the features of different scales through two attention refinement blocks to obtain multi-level fusion features;
the first dimension-reduction unit and the second dimension-reduction unit are both configured to reduce the dimensionality of the multi-level fusion features, so as to output a first dimension-reduced feature and a second dimension-reduced feature respectively, the first dimension-reduced feature and the second dimension-reduced feature being identical;
the context encoding unit is configured to perform context encoding on the first dimension-reduced feature with depthwise separable convolutions of different kernel scales to obtain local features of different scales;
the global pooling unit is configured to perform global pooling on the first dimension-reduced feature with a global mean pooling layer to obtain global features;
the first channel concatenation-fusion unit is configured to perform channel concatenation and fusion of the global features and the local features of different scales to obtain multi-scale context fusion features;
the second channel concatenation-fusion unit is configured to perform channel concatenation and fusion of the second dimension-reduced feature and the multi-scale context fusion features to obtain concatenated features;
the third dimension-reduction unit is configured to reduce the dimensionality of the concatenated features; and
the up-sampling unit is configured to up-sample the dimension-reduced concatenated features to obtain the final output.
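Read as a whole, claims 4–9 describe the attention refinement block as: resample one feature to the other's spatial scale, project its channels with a 1×1 convolution, concatenate the pair, pool the concatenation with adaptive mean and adaptive max pooling, map the concatenated pooled vectors through an MLP, sigmoid-normalize the result, weight the projected feature with it, and add the weighted feature to the other input. The following PyTorch-style sketch is one possible reading of those steps, not the patented implementation; the module name, the MLP reduction ratio, and all channel sizes are illustrative assumptions.

# A minimal sketch of the attention refinement block of claims 4-9 (assumed names and sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionRefinementBlock(nn.Module):
    """Fuses feature f1 into feature f2 following steps S2.1.1-S2.1.8."""

    def __init__(self, in_ch1: int, in_ch2: int, reduction: int = 4):
        super().__init__()
        # S2.1.2: 1x1 convolution mapping f1's channels to f2's channel count.
        self.proj = nn.Conv2d(in_ch1, in_ch2, kernel_size=1, bias=False)
        fused_ch = in_ch1 + in_ch2
        # S2.1.5: MLP applied to the concatenated pooled vectors (reduction ratio is assumed).
        self.mlp = nn.Sequential(
            nn.Linear(2 * fused_ch, fused_ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(fused_ch // reduction, in_ch2),
        )

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # S2.1.1 / S2.2.1: resample f1 to f2's spatial scale (down- or up-sampling).
        f1 = F.interpolate(f1, size=f2.shape[2:], mode="bilinear", align_corners=False)
        f1_proj = self.proj(f1)                                # S2.1.2: first channel feature
        fused = torch.cat([f1, f2], dim=1)                     # S2.1.3: first fusion feature
        v1 = F.adaptive_avg_pool2d(fused, 1).flatten(1)        # S2.1.4: V1 = AAP(C[F1, F2])
        v2 = F.adaptive_max_pool2d(fused, 1).flatten(1)        # S2.1.4: V2 = AMP(C[F1, F2])
        va = self.mlp(torch.cat([v1, v2], dim=1))              # S2.1.5: Va = MLP(C[V1, V2])
        attn = torch.sigmoid(va).unsqueeze(-1).unsqueeze(-1)   # S2.1.6: normalize
        return f2 + f1_proj * attn                             # S2.1.7-S2.1.8: weight and fuse

For example, AttentionRefinementBlock(64, 128)(f_quarter, f_eighth) would fuse a 64-channel 1/4-scale feature into a 128-channel 1/8-scale feature; fed the 1/16-scale feature and the semantic feature, the same module would play the role of the second refinement block.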
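The lightweight semantic pyramid module of steps S3–S8 (claim 10) can be sketched in the same spirit: a 1×1 dimension reduction whose output is shared by the context and skip branches (claim 10 states the two reduced features are identical), depthwise separable convolutions of several kernel scales plus a global mean pooling branch, channel concatenation of all context branches, concatenation with the reduced feature, a final dimension reduction to class scores, and up-sampling. The sketch below is again an assumption-laden illustration: the kernel sizes, channel counts, use of batch normalization, and output stride are not specified in the claims.

# A minimal sketch of the lightweight semantic pyramid module of steps S3-S8 (assumed sizes).
import torch
import torch.nn as nn

def depthwise_separable(ch: int, kernel: int) -> nn.Sequential:
    """Depthwise separable convolution encoding context at one kernel scale (S4)."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, kernel, padding=kernel // 2, groups=ch, bias=False),
        nn.Conv2d(ch, ch, 1, bias=False),
        nn.BatchNorm2d(ch),
        nn.ReLU(inplace=True),
    )

class LightweightSemanticPyramid(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int, num_classes: int,
                 kernels=(3, 5, 7), upscale: int = 8):
        super().__init__()
        self.upscale = upscale
        # S3: 1x1 dimension reduction, reused for both branches since the claim
        # states the first and second dimension-reduced features are identical.
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)
        # S4: context encoding with depthwise separable convolutions of several scales.
        self.contexts = nn.ModuleList(depthwise_separable(mid_ch, k) for k in kernels)
        # S5: global mean pooling branch.
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        # S8 (first half): reduce the concatenated features to class scores.
        concat_ch = mid_ch * (len(kernels) + 2)  # context branches + global branch + skip branch
        self.classifier = nn.Conv2d(concat_ch, num_classes, kernel_size=1)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        x = self.reduce(fused)                                   # S3
        locals_ = [ctx(x) for ctx in self.contexts]              # S4
        g = self.global_pool(x).expand_as(x)                     # S5 (broadcast back to HxW)
        multi_scale = torch.cat(locals_ + [g], dim=1)            # S6
        spliced = torch.cat([x, multi_scale], dim=1)             # S7
        logits = self.classifier(spliced)                        # S8: dimension reduction
        return nn.functional.interpolate(                        # S8: up-sample to input size
            logits, scale_factor=self.upscale, mode="bilinear", align_corners=False)

With the assumed output stride of 8, LightweightSemanticPyramid(in_ch=128, mid_ch=64, num_classes=19) applied to a 1/8-scale fused feature map returns full-resolution class logits.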
CN202110446945.6A 2021-04-25 2021-04-25 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion Withdrawn CN113221969A (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN202110446945.6A CN113221969A (en) 2021-04-25 2021-04-25 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion
PCT/CN2022/081427 WO2022227913A1 (en) 2021-04-25 2022-03-17 Double-feature fusion semantic segmentation system and method based on internet of things perception
LU503090A LU503090B1 (en) 2021-04-25 2022-03-17 A semantic segmentation system and method based on dual feature fusion for iot sensing
ZA2022/07731A ZA202207731B (en) 2021-04-25 2022-07-12 A semantic segmentation system and method based on dual feature fusion for iot sensing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110446945.6A CN113221969A (en) 2021-04-25 2021-04-25 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion

Publications (1)

Publication Number Publication Date
CN113221969A true CN113221969A (en) 2021-08-06

Family

ID=77088741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110446945.6A Withdrawn CN113221969A (en) 2021-04-25 2021-04-25 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion

Country Status (4)

Country Link
CN (1) CN113221969A (en)
LU (1) LU503090B1 (en)
WO (1) WO2022227913A1 (en)
ZA (1) ZA202207731B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445430A (en) * 2022-04-08 2022-05-06 暨南大学 Real-time image semantic segmentation method and system based on lightweight multi-scale feature fusion
CN114821768A (en) * 2022-03-18 2022-07-29 中国科学院自动化研究所 Skeleton behavior identification method and device and electronic equipment
CN114821042A (en) * 2022-04-27 2022-07-29 南京国电南自轨道交通工程有限公司 An R-FCN knife gate detection method combining local and global features
CN114913325A (en) * 2022-03-24 2022-08-16 北京百度网讯科技有限公司 Semantic segmentation method, device and computer program product
WO2022227913A1 (en) * 2021-04-25 2022-11-03 浙江师范大学 Double-feature fusion semantic segmentation system and method based on internet of things perception
CN116740866A (en) * 2023-08-11 2023-09-12 上海银行股份有限公司 Banknote loading and clearing system and method for self-service machine
CN119399457A (en) * 2024-09-18 2025-02-07 广州大学 A real-time semantic segmentation method and system for multi-shape pyramids in traffic scenes

Families Citing this family (59)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115272677A (en) * 2022-08-01 2022-11-01 安徽理工大学环境友好材料与职业健康研究院(芜湖) Multi-scale feature fusion semantic segmentation method, equipment and storage medium
CN115713624A (en) * 2022-09-02 2023-02-24 郑州大学 Self-adaptive fusion semantic segmentation method for enhancing multi-scale features of remote sensing image
CN115830449B (en) * 2022-12-01 2025-09-02 北京理工大学重庆创新中心 Remote sensing target detection method guided by explicit contours and enhanced by spatially varying context
CN116664875A (en) * 2023-01-16 2023-08-29 河北师范大学 PVT-based saliency target detection method for gating network
CN116030307B (en) * 2023-02-03 2025-08-05 山东大学 Breast pathology image recognition system based on context-aware multi-scale feature fusion
CN116229065B (en) * 2023-02-14 2023-12-01 湖南大学 A segmentation method for robotic surgical instruments based on multi-branch fusion
CN116342884B (en) * 2023-03-28 2024-02-06 阿里云计算有限公司 Image segmentation and model training method and server
CN116052007B (en) * 2023-03-30 2023-08-11 山东锋士信息技术有限公司 Remote sensing image change detection method integrating time and space information
CN116434069B (en) * 2023-04-27 2025-09-19 南京信息工程大学 Remote sensing image change detection method based on local-global transducer network
CN116205928B (en) * 2023-05-06 2023-07-18 南方医科大学珠江医院 Image segmentation processing method, device and equipment for laparoscopic surgery video and medium
CN116580241B (en) * 2023-05-22 2024-05-14 内蒙古农业大学 Image processing method and system based on dual-branch multi-scale semantic segmentation network
CN116630386B (en) * 2023-06-12 2024-02-20 新疆生产建设兵团医院 CTA scanning image processing method and system thereof
CN116721253B (en) * 2023-06-12 2025-08-15 湖南科技大学 Abdominal CT image multi-organ segmentation method based on deep learning
CN117115435B (en) * 2023-06-30 2025-07-18 重庆理工大学 Attention and multi-scale feature extraction-based real-time semantic segmentation method
CN116721351B (en) * 2023-07-06 2024-06-18 内蒙古电力(集团)有限责任公司内蒙古超高压供电分公司 Remote sensing intelligent extraction method for road environment characteristics in overhead line channel
CN116559778B (en) * 2023-07-11 2023-09-29 海纳科德(湖北)科技有限公司 Vehicle whistle positioning method and system based on deep learning
CN116612124B (en) * 2023-07-21 2023-10-20 国网四川省电力公司电力科学研究院 A transmission line defect detection method based on dual-branch serial hybrid attention
CN116721420B (en) * 2023-08-10 2023-10-20 南昌工程学院 A method and system for constructing a semantic segmentation model for UV images of electrical equipment
CN117115443B (en) * 2023-08-18 2024-06-11 中南大学 A segmentation method for identifying small infrared targets
CN117058383B (en) * 2023-08-18 2025-06-20 河南大学 An efficient and lightweight real-time semantic segmentation method for urban street scenes
CN117197763B (en) * 2023-09-07 2025-09-26 湖北工业大学 Road crack detection method and system based on cross-attention guided feature alignment network
CN117095172B (en) * 2023-09-09 2025-07-04 西北工业大学 A continuous semantic segmentation method based on internal and external distillation
CN117314787B (en) * 2023-11-14 2025-07-08 河北工业大学 Underwater image enhancement method based on adaptive multi-scale fusion and attention mechanism
CN117636165A (en) * 2023-11-30 2024-03-01 电子科技大学 A multi-task remote sensing semantic change detection method based on token mixing
CN118212543B (en) * 2023-12-11 2024-10-22 自然资源部国土卫星遥感应用中心 Bilateral fusion and lightweight network improved radiation abnormal target detection method
CN117809294B (en) * 2023-12-29 2024-07-19 天津大学 A text detection method based on feature correction and difference-guided attention
CN117710694B (en) * 2024-01-12 2024-10-22 中国科学院自动化研究所 Multimode characteristic information acquisition method and system, electronic equipment and storage medium
CN117876929B (en) * 2024-01-12 2024-06-21 天津大学 A temporal object localization method based on progressive multi-scale context learning
CN117593633B (en) * 2024-01-19 2024-06-14 宁波海上鲜信息技术股份有限公司 Ocean scene-oriented image recognition method, system, equipment and storage medium
CN117745745B (en) * 2024-02-18 2024-05-10 湖南大学 CT image segmentation method based on context fusion perception
CN118037664B (en) * 2024-02-20 2024-10-01 成都天兴山田车用部品有限公司 Deep hole surface defect detection and CV size calculation method
CN117789153B (en) * 2024-02-26 2024-05-03 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision
CN117828280B (en) * 2024-03-05 2024-06-07 山东新科建工消防工程有限公司 Intelligent fire information acquisition and management method based on Internet of things
CN118052739B (en) * 2024-03-08 2025-01-14 东莞理工学院 A traffic image defogging method and intelligent traffic image processing system based on deep learning
CN117993442B (en) * 2024-03-21 2024-10-18 济南大学 Hybrid neural network method and system for fusing local and global information
CN118072357B (en) * 2024-04-16 2024-07-02 南昌理工学院 Control method and system of intelligent massage robot
CN118429808B (en) * 2024-05-10 2024-12-17 北京信息科技大学 Remote sensing image road extraction method and system based on lightweight network structure
CN118230175B (en) * 2024-05-23 2024-08-13 济南市勘察测绘研究院 Real estate mapping data processing method and system based on artificial intelligence
CN118366000B (en) * 2024-06-14 2024-10-29 陕西天润科技股份有限公司 Cultural relic health management method based on digital twinning
CN118587506A (en) * 2024-06-19 2024-09-03 兰州大学 A deep learning-based atmospheric cloud classification method
CN118397298B (en) * 2024-06-28 2024-09-06 杭州安脉盛智能技术有限公司 Self-attention space pyramid pooling method based on mixed pooling and related components
CN118429335B (en) * 2024-07-02 2024-09-24 新疆胜新复合材料有限公司 Online defect detection system and method for carbon fiber sucker rod
CN118470679B (en) * 2024-07-10 2024-09-24 山东省计算中心(国家超级计算济南中心) A lightweight lane line segmentation and recognition method and system
CN118485835B (en) * 2024-07-16 2024-10-01 杭州电子科技大学 Multispectral image semantic segmentation method based on modal divergence difference fusion
CN118898718B (en) * 2024-07-25 2025-04-18 中国矿业大学 A semantic segmentation method with enhanced boundary perception
CN119168951A (en) * 2024-08-27 2024-12-20 上海茹钰生物科技有限公司 Essence liquid automated production line and method thereof
CN118840559B (en) * 2024-09-20 2024-12-13 泉州职业技术大学 Rail surface defect segmentation method and device based on ordered cross-scale feature interaction
CN119048763B (en) * 2024-10-30 2025-04-08 江西师范大学 A colonoscopy polyp image segmentation method based on hybrid model
CN119068201B (en) * 2024-11-04 2025-04-22 江西师范大学 Image segmentation method and system based on multistage multi-scale gradual fusion network
CN119152075B (en) * 2024-11-11 2025-02-14 浙江大学 Object elimination method and device for environment interaction perception association
CN119649022B (en) * 2024-11-14 2025-07-04 华东交通大学 Real-time semantic segmentation tunnel over-excavation and under-excavation monitoring method and system
CN119151802A (en) * 2024-11-15 2024-12-17 无锡学院 Method, system, equipment and storage medium for fusing infrared image and visible light image
CN119577685B (en) * 2024-11-29 2025-09-26 西安电子科技大学 S-D network full-level sensing-based efficient detection system and detection method thereof
CN119314086B (en) * 2024-12-13 2025-03-25 浙江师范大学 Image matting method
CN119810663B (en) * 2024-12-31 2025-09-30 同济大学 Road extraction method based on mixed attention mechanism and direction prior
CN119888285B (en) * 2025-03-26 2025-07-22 厦门理工学院 A multi-scale image matching method and system
CN120216970B (en) * 2025-05-30 2025-09-19 大连理工大学 Remaining life prediction method and device based on multi-scale decomposition enhancement
CN120374992B (en) * 2025-06-30 2025-09-12 江苏富翰医疗产业发展有限公司 Image segmentation method based on spatial attention mechanism
CN120526608B (en) * 2025-07-25 2025-10-03 湖南工商大学 Road traffic flow prediction method based on spatiotemporal hybrid attention network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210432A (en) * 2020-01-12 2020-05-29 湘潭大学 An image semantic segmentation method based on multi-scale and multi-level attention mechanism
CN111915619A (en) * 2020-06-05 2020-11-10 华南理工大学 A fully convolutional network semantic segmentation method with dual feature extraction and fusion
CN111932553A (en) * 2020-07-27 2020-11-13 北京航空航天大学 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150104102A1 (en) * 2013-10-11 2015-04-16 Universidade De Coimbra Semantic segmentation method with second-order pooling
CN112651973B (en) * 2020-12-14 2022-10-28 南京理工大学 Semantic segmentation method based on cascade of feature pyramid attention and mixed attention
CN113221969A (en) * 2021-04-25 2021-08-06 浙江师范大学 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210432A (en) * 2020-01-12 2020-05-29 湘潭大学 An image semantic segmentation method based on multi-scale and multi-level attention mechanism
CN111915619A (en) * 2020-06-05 2020-11-10 华南理工大学 A fully convolutional network semantic segmentation method with dual feature extraction and fusion
CN111932553A (en) * 2020-07-27 2020-11-13 北京航空航天大学 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIANGYAN TANG et al.: "DFFNet: An IoT-perceptive dual feature fusion network for general real-time semantic segmentation", INFORMATION SCIENCES 565 (2021) 326–343, 12 February 2021 (2021-02-12), pages 2 - 4 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022227913A1 (en) * 2021-04-25 2022-11-03 浙江师范大学 Double-feature fusion semantic segmentation system and method based on internet of things perception
CN114821768A (en) * 2022-03-18 2022-07-29 中国科学院自动化研究所 Skeleton behavior identification method and device and electronic equipment
CN114913325A (en) * 2022-03-24 2022-08-16 北京百度网讯科技有限公司 Semantic segmentation method, device and computer program product
CN114913325B (en) * 2022-03-24 2024-05-10 北京百度网讯科技有限公司 Semantic segmentation method, semantic segmentation device and computer program product
CN114445430A (en) * 2022-04-08 2022-05-06 暨南大学 Real-time image semantic segmentation method and system based on lightweight multi-scale feature fusion
CN114821042A (en) * 2022-04-27 2022-07-29 南京国电南自轨道交通工程有限公司 An R-FCN knife gate detection method combining local and global features
CN114821042B (en) * 2022-04-27 2025-07-22 南京国电南自轨道交通工程有限公司 R-FCN knife switch detection method combining local features and global features
CN116740866A (en) * 2023-08-11 2023-09-12 上海银行股份有限公司 Banknote loading and clearing system and method for self-service machine
CN116740866B (en) * 2023-08-11 2023-10-27 上海银行股份有限公司 Banknote loading and clearing system and method for self-service machine
CN119399457A (en) * 2024-09-18 2025-02-07 广州大学 A real-time semantic segmentation method and system for multi-shape pyramids in traffic scenes
CN119399457B (en) * 2024-09-18 2025-10-03 广州大学 A real-time semantic segmentation method and system for multi-shape pyramids in traffic scenarios

Also Published As

Publication number Publication date
LU503090B1 (en) 2023-03-22
WO2022227913A1 (en) 2022-11-03
ZA202207731B (en) 2022-07-27

Similar Documents

Publication Publication Date Title
CN113221969A (en) Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion
CN112651973B (en) Semantic segmentation method based on cascade of feature pyramid attention and mixed attention
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
CN113516133B (en) Multi-modal image classification method and system
CN112991350A (en) RGB-T image semantic segmentation method based on modal difference reduction
EP4336378A1 (en) Data processing method and related device
CN113486190A (en) Multi-mode knowledge representation method integrating entity image information and entity category information
CN108388900A (en) The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN113033570A (en) Image semantic segmentation method for improving fusion of void volume and multilevel characteristic information
CN113961736B (en) Method, apparatus, computer device and storage medium for text generation image
CN113066089B (en) A Real-time Image Semantic Segmentation Method Based on Attention Guidance Mechanism
CN113919479B (en) Method for extracting data features and related device
CN114863229A (en) Image classification method and training method and device for image classification model
CN114861907A (en) Data computing method, device, storage medium and device
CN114936901A (en) Visual perception recommendation method and system based on cross-modal semantic reasoning and fusion
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
US11948090B2 (en) Method and apparatus for video coding
CN112966672B (en) Gesture recognition method under complex background
CN111652349A (en) A neural network processing method and related equipment
CN118885601A (en) Personalized recommendation method and system based on emotion-aware knowledge graph convolutional network
CN118154866A (en) A city-level point cloud semantic segmentation system and method based on spatial perception
CN116912268A (en) Skin lesion image segmentation method, device, equipment and storage medium
CN119048399B (en) Image restoration method, system, device and medium integrating cross attention
CN112784831A (en) Character recognition method for enhancing attention mechanism by fusing multilayer features
WO2024174583A1 (en) Model training method and apparatus, and device, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Zhu Xinzhong

Inventor after: Xu Huiying

Inventor after: Zhao Jianmin

Inventor before: Zhu Xinzhong

Inventor before: Xu Huiying

Inventor before: Tu Wenxuan

Inventor before: Liu Xinwang

Inventor before: Zhao Jianmin

WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20210806
