CN116664875A - Salient object detection method based on a PVT gated network
- Publication number: CN116664875A
- Application number: CN202310061980.5A
- Authority: CN (China)
- Prior art keywords: feature, decoder, FAD, PVT, encoder
- Prior art date: 2023-01-16
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; no legal analysis has been performed as to the accuracy of the status listed)
Classifications
- G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; salient regional features
- G06N3/08 - Learning methods (computing arrangements based on biological models: neural networks)
- G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
- G06V10/764 - Image or video recognition or understanding using pattern recognition or machine learning: classification, e.g. of video objects
- G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82 - Image or video recognition or understanding using pattern recognition or machine learning: neural networks
- G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
Abstract
The invention discloses a salient object detection method based on a PVT gated network. Global features are extracted through the layer-by-layer fusion of the PVT network; a transition layer is added after each of the four PVT encoders of different scales, multi-level gate units are inserted between the transition layers and the decoder layers, and a pyramid pooling module is introduced at the top of the encoder to extract high-level semantic information, which is propagated top-down. The feature aggregation decoder continuously fuses the high-level semantic information, the effective encoder information and decoder features of different scales through element-wise addition. The invention can extract the content of greatest interest to the human eye in any given image, attends more to salient regions, and suppresses background interference.
Description
Technical Field
The invention relates to a salient object detection method, in particular to a salient object detection method based on a PVT gated network, and belongs to the field of computer vision.
Background Art
In recent years, with the rapid development of the new media industry, digital information such as short videos and image posts has been woven into daily life: tens of thousands of images and videos are uploaded to the Internet every day, greatly enriching people's lives, work and entertainment, and on WeChat alone as many as one billion digital images are uploaded per day. Quickly selecting valuable content from such a mass of digital images is difficult to achieve with human perception alone, which poses a major challenge for computer vision research.
Digital images reflect what people want to express very intuitively and are widely used in network transmission, but such large volumes of data are hard for people to identify effectively, and even with the assistance of computing resources the burden is heavy. It is therefore necessary to extract the important content from digital images, so that limited computing resources can be used in an integrated way and the burden of collecting useful information is reduced. On the other hand, the human visual system can concentrate on the important regions of an image, such as the face in a social media selfie or the ingredients in a food photo, without attending to useless information in the background, an ability that plays an important role in fast image processing. How to bring this ability to fixate on the key regions of an image into computers, so as to help people complete more complex image processing tasks, is therefore a focus of current computer vision research.
In recent years, U-shaped structures have received great attention because they build rich feature maps through multi-level top-down paths and achieve good performance, and many salient object detection networks now adopt a U-shaped multi-scale hierarchical encoder-decoder structure as their basic architecture. However, these methods connect encoder features to the decoder directly through cross-layer connections, with no interference control between the two: misleading contextual information from the encoder is carried into the decoder, so the truly useful features cannot be fully exploited. In addition, existing methods based on deep fully convolutional networks (FCNs) mostly use pre-trained image classification models such as VGG and ResNet as encoders and focus on designing effective decoders that aggregate multi-level features. A CNN, however, is structured to model the aggregation of local information and has difficulty modelling long-range dependencies, so accurately and completely extracting salient objects from complex scenes remains very challenging. The Pyramid Vision Transformer (PVT) opens a new avenue for extracting salient objects completely.
Summary of the Invention
The object of the present invention is to provide a salient object detection method based on a PVT gated network.
To solve the above technical problems, the technical solution of the present invention is a salient object detection method based on a PVT gated network, comprising the following steps:
Step 1, image preprocessing: resize the input image to a tensor X of a preset size;
Step 2, building the PVT gated network: the PVT gated network comprises first to fourth feature processing units and a pyramid pooling module PPM; the first to third feature processing units have the same structure; the first feature processing unit comprises a feature encoder PVTE1, a transition layer T1, a gate unit G1 and a feature aggregation decoder FAD1; the fourth feature processing unit comprises a feature encoder PVTE4, a transition layer T4, a gate unit G4 and a feature aggregation decoder FAD4; the tensor X is processed by the feature encoders PVTE1-PVTE4 in sequence to obtain first to fourth feature tensors;
The fourth feature tensor is processed by the pyramid pooling module PPM to obtain fourth high-level semantic information; the fourth feature tensor, after being processed by the transition layer T4, is concatenated with the fourth feature tensor and input to the gate unit G4, whose output is multiplied with the output of the transition layer T4 to obtain fourth effective encoder information; the fourth effective encoder information is added to the fourth high-level semantic information and input to the feature aggregation decoder FAD4, which outputs a fourth decoded feature vector;
The first feature tensor is concatenated with the second decoded feature vector output by FAD2 of the second feature processing unit and input to the gate unit G1; the first feature tensor, after being processed by the transition layer T1, is multiplied with the output of the gate unit G1 to obtain first effective encoder information; the first effective encoder information, the fourth high-level semantic information and the second decoded feature vector are added together and input to the feature aggregation decoder FAD1, which outputs the first decoded feature vector as the saliency map;
Step 3, detecting the saliency map: the input tensor is processed by the PVT gated network to obtain the saliency map.
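To make the data flow of steps 1-3 concrete, the following is a minimal PyTorch sketch of the network wiring, not the patented implementation itself: the encoder is assumed to return the four PVT feature tensors in fine-to-coarse order, and GatedPVTNet, the transition/gate/FAD/PPM sub-modules and the 1×1 semantic-reduction convolutions are illustrative names standing in for the components detailed below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPVTNet(nn.Module):
    """Top-down gated decoding over four PVT stages (illustrative sketch)."""
    def __init__(self, encoder, transitions, gates, fads, ppm, sem_convs):
        super().__init__()
        self.encoder = encoder               # x -> [E1, E2, E3, E4]
        self.T = nn.ModuleList(transitions)  # 3x3 convs reducing channels
        self.G = nn.ModuleList(gates)        # multi-level gate units
        self.FAD = nn.ModuleList(fads)       # feature aggregation decoders
        self.ppm = ppm                       # pyramid pooling on the top stage
        self.sem = nn.ModuleList(sem_convs)  # 1x1 convs reducing the PPM output

    def forward(self, x):                    # x: (B, 3, 384, 384)
        e = self.encoder(x)                  # first to fourth feature tensors
        sem = self.ppm(e[3])                 # high-level semantic information
        dec = None
        for i in (3, 2, 1, 0):               # top-down decoding path
            t = self.T[i](e[i])              # transition layer output
            # Gate input: the encoder feature together with the higher-level
            # decoder output (or, at the top level, the transition output).
            g = self.G[i](e[i], dec if dec is not None else t)
            eff = g * t                      # effective encoder information
            s = self.sem[i](F.interpolate(sem, size=t.shape[2:],
                                          mode='bilinear', align_corners=False))
            fused = eff + s if dec is None else eff + s + dec
            dec = self.FAD[i](fused)         # aggregates, then upsamples 2x (4x at i=0)
        return dec                           # saliency map, (B, 1, 384, 384)
```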
Further, the tensor size is 384×384×3, and the sizes of the first to fourth feature tensors are 96×96×64, 48×48×128, 24×24×320 and 12×12×512, respectively.
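These widths and resolutions (channels 64/128/320/512 at strides 4/8/16/32) match the PVTv2-B2 backbone. As a quick, hedged check, assuming the timm library's pvt_v2_b2 model and its features_only interface (assumptions about timm, not part of this disclosure), the stage shapes for a 384×384×3 input can be verified as follows:

```python
import timm
import torch

# Model name and features_only behaviour are timm assumptions.
encoder = timm.create_model('pvt_v2_b2', pretrained=True, features_only=True)
with torch.no_grad():
    feats = encoder(torch.randn(1, 3, 384, 384))
for f in feats:
    print(tuple(f.shape))
# expected: (1, 64, 96, 96), (1, 128, 48, 48), (1, 320, 24, 24), (1, 512, 12, 12)
```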
Further, the feature encoder PVTEi and the decoder feature FADi+1 are integrated, and the gate value is then computed through convolution, activation and pooling operations:

g_i = P(S(Conv(Cat(PVTE_i, FAD_{i+1}))))

where Cat(·) is concatenation along the channel axis, Conv(·) is a convolution operation, S(·) is the element-wise sigmoid function, and P(·) is global average pooling;
The gate values are applied to weight the transition-layer features T1-T4; the transition layers T1-T4 are generated by reducing the dimensionality of the feature encoder outputs PVTE1-PVTE4 with 3×3 convolutions.
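A minimal PyTorch sketch of one such gate unit follows. The channel arguments are illustrative assumptions, and the convolution is taken as a single 3×3 layer since the text does not fix its depth:

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """Computes g_i = P(S(Conv(Cat(enc, dec)))) as in the gate formula above."""
    def __init__(self, enc_ch, dec_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(enc_ch + dec_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling P(.)

    def forward(self, enc_feat, dec_feat):
        x = torch.cat([enc_feat, dec_feat], dim=1)   # Cat(.) along the channel axis
        x = torch.sigmoid(self.conv(x))              # Conv(.) then element-wise S(.)
        return self.pool(x)                          # gate value, shape (B, out_ch, 1, 1)
```

The returned gate value broadcasts over the spatial dimensions when multiplied with the transition-layer feature Ti, which is how the weighting above can be realised.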
Further, the pyramid pooling module PPM performs adaptive average pooling at four different scales on the fourth feature tensor, obtaining four feature maps of sizes 1×1, 2×2, 3×3 and 6×6; a 1×1 convolution reduces the channels at each level to 1/4 of the original; the size before pooling is recovered by bilinear interpolation, and the result is concatenated with the fourth feature tensor along the channel dimension.
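A minimal PyTorch sketch of this PPM follows, under the assumption that the four upsampled branches are simply concatenated with the input along the channel dimension as described (the text specifies no further fusion convolution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid pooling: adaptive average pooling to 1x1, 2x2, 3x3 and 6x6,
    a 1x1 conv reducing each branch to C/4 channels, bilinear upsampling,
    then concatenation with the input feature (illustrative sketch)."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // 4, kernel_size=1))
            for b in bins)

    def forward(self, x):                  # x: the fourth feature tensor
        h, w = x.shape[2:]
        outs = [x]
        for branch in self.branches:
            y = branch(x)                  # pool to b x b, reduce to C/4 channels
            outs.append(F.interpolate(y, size=(h, w), mode='bilinear',
                                      align_corners=False))
        return torch.cat(outs, dim=1)      # C + 4*(C/4) = 2C output channels
```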
Further, the output of the feature aggregation decoder is:

FAD_i = FAD(g_i × T_i + Up(PPM) + FAD_{i+1})

with the FAD_{i+1} term omitted for i = 4, where the output of the feature aggregation decoder FAD1 is a saliency map of the same size as the input image;

the features of the feature encoder PVTEi are input into the transition layer Ti; the transition layer Ti reduces the number of channels and feeds the result to the gate unit Gi; the output of the pyramid pooling module PPM is reduced in dimension and upsampled by bilinear interpolation so that its dimensions and scale match the output of Ti; the effective encoder information, the PPM output and the output of the feature aggregation decoder FADi+1 are fused and then passed into the feature aggregation decoder FADi.
Furthermore, the feature aggregation decoder FAD first average-pools the input feature map at the three down-sampling rates {2, 4, 8}, and then upsamples by bilinear interpolation to obtain an output feature map of the original size.
By adopting the above technical solution, the present invention achieves the following technical effects:
By introducing the Pyramid Vision Transformer, the invention can model global dependencies powerfully and obtain stronger and more robust features. By inserting gate modules between the encoder and the decoder to screen information, the more effective contextual information of the encoder is passed into the decoder, so that it attends more to salient regions and suppresses background interference. By introducing the pyramid pooling module PPM at the top of the encoder, the receptive field is enlarged to collect high-level semantic information, and a top-down progressive path propagates this high-level semantic information to the pyramid features at every level, compensating for the gradual dilution of the top-down signal in U-shaped networks. Finally, by introducing the feature aggregation decoder FAD after the various features have been fused by element-wise addition, the fusion of high-level semantic information, effective encoder information and decoder features of different scales along the top-down path is better balanced.
Brief Description of the Drawings
Figure 1 is a framework diagram of the present invention.
Figure 2 is a structural diagram of the gate unit module of the present invention.
Figure 3 is a structural diagram of the feature aggregation decoder FAD of the present invention.
Figure 4 is the input image of Embodiment 1 of the present invention.
Figure 5 is the saliency map detected in Embodiment 1 of the present invention.
Detailed Description of the Embodiments
The following embodiments serve to illustrate the invention.
Embodiment 1
Referring to Figure 1, a salient object detection method based on a PVT gated network comprises the following steps:
Step 1, image preprocessing: resize the input image to a tensor X of a preset size; in this embodiment the tensor size is 384×384×3;
Step 2, building the PVT gated network: the PVT gated network comprises first to fourth feature processing units and a pyramid pooling module PPM; the first to third feature processing units have the same structure; the first feature processing unit comprises a feature encoder PVTE1, a transition layer T1, a gate unit G1 and a feature aggregation decoder FAD1; the fourth feature processing unit comprises a feature encoder PVTE4, a transition layer T4, a gate unit G4 and a feature aggregation decoder FAD4; the tensor X is processed by the feature encoders PVTE1-PVTE4 in sequence to obtain first to fourth feature tensors;
The fourth feature tensor is processed by the pyramid pooling module PPM to obtain fourth high-level semantic information; the fourth feature tensor, after being processed by the transition layer T4, is concatenated with the fourth feature tensor and input to the gate unit G4, whose output is multiplied with the output of the transition layer T4 to obtain fourth effective encoder information; the fourth effective encoder information is added to the fourth high-level semantic information and input to the feature aggregation decoder FAD4, which outputs a fourth decoded feature vector;
The first feature tensor is concatenated with the second decoded feature vector output by FAD2 of the second feature processing unit and input to the gate unit G1; the first feature tensor, after being processed by the transition layer T1, is multiplied with the output of the gate unit G1 to obtain first effective encoder information; the first effective encoder information, the fourth high-level semantic information and the second decoded feature vector are added together and input to the feature aggregation decoder FAD1, which outputs the first decoded feature vector as the saliency map;
Step 3, detecting the saliency map: the input tensor is processed by the PVT gated network to obtain the saliency map.
In this embodiment, the sizes of the first to fourth feature tensors are 96×96×64, 48×48×128, 24×24×320 and 12×12×512, respectively;
The feature encoder PVTEi and the decoder feature FADi+1 are integrated, and the gate value is then computed through convolution, activation and pooling operations:

g_i = P(S(Conv(Cat(PVTE_i, FAD_{i+1}))))

where Cat(·) is concatenation along the channel axis, Conv(·) is a convolution operation, S(·) is the element-wise sigmoid function, and P(·) is global average pooling;
The gate values are applied to weight the transition-layer features T1-T4; the transition layers T1-T4 are generated by reducing the dimensionality of the feature encoder outputs PVTE1-PVTE4 with 3×3 convolutions. Through the multi-level gate units, the information flowing from the different encoder blocks to the decoder can be suppressed and balanced; the multi-level gate units significantly suppress the interference from each encoder block and enhance the contrast between salient and non-salient regions.
The pyramid pooling module PPM is introduced at the top of the encoder, and the high-level semantic features learned by the feature encoder PVTE4 are input into the pyramid pooling module PPM; multi-scale pooled features are obtained through different pooling operations, further enlarging the receptive field, collecting global contextual information, and capturing the exact positions of salient objects more accurately;
The pyramid pooling module PPM performs adaptive average pooling at four different scales on the fourth feature tensor, obtaining four feature maps of sizes 1×1, 2×2, 3×3 and 6×6; a 1×1 convolution reduces the channels at each level to 1/4 of the original; the size before pooling is recovered by bilinear interpolation, and the results are concatenated along the channel dimension together with the fourth feature tensor, finally outputting a composite feature map that blends multiple scales, thereby taking both global semantic information and local detail into account; the features of the transition layers T1-T4 at different levels are continuously fused by element-wise addition;
The encoded features obtained by the feature encoder PVTEi are input into the transition layer Ti; to reduce the number of parameters, the transition layer Ti reduces the number of channels and feeds the result to the gate unit Gi; the PPM features are reduced in dimension and upsampled by bilinear interpolation so that their dimensions and scale match those of Ti; the effective encoder information, the output of the pyramid pooling module PPM and the output of the feature aggregation decoder FADi+1 are fused through element-wise addition and convolution layers and passed into the feature aggregation decoder FADi; the output process of each decoder layer can be expressed as:

FAD_i = FAD(g_i × T_i + Up(PPM) + FAD_{i+1})
where the feature aggregation decoder FAD1 outputs a single-channel feature map of the same size as the input image;
To prevent the high-level semantic information from being diluted along the top-down path, the high-level features provided by the pyramid pooling module PPM are aggregated directly into the feature maps of every feature level, providing multi-scale information for the decoders at all levels;
After the various features have been continuously fused by element-wise addition, the feature aggregation decoder FAD is introduced to combine the effective encoder features, the high-level features and the features at every level of the top-down path; the fused feature maps are converted into multiple feature spaces to obtain local contextual information at different scales, and this information is then combined so that feature maps of different scales can be fused seamlessly;
The feature aggregation decoder FAD proceeds as follows:
The feature map fused by element-wise addition is taken as input, and a pooling trick converts the fused feature map into multiple feature spaces: the input feature map is first average-pooled at the three down-sampling rates {2, 4, 8} to obtain feature maps of three different sizes;
The three feature maps of different sizes are upsampled back to the original size, so that all branches have the same size;
The four pixel branches, including the input of the feature aggregation decoder FAD, are added together to obtain the integrated feature map;
This map is fed into a first 3×3 convolution layer, which leaves the scale and dimension unchanged;
A second 3×3 convolution layer then reduces the dimension, so that the output of the feature aggregation decoder FADi (i = 2, 3, 4) has the same dimension as the output of the lower-level transition layer Ti-1, while the dimension of FADi (i = 1) is reduced directly to 1.
Finally, the feature aggregation decoder FADi (i = 2, 3, 4) is upsampled by bilinear interpolation to match the scale of the transition layer Ti-1, and FADi (i = 1) is upsampled 4×, bringing the scale to 384×384.
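Collecting the five steps above, a minimal PyTorch sketch of Embodiment 1's FAD might read as follows; ch, out_ch and up_factor are set per level by the caller as described (out_ch = 1 and up_factor = 4 for FAD1, otherwise the width of Ti-1 and up_factor = 2):

```python
import torch.nn as nn
import torch.nn.functional as F

class FAD(nn.Module):
    """Feature aggregation decoder of Embodiment 1 (illustrative sketch):
    average-pool the fused input at rates {2, 4, 8}, upsample each branch
    back, add the four branches, apply two 3x3 convolutions, then upsample."""
    def __init__(self, ch, out_ch, up_factor=2):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)      # keeps dims
        self.conv2 = nn.Conv2d(ch, out_ch, kernel_size=3, padding=1)  # reduces dims
        self.up_factor = up_factor

    def forward(self, x):                       # x: element-wise fused feature map
        h, w = x.shape[2:]
        agg = x                                 # branch 1: the FAD input itself
        for s in (2, 4, 8):                     # three down-sampled branches
            y = F.avg_pool2d(x, kernel_size=s, stride=s)
            agg = agg + F.interpolate(y, size=(h, w), mode='bilinear',
                                      align_corners=False)
        y = self.conv2(self.conv1(agg))         # two 3x3 convolution layers
        return F.interpolate(y, scale_factor=self.up_factor,
                             mode='bilinear', align_corners=False)
```

Embodiment 2 would differ only in the pooling rates ({4, 16, 64}), and Embodiment 3 in replacing the final bilinear upsampling with a super-resolution method.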
Embodiment 2
This embodiment differs from Embodiment 1 in the processing of the feature aggregation decoder FAD:
The feature map fused by element-wise addition is taken as input, and a pooling trick converts the fused feature map into multiple feature spaces: the input feature map is first average-pooled at the three down-sampling rates {4, 16, 64} to obtain feature maps of three different sizes, which are then upsampled back to the original size so that all branches have the same size.
Embodiment 3
This embodiment differs from Embodiments 1 and 2 in that the feature aggregation decoder FAD first average-pools the input feature map at the three down-sampling rates {2, 4, 8}, and then upsamples by a super-resolution method to obtain an output feature map of the original size; the edges of a feature map upsampled by the super-resolution method are sharper.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310061980.5A CN116664875A (en) | 2023-01-16 | 2023-01-16 | Salient object detection method based on a PVT gated network |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116664875A true CN116664875A (en) | 2023-08-29 |
Family
ID=87719492
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116664875A (en) |