CN116664875A - Salient object detection method based on a PVT gated network
- Publication number: CN116664875A
- Application number: CN202310061980.5A
- Authority: CN (China)
- Prior art keywords: feature, decoder, FAD, PVT, encoder
- Prior art date: 2023-01-16
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; no legal analysis has been performed as to the accuracy of the status listed)
Classifications
- G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; salient regional features
- G06N3/08 - Learning methods (computing arrangements based on biological models: neural networks)
- G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
- G06V10/764 - Image or video recognition or understanding using pattern recognition or machine learning: classification, e.g. of video objects
- G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82 - Image or video recognition or understanding using pattern recognition or machine learning: neural networks
- G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
Abstract
The invention discloses a salient object detection method based on a PVT gated network. Global features are extracted through the layer-by-layer fusion of the PVT network; a transition layer is added after each of the four PVT encoders of different scales, multi-level gate units are inserted between the transition layers and the decoder layers, and a pyramid pooling module is introduced at the top of the encoder to extract high-level semantic information, which is propagated top-down. The feature aggregation decoder continuously fuses the high-level semantic information, the effective encoder information and decoder features of different scales through element-wise addition. The invention can extract the content of greatest interest to the human eye in any given image, attends more to salient regions, and suppresses background interference.
Description
Technical Field
The invention relates to a salient object detection method, in particular to a salient object detection method based on a PVT gated network, and belongs to the field of computer vision.
Background Art
In recent years, with the rapid development of the new media industry, digital information such as short videos and image posts has been woven into daily life: tens of thousands of images and videos are uploaded to the Internet every day, greatly enriching people's lives, work and entertainment, and on WeChat alone as many as one billion digital images are uploaded per day. Quickly selecting valuable content from such a mass of digital images is difficult to achieve with human perception alone, which poses a major challenge for computer vision research.
Digital images reflect what people want to express very intuitively and are widely used in network transmission, but such large volumes of data are hard for people to identify effectively, and even with the assistance of computing resources the burden is heavy. It is therefore necessary to extract the important content from digital images, so that limited computing resources can be used in an integrated way and the burden of collecting useful information is reduced. On the other hand, the human visual system can concentrate on the important regions of an image, such as the face in a social media selfie or the ingredients in a food photo, without attending to useless information in the background, an ability that plays an important role in fast image processing. How to bring this ability to fixate on the key regions of an image into computers, so as to help people complete more complex image processing tasks, is therefore a focus of current computer vision research.
In recent years, U-shaped structures have received great attention because they build rich feature maps through multi-level top-down paths and achieve good performance, and many salient object detection networks now adopt a U-shaped multi-scale hierarchical encoder-decoder structure as their basic architecture. However, these methods connect encoder features to the decoder directly through cross-layer connections, with no interference control between the two: misleading contextual information from the encoder is carried into the decoder, so the truly useful features cannot be fully exploited. In addition, existing methods based on deep fully convolutional networks (FCNs) mostly use pre-trained image classification models such as VGG and ResNet as encoders and focus on designing effective decoders that aggregate multi-level features. A CNN, however, is structured to model the aggregation of local information and has difficulty modelling long-range dependencies, so accurately and completely extracting salient objects from complex scenes remains very challenging. The Pyramid Vision Transformer (PVT) opens a new avenue for extracting salient objects completely.
Summary of the Invention
The object of the present invention is to provide a salient object detection method based on a PVT gated network.
To solve the above technical problems, the technical solution of the present invention is a salient object detection method based on a PVT gated network, comprising the following steps:
Step 1, image preprocessing: resize the input image to a tensor X of a preset size;
Step 2, building the PVT gated network: the PVT gated network comprises first to fourth feature processing units and a pyramid pooling module PPM; the first to third feature processing units have the same structure; the first feature processing unit comprises a feature encoder PVTE1, a transition layer T1, a gate unit G1 and a feature aggregation decoder FAD1; the fourth feature processing unit comprises a feature encoder PVTE4, a transition layer T4, a gate unit G4 and a feature aggregation decoder FAD4; the tensor X is processed by the feature encoders PVTE1-PVTE4 in sequence to obtain first to fourth feature tensors;
The fourth feature tensor is processed by the pyramid pooling module PPM to obtain fourth high-level semantic information; the fourth feature tensor, after being processed by the transition layer T4, is concatenated with the fourth feature tensor and input to the gate unit G4, whose output is multiplied with the output of the transition layer T4 to obtain fourth effective encoder information; the fourth effective encoder information is added to the fourth high-level semantic information and input to the feature aggregation decoder FAD4, which outputs a fourth decoded feature vector;
The first feature tensor is concatenated with the second decoded feature vector output by FAD2 of the second feature processing unit and input to the gate unit G1; the first feature tensor, after being processed by the transition layer T1, is multiplied with the output of the gate unit G1 to obtain first effective encoder information; the first effective encoder information, the fourth high-level semantic information and the second decoded feature vector are added together and input to the feature aggregation decoder FAD1, which outputs the first decoded feature vector as the saliency map;
Step 3, detecting the saliency map: the input tensor is processed by the PVT gated network to obtain the saliency map.
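To make the data flow of steps 1-3 concrete, the following is a minimal PyTorch sketch of the network wiring, not the patented implementation itself: the encoder is assumed to return the four PVT feature tensors in fine-to-coarse order, and GatedPVTNet, the transition/gate/FAD/PPM sub-modules and the 1×1 semantic-reduction convolutions are illustrative names standing in for the components detailed below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPVTNet(nn.Module):
    """Top-down gated decoding over four PVT stages (illustrative sketch)."""
    def __init__(self, encoder, transitions, gates, fads, ppm, sem_convs):
        super().__init__()
        self.encoder = encoder               # x -> [E1, E2, E3, E4]
        self.T = nn.ModuleList(transitions)  # 3x3 convs reducing channels
        self.G = nn.ModuleList(gates)        # multi-level gate units
        self.FAD = nn.ModuleList(fads)       # feature aggregation decoders
        self.ppm = ppm                       # pyramid pooling on the top stage
        self.sem = nn.ModuleList(sem_convs)  # 1x1 convs reducing the PPM output

    def forward(self, x):                    # x: (B, 3, 384, 384)
        e = self.encoder(x)                  # first to fourth feature tensors
        sem = self.ppm(e[3])                 # high-level semantic information
        dec = None
        for i in (3, 2, 1, 0):               # top-down decoding path
            t = self.T[i](e[i])              # transition layer output
            # Gate input: the encoder feature together with the higher-level
            # decoder output (or, at the top level, the transition output).
            g = self.G[i](e[i], dec if dec is not None else t)
            eff = g * t                      # effective encoder information
            s = self.sem[i](F.interpolate(sem, size=t.shape[2:],
                                          mode='bilinear', align_corners=False))
            fused = eff + s if dec is None else eff + s + dec
            dec = self.FAD[i](fused)         # aggregates, then upsamples 2x (4x at i=0)
        return dec                           # saliency map, (B, 1, 384, 384)
```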
Further, the tensor size is 384×384×3, and the sizes of the first to fourth feature tensors are 96×96×64, 48×48×128, 24×24×320 and 12×12×512, respectively.
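These widths and resolutions (channels 64/128/320/512 at strides 4/8/16/32) match the PVTv2-B2 backbone. As a quick, hedged check, assuming the timm library's pvt_v2_b2 model and its features_only interface (assumptions about timm, not part of this disclosure), the stage shapes for a 384×384×3 input can be verified as follows:

```python
import timm
import torch

# Model name and features_only behaviour are timm assumptions.
encoder = timm.create_model('pvt_v2_b2', pretrained=True, features_only=True)
with torch.no_grad():
    feats = encoder(torch.randn(1, 3, 384, 384))
for f in feats:
    print(tuple(f.shape))
# expected: (1, 64, 96, 96), (1, 128, 48, 48), (1, 320, 24, 24), (1, 512, 12, 12)
```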
Further, the feature encoder PVTEi and the decoder feature FADi+1 are integrated, and the gate value is then computed through convolution, activation and pooling operations:

g_i = P(S(Conv(Cat(PVTE_i, FAD_{i+1}))))

where Cat(·) is concatenation along the channel axis, Conv(·) is a convolution operation, S(·) is the element-wise sigmoid function, and P(·) is global average pooling;
The gate values are applied to weight the transition-layer features T1-T4; the transition layers T1-T4 are generated by reducing the dimensionality of the feature encoder outputs PVTE1-PVTE4 with 3×3 convolutions.
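A minimal PyTorch sketch of one such gate unit follows. The channel arguments are illustrative assumptions, and the convolution is taken as a single 3×3 layer since the text does not fix its depth:

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """Computes g_i = P(S(Conv(Cat(enc, dec)))) as in the gate formula above."""
    def __init__(self, enc_ch, dec_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(enc_ch + dec_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling P(.)

    def forward(self, enc_feat, dec_feat):
        x = torch.cat([enc_feat, dec_feat], dim=1)   # Cat(.) along the channel axis
        x = torch.sigmoid(self.conv(x))              # Conv(.) then element-wise S(.)
        return self.pool(x)                          # gate value, shape (B, out_ch, 1, 1)
```

The returned gate value broadcasts over the spatial dimensions when multiplied with the transition-layer feature Ti, which is how the weighting above can be realised.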
Further, the pyramid pooling module PPM performs adaptive average pooling at four different scales on the fourth feature tensor, obtaining four feature maps of sizes 1×1, 2×2, 3×3 and 6×6; a 1×1 convolution reduces the channels at each level to 1/4 of the original; the size before pooling is recovered by bilinear interpolation, and the result is concatenated with the fourth feature tensor along the channel dimension.
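A minimal PyTorch sketch of this PPM follows, under the assumption that the four upsampled branches are simply concatenated with the input along the channel dimension as described (the text specifies no further fusion convolution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid pooling: adaptive average pooling to 1x1, 2x2, 3x3 and 6x6,
    a 1x1 conv reducing each branch to C/4 channels, bilinear upsampling,
    then concatenation with the input feature (illustrative sketch)."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // 4, kernel_size=1))
            for b in bins)

    def forward(self, x):                  # x: the fourth feature tensor
        h, w = x.shape[2:]
        outs = [x]
        for branch in self.branches:
            y = branch(x)                  # pool to b x b, reduce to C/4 channels
            outs.append(F.interpolate(y, size=(h, w), mode='bilinear',
                                      align_corners=False))
        return torch.cat(outs, dim=1)      # C + 4*(C/4) = 2C output channels
```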
Further, the output of the feature aggregation decoder is:

FAD_i = FAD(g_i × T_i + Up(PPM) + FAD_{i+1})

with the FAD_{i+1} term omitted for i = 4, where the output of the feature aggregation decoder FAD1 is a saliency map of the same size as the input image;

the features of the feature encoder PVTEi are input into the transition layer Ti; the transition layer Ti reduces the number of channels and feeds the result to the gate unit Gi; the output of the pyramid pooling module PPM is reduced in dimension and upsampled by bilinear interpolation so that its dimensions and scale match the output of Ti; the effective encoder information, the PPM output and the output of the feature aggregation decoder FADi+1 are fused and then passed into the feature aggregation decoder FADi.
Furthermore, the feature aggregation decoder FAD first average-pools the input feature map at the three down-sampling rates {2, 4, 8}, and then upsamples by bilinear interpolation to obtain an output feature map of the original size.
By adopting the above technical solution, the present invention achieves the following technical effects:
By introducing the Pyramid Vision Transformer, the invention can model global dependencies powerfully and obtain stronger and more robust features. By inserting gate modules between the encoder and the decoder to screen information, the more effective contextual information of the encoder is passed into the decoder, so that it attends more to salient regions and suppresses background interference. By introducing the pyramid pooling module PPM at the top of the encoder, the receptive field is enlarged to collect high-level semantic information, and a top-down progressive path propagates this high-level semantic information to the pyramid features at every level, compensating for the gradual dilution of the top-down signal in U-shaped networks. Finally, by introducing the feature aggregation decoder FAD after the various features have been fused by element-wise addition, the fusion of high-level semantic information, effective encoder information and decoder features of different scales along the top-down path is better balanced.
Brief Description of the Drawings
Figure 1 is a framework diagram of the present invention.
Figure 2 is a structural diagram of the gate unit module of the present invention.
Figure 3 is a structural diagram of the feature aggregation decoder FAD of the present invention.
Figure 4 is the input image of Embodiment 1 of the present invention.
Figure 5 is the saliency map detected in Embodiment 1 of the present invention.
Detailed Description of the Embodiments
The following embodiments serve to illustrate the invention.
Embodiment 1
Referring to Figure 1, a salient object detection method based on a PVT gated network comprises the following steps:
Step 1, image preprocessing: resize the input image to a tensor X of a preset size; in this embodiment the tensor size is 384×384×3;
Step 2, building the PVT gated network: the PVT gated network comprises first to fourth feature processing units and a pyramid pooling module PPM; the first to third feature processing units have the same structure; the first feature processing unit comprises a feature encoder PVTE1, a transition layer T1, a gate unit G1 and a feature aggregation decoder FAD1; the fourth feature processing unit comprises a feature encoder PVTE4, a transition layer T4, a gate unit G4 and a feature aggregation decoder FAD4; the tensor X is processed by the feature encoders PVTE1-PVTE4 in sequence to obtain first to fourth feature tensors;
The fourth feature tensor is processed by the pyramid pooling module PPM to obtain fourth high-level semantic information; the fourth feature tensor, after being processed by the transition layer T4, is concatenated with the fourth feature tensor and input to the gate unit G4, whose output is multiplied with the output of the transition layer T4 to obtain fourth effective encoder information; the fourth effective encoder information is added to the fourth high-level semantic information and input to the feature aggregation decoder FAD4, which outputs a fourth decoded feature vector;
The first feature tensor is concatenated with the second decoded feature vector output by FAD2 of the second feature processing unit and input to the gate unit G1; the first feature tensor, after being processed by the transition layer T1, is multiplied with the output of the gate unit G1 to obtain first effective encoder information; the first effective encoder information, the fourth high-level semantic information and the second decoded feature vector are added together and input to the feature aggregation decoder FAD1, which outputs the first decoded feature vector as the saliency map;
Step 3, detecting the saliency map: the input tensor is processed by the PVT gated network to obtain the saliency map.
In this embodiment, the sizes of the first to fourth feature tensors are 96×96×64, 48×48×128, 24×24×320 and 12×12×512, respectively;
The feature encoder PVTEi and the decoder feature FADi+1 are integrated, and the gate value is then computed through convolution, activation and pooling operations:

g_i = P(S(Conv(Cat(PVTE_i, FAD_{i+1}))))

where Cat(·) is concatenation along the channel axis, Conv(·) is a convolution operation, S(·) is the element-wise sigmoid function, and P(·) is global average pooling;
The gate values are applied to weight the transition-layer features T1-T4; the transition layers T1-T4 are generated by reducing the dimensionality of the feature encoder outputs PVTE1-PVTE4 with 3×3 convolutions. Through the multi-level gate units, the information flowing from the different encoder blocks to the decoder can be suppressed and balanced; the multi-level gate units significantly suppress the interference from each encoder block and enhance the contrast between salient and non-salient regions.
The pyramid pooling module PPM is introduced at the top of the encoder, and the high-level semantic features learned by the feature encoder PVTE4 are input into the pyramid pooling module PPM; multi-scale pooled features are obtained through different pooling operations, further enlarging the receptive field, collecting global contextual information, and capturing the exact positions of salient objects more accurately;
The pyramid pooling module PPM performs adaptive average pooling at four different scales on the fourth feature tensor, obtaining four feature maps of sizes 1×1, 2×2, 3×3 and 6×6; a 1×1 convolution reduces the channels at each level to 1/4 of the original; the size before pooling is recovered by bilinear interpolation, and the results are concatenated along the channel dimension together with the fourth feature tensor, finally outputting a composite feature map that blends multiple scales, thereby taking both global semantic information and local detail into account; the features of the transition layers T1-T4 at different levels are continuously fused by element-wise addition;
The encoded features obtained by the feature encoder PVTEi are input into the transition layer Ti; to reduce the number of parameters, the transition layer Ti reduces the number of channels and feeds the result to the gate unit Gi; the PPM features are reduced in dimension and upsampled by bilinear interpolation so that their dimensions and scale match those of Ti; the effective encoder information, the output of the pyramid pooling module PPM and the output of the feature aggregation decoder FADi+1 are fused through element-wise addition and convolution layers and passed into the feature aggregation decoder FADi; the output process of each decoder layer can be expressed as:

FAD_i = FAD(g_i × T_i + Up(PPM) + FAD_{i+1})
where the feature aggregation decoder FAD1 outputs a single-channel feature map of the same size as the input image;
To prevent the high-level semantic information from being diluted along the top-down path, the high-level features provided by the pyramid pooling module PPM are aggregated directly into the feature maps of every feature level, providing multi-scale information for the decoders at all levels;
After the various features have been continuously fused by element-wise addition, the feature aggregation decoder FAD is introduced to combine the effective encoder features, the high-level features and the features at every level of the top-down path; the fused feature maps are converted into multiple feature spaces to obtain local contextual information at different scales, and this information is then combined so that feature maps of different scales can be fused seamlessly;
The feature aggregation decoder FAD proceeds as follows:
The feature map fused by element-wise addition is taken as input, and a pooling trick converts the fused feature map into multiple feature spaces: the input feature map is first average-pooled at the three down-sampling rates {2, 4, 8} to obtain feature maps of three different sizes;
The three feature maps of different sizes are upsampled back to the original size, so that all branches have the same size;
The four pixel branches, including the input of the feature aggregation decoder FAD, are added together to obtain the integrated feature map;
This map is fed into a first 3×3 convolution layer, which leaves the scale and dimension unchanged;
A second 3×3 convolution layer then reduces the dimension, so that the output of the feature aggregation decoder FADi (i = 2, 3, 4) has the same dimension as the output of the lower-level transition layer Ti-1, while the dimension of FADi (i = 1) is reduced directly to 1.
Finally, the feature aggregation decoder FADi (i = 2, 3, 4) is upsampled by bilinear interpolation to match the scale of the transition layer Ti-1, and FADi (i = 1) is upsampled 4×, bringing the scale to 384×384.
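Collecting the five steps above, a minimal PyTorch sketch of Embodiment 1's FAD might read as follows; ch, out_ch and up_factor are set per level by the caller as described (out_ch = 1 and up_factor = 4 for FAD1, otherwise the width of Ti-1 and up_factor = 2):

```python
import torch.nn as nn
import torch.nn.functional as F

class FAD(nn.Module):
    """Feature aggregation decoder of Embodiment 1 (illustrative sketch):
    average-pool the fused input at rates {2, 4, 8}, upsample each branch
    back, add the four branches, apply two 3x3 convolutions, then upsample."""
    def __init__(self, ch, out_ch, up_factor=2):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)      # keeps dims
        self.conv2 = nn.Conv2d(ch, out_ch, kernel_size=3, padding=1)  # reduces dims
        self.up_factor = up_factor

    def forward(self, x):                       # x: element-wise fused feature map
        h, w = x.shape[2:]
        agg = x                                 # branch 1: the FAD input itself
        for s in (2, 4, 8):                     # three down-sampled branches
            y = F.avg_pool2d(x, kernel_size=s, stride=s)
            agg = agg + F.interpolate(y, size=(h, w), mode='bilinear',
                                      align_corners=False)
        y = self.conv2(self.conv1(agg))         # two 3x3 convolution layers
        return F.interpolate(y, scale_factor=self.up_factor,
                             mode='bilinear', align_corners=False)
```

Embodiment 2 would differ only in the pooling rates ({4, 16, 64}), and Embodiment 3 in replacing the final bilinear upsampling with a super-resolution method.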
Embodiment 2
This embodiment differs from Embodiment 1 in the processing of the feature aggregation decoder FAD:
The feature map fused by element-wise addition is taken as input, and a pooling trick converts the fused feature map into multiple feature spaces: the input feature map is first average-pooled at the three down-sampling rates {4, 16, 64} to obtain feature maps of three different sizes, which are then upsampled back to the original size so that all branches have the same size.
Embodiment 3
This embodiment differs from Embodiments 1 and 2 in that the feature aggregation decoder FAD first average-pools the input feature map at the three down-sampling rates {2, 4, 8}, and then upsamples by a super-resolution method to obtain an output feature map of the original size; the edges of a feature map upsampled by the super-resolution method are sharper.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310061980.5A CN116664875A (en) | 2023-01-16 | 2023-01-16 | Salient object detection method based on a PVT gated network |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116664875A true CN116664875A (en) | 2023-08-29 |
Family
ID=87719492
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116664875A (en) |