CN117689541A - Multi-region classification video super-resolution reconstruction method with temporal redundancy optimization - Google Patents
Multi-region classification video super-resolution reconstruction method with temporal redundancy optimization
- Publication number
- CN117689541A (application CN202311671815.8A)
- Authority
- CN
- China
- Prior art keywords
- module
- feature
- upsampling
- blocks
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4053—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a multi-region classification video super-resolution reconstruction method with temporal redundancy optimization, and belongs to the field of computer vision. The invention proposes an effective multi-region block-partitioning method that judges whether an image block can provide useful inter-frame information and therefore whether that information should be propagated. To further improve the reconstruction quality of the network, the upsampling process is also improved. Previous upsampling methods enlarge the image directly by a factor of 4 with bilinear interpolation; the invention instead cascades two pixelshuffle (×2) layers. This two-stage progressive upsampling strategy reduces the noise introduced by direct 4× bilinear upsampling and thereby further improves the performance of video super-resolution reconstruction.
Description
Technical field
The invention belongs to the field of computer vision and relates to a multi-region classification video super-resolution reconstruction method with temporal redundancy optimization.
Background art
Solving the video super-resolution (VSR) problem with deep learning has become an important and widely studied topic, and many researchers have proposed deep-learning-based networks for it. Current research on video super-resolution reconstruction can be grouped into two categories according to how inter-frame information is used, namely whether frames are aligned: aligned methods and non-aligned methods. It can also be divided, according to the framework, into sliding-window frameworks and recurrent frameworks. These methods have been widely applied and developed in the field of video super-resolution reconstruction.
Alignment methods extract motion information to align adjacent frames with the current frame; the most common techniques are motion estimation and motion compensation (MEMC) and deformable convolution. Motion estimation and motion compensation have been widely studied for video processing. In video super-resolution they are mainly used to capture the temporal correlation between consecutive LR (Low Resolution) frames. Typically, the motion-estimation module takes two frames as input and produces an optical-flow vector field that represents the motion between them. The motion-compensation module then warps the images according to this motion information so that adjacent frames are spatially aligned with the current frame.
The enhanced deformable video restoration (EDVR) network proposed by Xintao Wang et al. is a good example of applying deformable convolution. When the video to be restored contains occlusion, large motion and severe blur, optical flow becomes inaccurate. EDVR therefore introduces two key modules: a pyramid, cascading and deformable alignment module (PCD) and a temporal-spatial attention fusion module (TSA), which respectively handle large motion in the video and fuse multiple frames effectively. The temporally deformable alignment network (TDAN) proposed by Yapeng Tian et al. applies deformable convolution to the target frame and its neighboring frames to obtain the corresponding offsets, and then warps the neighboring frames according to these offsets so that they are aligned with the target frame.
Non-aligned methods do not require aligning adjacent frames; they mainly exploit spatial or spatio-temporal information for feature extraction. Non-aligned methods fall into four types: 2D convolution methods (2D Conv), 3D convolution methods, recurrent convolutional neural networks, and non-local-network-based methods. The temporal group attention (TGA) mechanism proposed by Takashi Isobe et al. fuses spatio-temporal information hierarchically through frame-rate groups, using intra-group fusion with 2D dense blocks and 3D units together with an inter-group attention mechanism to generate the final high-resolution image.
In a sliding-window framework, the network receives a fixed number of frames at a time and can only process a few frames per pass; each frame of the video is restored using frames within a short temporal window. Early methods predicted optical flow between low-resolution frames and performed spatial warping for alignment; later methods adopted more sophisticated implicit alignment. For example, TDAN uses deformable convolutions (DCNs) to align frames at different layers, and EDVR further applies DCNs in a multi-scale manner to achieve more precise alignment. Recurrent frameworks try to exploit long-term dependencies by propagating latent features, carrying useful information from earlier frames to later ones and thereby aiding the restoration of subsequent video frames. For example, RSDN adopts unidirectional propagation with recurrent detail-structural blocks and a hidden-state adaptation module to improve robustness to appearance changes and error accumulation. Kelvin C. K. Chan et al. proposed BasicVSR, which demonstrated the importance of bidirectional over unidirectional propagation for better use of temporal features. BasicVSR++, also proposed by Kelvin C. K. Chan et al., further improves BasicVSR with second-order grid propagation and flow-guided deformable alignment, achieving a substantial increase in performance with roughly the same number of parameters.
Summary of the invention
In view of this, the object of the present invention is to provide a multi-region classification video super-resolution reconstruction method with temporal redundancy optimization.
To achieve the above object, the present invention provides the following technical solution:
A multi-region classification video super-resolution reconstruction method with temporal redundancy optimization, comprising:
In the TROMCN network structure, given a low-quality video frame sequence X_LQ ∈ R^(T×H×W×C), where T, H, W and C denote the number of frames, height, width and number of channels of the video, the goal of video super-resolution is to reconstruct a high-quality video frame sequence Y_HQ ∈ R^(T×sH×sW×C), where s denotes the scale factor. Before feature extraction, the input video frames are divided into 64×64 blocks, taken with an overlap of 8 pixels. In the feature extraction module, a residual Swin Transformer module extracts the features F_SF:
F_SF = H_RSTB(X_LQ)
where H_RSTB denotes the feature extraction module. The extracted features then pass through feature alignment and fusion to obtain:
F_MSF = H_MRC(F_SF)
where H_MRC denotes the feature alignment and fusion module, which propagates features in a second-order grid pattern and aligns them with flow-guided deformable alignment. The learned features are then passed through a cascaded upsampling module to obtain:
F_UP = H_UP(F_MSF)
where H_UP denotes the upsampling operation of the upsampling module, implemented as two cascaded ×2 sub-pixel convolution (pixelshuffle) layers, and F_UP is the feature obtained after upsampling. F_UP is then fed into the reconstruction layer to generate the final super-resolution video frame:
Y_HQ = H_R(F_UP) = H_TROMCN(X_LQ)
where H_R and H_TROMCN denote the reconstruction-layer operation and the TROMCN network proposed by this invention, respectively.
For training, the Charbonnier loss L = sqrt(‖Y_HQ − Y_GT‖² + ε²) is used for optimization, where Y_HQ denotes the reconstructed image, Y_GT the ground-truth high-resolution image, and ε a constant.
Optionally, in the feature extraction module, for a video frame x_t, the frame is divided into 64×64 blocks with an overlap of 8 pixels before feature extraction, and the features are obtained after extraction by the RSTB module. A residual Swin Transformer module is used to extract the features, capturing long-range dependencies, aggregating high-frequency information and extracting contextually related feature representations. Taking the partitioned image block as input, the features pass through an ordinary convolution layer and then through L residual-structured Swin Transformer modules and a convolution block, yielding the extracted features. A single STL module denotes a Swin Transformer Layer. The formula of the feature extraction stage is:
F_SF^i = H_RSTB(H_CONV(x_t^i))
where F_SF^i denotes the features extracted by the shallow feature extraction module, H_RSTB(·) the residual Swin Transformer module, H_CONV(·) an ordinary convolution layer, and x_t^i the partitioned image block.
Optionally, in the feature alignment and fusion module, the feature propagation module adopts dynamic propagation, in which every block of the current frame receives information from different frames; two propagation branches, carrying the N overlapping blocks and the hidden states of the corresponding blocks, are used to recover valid information from blocks of different frames.
The mean value of the optical flow is used to represent the motion state, with the formula:
M_i = mean(|flow(b_t^i, b_(t-1)^i)|)
where flow(·) denotes optical-flow estimation, |·| the absolute value, and mean the averaging operation; M_i represents the motion state between the reference block b_t^i and the corresponding adjacent block b_(t-1)^i. By setting a threshold γ and comparing it with the computed optical flow, it is judged whether the neighbouring block of the current block can provide valid inter-frame information. The comparison rule is to propagate the block information if M_i > γ and to discard it otherwise.
When the computed optical flow is greater than the threshold, the adjacent block is considered to provide useful information for restoring the current block, so its information is adopted and propagated to the current block. Conversely, when the computed optical flow is smaller than the threshold, the per-pixel motion between the two blocks is small, many pixels coincide, temporal redundancy exists, and blur and noise tend to appear, so the information of that frame is discarded to avoid the loss of useful information. Two thresholds are set, γ1 = 0.2 and γ2 = 0.3, corresponding to first-order and second-order propagation respectively. The updated hidden states are then propagated to the next frame. After feature propagation, the features of the N blocks are obtained, an RSTB module is appended, and the N blocks are re-assembled into a complete feature map h_t.
Optionally, the upsampling methods fall into three categories: upsampling based on linear interpolation, upsampling based on deep learning, and unpooling methods.
The beneficial effect of the present invention is that the proposed network has stronger feature extraction and feature alignment capabilities as well as the ability to reduce temporal redundancy, and can reconstruct higher-quality high-resolution images.
Other advantages, objects and features of the present invention will be set forth to some extent in the following description and, to some extent, will be apparent to those skilled in the art upon examination of what follows, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by the following description.
Description of the drawings
To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in detail below with reference to the accompanying drawings, in which:
Figure 1 shows the overall network structure of TROMCN;
Figure 2 shows the network structure of the RSTB;
Figure 3 shows the network structure of the STL;
Figure 4 shows the feature forward-propagation structure;
Figure 5 shows the cascaded upsampling;
Figure 6 shows the 4× visual result on DTVIT 046;
Figure 7 shows the 4× visual result on DTVIT 072;
Figure 8 shows the 4× visual result on REDS, 000;
Figure 9 shows the 4× visual result on REDS, 015;
Figure 10 shows the 4× visual result on Vid4, city;
Figure 11 shows the 4× visual result on Vid4, walk;
Figure 12 shows the 4× visual result on UDM10, archpeople;
Figure 13 shows the 4× visual result on UDM10, photography;
Figure 14 shows the 4× visual result on Vimeo, Sequence 00001, Clip 0837;
Figure 15 shows the 4× visual result on Vimeo, Sequence 00010, Clip 0573.
Detailed description of the embodiments
The following describes the embodiments of the present invention through specific examples; those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. The invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the invention. It should be noted that the illustrations provided in the following embodiments only describe the basic concept of the invention schematically, and the following embodiments and their features can be combined with each other provided there is no conflict.
The drawings are for illustrative purposes only and represent schematic diagrams rather than physical drawings; they should not be understood as limiting the invention. To better illustrate the embodiments, some components in the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it is understandable to those skilled in the art that some well-known structures and their descriptions may be omitted from the drawings.
In the drawings of the embodiments, identical or similar reference numerals correspond to identical or similar components. In the description of the invention, it should be understood that orientation or positional terms such as "upper", "lower", "left", "right", "front" and "rear" are based on the orientations or positional relationships shown in the drawings; they are used only to facilitate and simplify the description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Terms describing positional relationships in the drawings are therefore illustrative only and cannot be understood as limiting the invention; those of ordinary skill in the art can understand the specific meaning of the above terms according to the specific situation.
Existing video super-resolution methods focus on enhancing feature propagation and on temporal and spatial alignment, while neglecting the impact of temporal redundancy on the reconstruction result. The difference between video super-resolution and image super-resolution is that in video super-resolution, consecutive video frames are fed into a neural network and, after learning, one or more super-resolution images are produced. There is a large amount of temporal redundancy between consecutive video frames; if the features extracted from all pixels are propagated indiscriminately as valid features, noise is easily introduced and interferes with reconstruction. Because different components of an image have different frequency characteristics, they need to be treated differently during high-resolution reconstruction. In summary, to address the shortcomings of existing methods, the present invention proposes a temporal-redundancy-optimized multi-region classification video super-resolution reconstruction method (Temporal Redundancy Optimization and Multi-Region Classification Network, TROMCN). The input video frames are processed in blocks; by judging the optical flow between blocks, the blocks are classified into static regions and dynamic regions and handled differently, yielding more accurate reconstruction results.
The TROMCN network proposed by the invention first uses RSTB (Residual Swin Transformer Block) instead of traditional residual blocks for feature extraction and reconstruction. Compared with CNN (Convolutional Neural Network) based residual blocks, the residual Swin Transformer module can capture long-range dependencies and aggregate relevant high-frequency information. Second, to solve the noise problem caused by using image features indiscriminately, the invention proposes an effective multi-region block-partitioning method that judges whether an image block can provide valid inter-frame information and therefore whether that information should be propagated. To further improve the reconstruction quality of the network, the upsampling process is also improved. Previous upsampling methods enlarge the image directly by a factor of 4 with bilinear interpolation; the invention instead cascades two pixelshuffle (×2) layers. This two-stage progressive upsampling strategy reduces the noise introduced by direct 4× bilinear upsampling and thereby further improves the performance of video super-resolution reconstruction.
The TROMCN network structure proposed by the invention consists of four parts: a feature extraction module, a feature alignment and fusion module, an upsampling module and a reconstruction module. The overall model structure is shown in Figure 1.
1. Overall process
In the TROMCN network structure, given a low-quality video frame sequence X_LQ ∈ R^(T×H×W×C), where T, H, W and C denote the number of frames, height, width and number of channels of the video, the goal of video super-resolution is to reconstruct a high-quality video frame sequence Y_HQ ∈ R^(T×sH×sW×C), where s denotes the scale factor. Before feature extraction, the input video frames are divided into 64×64 blocks (taken with an overlap of 8 pixels). In the feature extraction module, a residual Swin Transformer module extracts the features F_SF:
F_SF = H_RSTB(X_LQ)
where H_RSTB denotes the feature extraction module. The extracted features then pass through feature alignment and fusion to obtain:
F_MSF = H_MRC(F_SF)
where H_MRC denotes the feature alignment and fusion module, which propagates features in a second-order grid pattern and aligns them with flow-guided deformable alignment. The learned features are then passed through a cascaded upsampling module to obtain:
F_UP = H_UP(F_MSF)
where H_UP denotes the upsampling operation of the upsampling module; the upsampling used here cascades two ×2 sub-pixel convolution (pixelshuffle) layers, and F_UP is the feature obtained after upsampling. F_UP is then fed into the reconstruction layer to generate the final super-resolution video frame:
Y_HQ = H_R(F_UP) = H_TROMCN(X_LQ)
where H_R and H_TROMCN denote the reconstruction-layer operation and the TROMCN network proposed by this invention, respectively.
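For orientation, the composition of the four stages can be sketched in PyTorch as below. This is only a structural sketch: the module bodies (the residual Swin Transformer blocks, the second-order grid propagation with flow-guided deformable alignment, and the reconstruction layer) are replaced by plain convolutions, the upsampling is left as a placeholder (the cascaded pixelshuffle form is sketched in Section 4), and all channel sizes are assumptions rather than values taken from the patent.

```python
import torch
import torch.nn as nn

class TROMCNSketch(nn.Module):
    """Structural stand-in for Y_HQ = H_R(H_UP(H_MRC(H_RSTB(X_LQ))))."""
    def __init__(self, channels=3, feat=64, scale=4):
        super().__init__()
        self.extract = nn.Conv2d(channels, feat, 3, padding=1)      # placeholder for H_RSTB
        self.align_fuse = nn.Conv2d(feat, feat, 3, padding=1)       # placeholder for H_MRC
        self.up = nn.Upsample(scale_factor=scale, mode='nearest')   # placeholder for H_UP (see Section 4)
        self.reconstruct = nn.Conv2d(feat, channels, 3, padding=1)  # placeholder for H_R

    def forward(self, x_lq):             # x_lq: (T, C, H, W) low-quality frames
        f_sf = self.extract(x_lq)        # F_SF  = H_RSTB(X_LQ)
        f_msf = self.align_fuse(f_sf)    # F_MSF = H_MRC(F_SF)
        f_up = self.up(f_msf)            # F_UP  = H_UP(F_MSF)
        return self.reconstruct(f_up)    # Y_HQ  = H_R(F_UP)
```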
For training, the network of the present invention is optimized with the Charbonnier loss L = sqrt(‖Y_HQ − Y_GT‖² + ε²), where Y_HQ denotes the reconstructed image, Y_GT the ground-truth high-resolution image, and ε a small constant.
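A minimal PyTorch form of this loss could look as follows; the concrete value of ε is an assumption, since the text only states that it is a small constant.

```python
import torch

def charbonnier_loss(y_hq: torch.Tensor, y_gt: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """L = sqrt((Y_HQ - Y_GT)^2 + eps^2), averaged over all pixels."""
    return torch.sqrt((y_hq - y_gt) ** 2 + eps ** 2).mean()
```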
2. Feature extraction module
Taking a video frame x_t as an example, before feature extraction the frame is divided into 64×64 blocks (with an overlap of 8 pixels), and the features are obtained after extraction by the RSTB module. Existing solutions usually use convolutional neural networks or recurrent neural networks to extract image features, and both suffer performance degradation due to long-term dependency problems. The present invention uses a residual Swin Transformer module to extract features, capture long-range dependencies and aggregate relevant high-frequency information, producing contextually related feature representations with higher training efficiency. The structure of the residual Swin Transformer module is shown in Figure 2. The module takes the partitioned image block as input; after an ordinary convolution layer, L residual-structured Swin Transformer modules and a convolution block extract the features, giving the extracted features. This residual design provides an identity-based connection from the different modules to the reconstruction module, promoting the aggregation of features from different levels. A single STL module in Figure 2 denotes a Swin Transformer Layer, whose structure is shown in Figure 3. The formula of this feature extraction stage is:
F_SF^i = H_RSTB(H_CONV(x_t^i))
where F_SF^i denotes the features extracted by the shallow feature extraction module, H_RSTB(·) the residual Swin Transformer module, H_CONV(·) an ordinary convolution layer, and x_t^i the partitioned image block.
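As an illustration of the block-partitioning step, the sketch below cuts a frame into 64×64 blocks whose neighbours overlap by 8 pixels. The stride of 56 pixels and the zero-padding at the image border are assumptions about details the description does not spell out.

```python
import torch
import torch.nn.functional as F

def split_into_blocks(frame: torch.Tensor, block: int = 64, overlap: int = 8) -> torch.Tensor:
    """frame: (C, H, W) -> blocks: (N, C, block, block), adjacent blocks overlapping by `overlap` pixels."""
    c, h, w = frame.shape
    stride = block - overlap
    # Zero-pad so the sliding window also covers the right and bottom borders.
    pad_h = (stride - (h - block) % stride) % stride if h > block else block - h
    pad_w = (stride - (w - block) % stride) % stride if w > block else block - w
    frame = F.pad(frame, (0, pad_w, 0, pad_h))
    patches = frame.unfold(1, block, stride).unfold(2, block, stride)  # (C, nH, nW, block, block)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c, block, block)
```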
3. Feature alignment and fusion module
In the network proposed by the invention, second-order grid propagation is used for feature enhancement, allowing spatio-temporal information to be exploited more effectively across unaligned video frames. The features extracted in the previous step are fed into a flow-guided deformable alignment module, which increases offset diversity while overcoming the offset-overflow problem. The feature propagation module proposed by the invention adopts dynamic propagation, in which every block of the current frame can receive information from different frames; to achieve this, two propagation branches, carrying the N overlapping blocks and the hidden states of the corresponding blocks, are used to recover valid information from blocks of different frames. Figure 4 shows the feature forward-propagation structure.
The invention proposes a judgment module for detecting temporal redundancy, which decides whether an image block can provide valid inter-frame information and therefore whether that information should be propagated. Since optical flow is a measure describing the motion of objects, the mean value of the optical flow is used to represent the motion state, with the formula:
M_i = mean(|flow(b_t^i, b_(t-1)^i)|)
where flow(·) denotes optical-flow estimation, |·| the absolute value, and mean the averaging operation; M_i represents the motion state between the reference block b_t^i and the corresponding adjacent block b_(t-1)^i. The optical flow is computed with the traditional DIS algorithm, since it adds only a slight computational cost. A threshold γ is set and compared with the computed optical flow to judge whether the neighbouring block of the current block can provide valid inter-frame information; the comparison rule is to propagate the block information if M_i > γ and to discard it otherwise.
When the computed optical flow is greater than the threshold, the adjacent block is considered to provide useful information for restoring the current block, so its information is adopted and propagated to the current block. Conversely, when the computed optical flow is smaller than the threshold, the per-pixel motion between the two blocks is small, many pixels coincide, temporal redundancy exists, and blur and noise tend to appear, so the information of that frame is discarded to avoid the loss of useful information. Since the invention uses second-order grid propagation, two thresholds are set, γ1 = 0.2 and γ2 = 0.3, corresponding to first-order and second-order propagation respectively. The updated hidden states are then propagated to the next frame. Finally, after feature propagation, the features of the N blocks are obtained, an RSTB module is appended, and the N blocks are re-assembled into a complete feature map h_t. Repeated experiments show that this method improves the performance of the original model and generalizes well.
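A hedged sketch of the redundancy check is given below. The optical flow between a reference block and its neighbour is assumed to have been computed already (e.g. with the DIS algorithm mentioned above), and "discarding" the neighbour's information is modelled here simply by zeroing its hidden state, which is one possible reading of the description rather than the patent's exact mechanism.

```python
import torch

GAMMA1, GAMMA2 = 0.2, 0.3  # thresholds for first- and second-order propagation

def block_is_informative(flow: torch.Tensor, gamma: float) -> bool:
    """flow: (2, h, w) optical flow between a reference block and a neighbouring block."""
    return flow.abs().mean().item() > gamma

def gate_hidden_state(hidden: torch.Tensor, flow: torch.Tensor, gamma: float) -> torch.Tensor:
    """Propagate the neighbouring block's hidden state only when it carries enough motion;
    otherwise suppress it so the temporally redundant, noise-prone information is not passed on."""
    return hidden if block_is_informative(flow, gamma) else torch.zeros_like(hidden)
```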
4. Cascaded sub-pixel convolution upsampling
Upsampling is a necessary step for image super-resolution. In previous video super-resolution methods, upsampling roughly falls into three categories: upsampling based on linear interpolation, upsampling based on deep learning, and unpooling methods. Earlier video super-resolution methods often used bilinear interpolation to enlarge the image directly by a factor of 4, but this does not take into account the interaction between the gray values of neighbouring points, so the high-frequency components of the image are easily lost and the edges become somewhat blurred. Many later video super-resolution methods adopted sub-pixel convolution for upsampling; this no longer requires enlarging the input by linear interpolation at the start, so smaller convolution kernels can achieve good results, and convolutional learning can learn a better fit than hand-crafted designs. The present invention improves on the plain use of sub-pixel convolution by cascading two sub-pixel convolution layers; this two-stage progressive upsampling strategy effectively reduces the noise introduced during upsampling.
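A minimal sketch of the cascaded ×2 sub-pixel upsampling could look as follows; the channel width and the activation between stages are assumptions, as the description only fixes the two PixelShuffle(×2) stages. In the full network, a block like this would replace the placeholder upsampling in the overall sketch of Section 1 and sit directly before the reconstruction layer.

```python
import torch
import torch.nn as nn

class CascadedPixelShuffleUp(nn.Module):
    """Two conv + PixelShuffle(2) stages giving an overall 4x enlargement."""
    def __init__(self, feat: int = 64):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(feat, feat * 4, 3, padding=1),
                                    nn.PixelShuffle(2),
                                    nn.LeakyReLU(0.1, inplace=True))
        self.stage2 = nn.Sequential(nn.Conv2d(feat, feat * 4, 3, padding=1),
                                    nn.PixelShuffle(2),
                                    nn.LeakyReLU(0.1, inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (N, feat, H, W) -> (N, feat, 4H, 4W)
        return self.stage2(self.stage1(x))
```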
5. Experimental settings
The network is trained separately on the REDS dataset and the Vimeo-90K dataset. The REDS dataset contains 270 available video sequences, each with one hundred frames. Following the conventional split, the data are divided into a training set (266 sequences) and a test set (4 sequences). Vimeo-90K contains 64612 and 7824 video sequences for training and testing, respectively. Although both datasets are widely used benchmarks, they have different motion conditions. Motion in the Vimeo-90K dataset is generally small: 99% of pixels have a motion magnitude below 10 (for each clip, motion is measured between the 4th and 7th frames). The REDS dataset contains large motion: at least 20% of pixels have a motion magnitude above 10 (for each clip, motion is measured between the 3rd and 5th frames). In addition, the DTVIT dataset is used, which includes live streams, TV programs, live sports, film and television, surveillance-camera footage and advertisements.
To demonstrate the superiority of the proposed method, the proposed network is compared with existing representative state-of-the-art super-resolution reconstruction methods; the compared works are listed in Table 1.
Table 1
The objective evaluation results are shown in Table 2, with the best-performing method in bold. As the table shows, in most cases the PSNR and SSIM of the present invention are the highest, and the reconstruction quality is clearly better than that of several existing representative state-of-the-art video super-resolution reconstruction methods. Figures 6 to 15 show the subjective visual results of the invention. Figures 6 and 7 show that the invention produces sharper and clearer high-resolution video frames, while the other methods fail to recover fine textures and details. Figures 8 to 15 visually present sharper edges and finer details, whereas the other methods produce more blur and lose more detail. In summary, both the objective metrics and the subjective visual results demonstrate that the proposed method has strong advantages in optimizing temporal redundancy, enhancing features, and so on.
Table 2
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the invention can be modified or equivalently replaced without departing from the purpose and scope of the technical solution, and all such modifications shall fall within the scope of the claims of the present invention.
Claims (4)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311671815.8A CN117689541A (en) | 2023-12-07 | 2023-12-07 | Multi-region classification video super-resolution reconstruction method with temporal redundancy optimization |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117689541A true CN117689541A (en) | 2024-03-12 |
Family
ID=90136496
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311671815.8A Pending CN117689541A (en) | 2023-12-07 | 2023-12-07 | Multi-region classification video super-resolution reconstruction method with temporal redundancy optimization |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117689541A (en) |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118279150A (en) * | 2024-04-11 | 2024-07-02 | 西南交通大学 | Road surface texture reconstruction method based on cyclic recursion super-resolution network |
| CN119579412A (en) * | 2024-11-19 | 2025-03-07 | 阜阳师范大学 | A multi-scale modulated video super-resolution reconstruction method based on deformable transformer alignment |
| CN119579412B (en) * | 2024-11-19 | 2025-08-08 | 阜阳师范大学 | Multi-scale modulation video super-resolution reconstruction method based on deformable transformation alignment |
| CN119624782A (en) * | 2025-02-14 | 2025-03-14 | 武汉大学 | A satellite video super-resolution reconstruction method and system for multiple degradation processes |
| CN120197063A (en) * | 2025-05-26 | 2025-06-24 | 自然语义(青岛)科技有限公司 | Industrial equipment fault prediction and diagnosis system based on deep learning |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111709895B (en) | Image blind deblurring method and system based on attention mechanism | |
| CN117689541A (en) | Multi-region classification video super-resolution reconstruction method with temporal redundancy optimization | |
| CN109102462B (en) | Video super-resolution reconstruction method based on deep learning | |
| CN105847804B (en) | A kind of up-conversion method of video frame rate based on sparse redundant representation model | |
| CN103514580B (en) | Method and system for obtaining super-resolution images with optimized visual experience | |
| CN110136062B (en) | A Super-Resolution Reconstruction Method for Joint Semantic Segmentation | |
| CN103402098B (en) | A kind of video frame interpolation method based on image interpolation | |
| CN106851046A (en) | Video dynamic super-resolution processing method and system | |
| CN111028150A (en) | A fast spatiotemporal residual attention video super-resolution reconstruction method | |
| CN101247489A (en) | A method for real-time reproduction of digital TV details | |
| CN114494050A (en) | A Self-Supervised Video Deblurring and Image Interpolation Method Based on Event Camera | |
| CN111901532B (en) | Video stabilization method based on recurrent neural network iteration strategy | |
| CN113610912B (en) | System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction | |
| CN116862773A (en) | A video super-resolution reconstruction method applied to complex scenes | |
| CN110443761A (en) | A single image rain removal method based on multi-scale aggregation features | |
| CN107613299A (en) | A Method of Using Generative Networks to Improve Frame Rate Upconversion | |
| CN107734333A (en) | A kind of method for improving video error concealing effect using network is generated | |
| CN105427243A (en) | Video super-resolution reconstruction method based on adaptive interpolation kernel learning | |
| CN115330592A (en) | Video super-resolution method based on optical flow enhancement algorithm | |
| CN112435165B (en) | Two-stage video super-resolution reconstruction method based on generation countermeasure network | |
| CN117333363A (en) | Video super-resolution method based on long-range-short range combination | |
| CN103400346B (en) | The video super-resolution method of autoregression model is guided based on adaptive super-pixel | |
| Kai et al. | Video super-resolution via event-driven temporal alignment | |
| CN113850718A (en) | Video synchronization space-time super-resolution method based on inter-frame feature alignment | |
| CN101247472A (en) | A Deinterlacing Method Based on Motion Compensation |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |