
CN119624782A - A satellite video super-resolution reconstruction method and system for multiple degradation processes - Google Patents

A satellite video super-resolution reconstruction method and system for multiple degradation processes

Info

Publication number
CN119624782A
Authority
CN
China
Prior art keywords
module
feature
resolution
satellite video
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202510161817.5A
Other languages
Chinese (zh)
Other versions
CN119624782B (en)
Inventor
李路
王密
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202510161817.5A
Publication of CN119624782A
Application granted
Publication of CN119624782B
Legal status: Active


Classifications

    • G06T 3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06T 3/4046 Scaling of whole images or parts thereof using neural networks
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/593 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Processing (AREA)

Abstract


The present invention provides a satellite video super-resolution reconstruction method and system for multiple degradation processes. The overall flow of the method is as follows: a satellite video random degradation module generates corresponding low-resolution video data; a feature extraction module then extracts features from the low-resolution data; the extracted features are propagated through every frame of the video sequence with a high-order grid bidirectional propagation scheme, and a feature alignment module aligns the features of the video data; finally, the aligned features undergo pixel reorganization and are added to the up-sampled low-resolution features to obtain the reconstructed high-resolution image. Experiments on three satellite video datasets show that the present invention can effectively restore the high fidelity of satellite videos, significantly improve the reconstruction quality of remote sensing images, and also provides a certain denoising capability.

Description

Satellite video super-resolution reconstruction method and system for multiple degradation processes
Technical Field
The invention belongs to the field of high-resolution optical satellite video image processing, and particularly relates to a satellite video super-resolution reconstruction method and system for multiple degradation processes.
Background
With the rapid development of remote sensing technology, users can acquire Earth observation data with high spatial and temporal resolution more easily, and traditional static Earth observation data no longer meet their needs. Video satellites are a new type of Earth observation satellite developed in recent years; compared with conventional Earth observation satellites, they can "stare" at a specific target for a long period of time and capture time-series images of that target at a certain frame rate. Compared with traditional static remote sensing images, satellite video data not only have sub-meter spatial resolution but also video-level temporal resolution; they provide continuous information about specific targets, enable highly dynamic spatio-temporal Earth observation, and are widely used in dynamic applications such as change detection, target tracking and traffic monitoring. Research on processing video satellite data is therefore of great significance for land monitoring, disaster relief, national defense security, smart cities and the like. However, the spatial resolution of satellite video data is affected by complex environmental factors during imaging and data transmission, such as atmospheric scattering, sensor jitter and data compression; high-frequency information may be lost and blurring may occur, which greatly degrades the performance of subsequent applications. Improving the spatial resolution of satellite video is therefore very important for human perception and for downstream tasks. Compared with improving resolution through hardware, super-resolution reconstruction technology can greatly reduce the maintenance, transmission and storage costs of satellites.
Super-resolution reconstruction is a classical low-level task in computer vision that reconstructs a high-resolution image from a low-resolution image to improve image quality. Video super-resolution reconstruction requires super-resolving each frame of the video and also imposes requirements on the consistency between frames. Broadly speaking, video super-resolution reconstruction can be regarded as an extension of image super-resolution reconstruction, and a video can be processed frame by frame with a single-image super-resolution algorithm. In practice, however, the results of processing video with image super-resolution algorithms are often unsatisfactory, because temporal information is ignored, which causes artifacts and temporal inconsistency. Video carries more information than a single image and has an additional temporal dimension, so designing a super-resolution reconstruction algorithm for video is more challenging. To better exploit this temporal information, researchers usually introduce frame alignment to eliminate the effects of object or background motion in the video. Compared with natural-scene video, satellite video has the following characteristics: first, its spatial resolution is lower and texture information is scarce; second, its field of view is larger than that of natural-scene video, with more varied scenes and higher information density; third, moving objects in satellite video vary in scale and exhibit more complex motion. Therefore, although deep-learning-based video super-resolution reconstruction has developed greatly in recent years, these general methods are not suitable for direct application to satellite video, and more effort is needed to develop methods suited to satellite video super-resolution reconstruction.
Existing methods for acquiring low-resolution data mainly follow three approaches. 1) Synthesize the low-resolution image from the corresponding high-resolution image based on a degradation assumption such as bicubic downsampling; however, when the test low-resolution image does not satisfy this assumption, the performance of the trained super-resolution model may drop sharply. 2) Build low-resolution/high-resolution image pairs from real remote sensing images collected at the same location; however, this approach suffers from problems such as land-cover changes and spectral gaps between the two images. To address these issues, a third approach trains a super-resolution reconstruction model with unpaired low-resolution and high-resolution images. These methods typically combine a degradation network and a generator, where the low-resolution image is first processed and the generator can then be trained in a self-supervised manner; this makes it possible to use unpaired low- and high-resolution images, so training data are easier to collect. However, training of these methods is often unstable because of the domain gap between low-resolution and high-resolution images and the lack of intermediate supervision. For applying remote sensing super-resolution reconstruction to the real remote sensing world, recent studies have explored blind super-resolution that considers the various degradations present in actual scenes, synthesizing training low-resolution/high-resolution image pairs by introducing an anisotropic Gaussian blur kernel and additive white Gaussian noise (AWGN); however, a degradation model based on the anisotropic Gaussian blur assumption is too simple for actual scenes, which limits its application.
Furthermore, image or feature alignment of adjacent frames is a key and difficult problem in video super-resolution reconstruction algorithms. Because of the movement of ground objects such as vehicles, airplanes and ships, and the viewpoint changes caused by satellite motion, direct fusion introduces errors and reduces performance. Most current methods therefore introduce an alignment operation, which helps to accurately locate the missing information in adjacent frames. There are two types of alignment: image alignment and feature alignment. Image alignment typically uses optical flow, which works well for small-scale motion (such as moving ground objects) but not for large-scale changes (such as background motion). Feature alignment typically uses deformable convolution networks and achieves good performance, but training of deformable convolution networks is unstable. Neither optical flow nor deformable convolution can be applied directly to satellite video, because satellite video contains both moving ground objects and a moving background, and the motion characteristics are less pronounced. Even when satellite video super-resolution reconstruction introduces an alignment operation, alignment errors cannot be eliminated; moreover, because moving objects occlude different regions at different times, information about the relevant regions is missing, so directly fusing multi-frame information easily produces poor results and may even fuse erroneous information, reducing performance.
Disclosure of Invention
Based on the above, the invention provides a satellite video super-resolution reconstruction method and system for multiple degradation processes. First, a degradation model of satellite video data is constructed; it considers both blur kernels estimated from real remote sensing images and blur kernels generated from predefined distributions, so that the low-resolution data stay close to real scenes and their diversity does not depend entirely on the diversity of an external data set. Second, the features extracted from the whole video sequence are propagated with a bidirectional recurrent high-order grid feature propagation network, so that the reconstruction of each frame can use the information of all frames. Finally, a feature alignment module based on spatio-temporal fusion is provided: a spatio-temporal attention mechanism fuses the spatio-temporal feature information between frames, which improves the utilization of key information and reduces the influence of erroneous information, and a deformable convolution aligns the spatio-temporally fused features, reducing the error accumulation of long image sequences over time.
In order to achieve the above aim and technical effects, the technical scheme adopted by the invention is a satellite video super-resolution reconstruction method for multiple degradation processes, comprising the following steps:
Step S1, acquiring a satellite video data set;
Step S2, constructing a satellite video super-resolution reconstruction model for multiple degradation processes, which comprises a satellite video random degradation module, a feature extraction module, a feature alignment module and a pixel reorganization module;
Step S3, passing the satellite video data in the satellite video data set through the satellite video random degradation module to obtain low-resolution frame images, and performing data enhancement operations to obtain low-resolution frame data;
Step S4, passing the low-resolution frame data obtained in step S3 through the feature extraction module to obtain frame image feature information;
Step S5, inputting the frame image feature information obtained in step S4 into the feature alignment module, and performing propagation, aggregation and alignment operations on the input features to obtain aligned features;
Step S6, passing the aligned features obtained in step S5 through the pixel reorganization module to obtain residual features of the high-resolution image;
Step S7, performing an up-sampling operation on the low-resolution frame images obtained in step S3 and adding the result to the residual features of the high-resolution image obtained in step S6 to obtain the final high-resolution reconstructed image.
Further, the satellite video random degradation module in step S3 includes a downsampling module, four random degradation modules and a video compression module. The downsampling module downsamples the image using one of nearest-neighbor interpolation, bilinear interpolation or bicubic interpolation; each of the four random degradation modules degrades the image using one of random blur, motion blur, random noise or sensor tremor; and the video compression module performs intra-frame or inter-frame compression, where intra-frame compression includes JPEG compression, JPEG 2000 compression, PNG compression, WEBP compression, BMP compression, TIFF compression, BPG compression and FLIF compression, and inter-frame compression includes H.263, H.264, H.265, MJPEG, MPEG-2, MPEG-4, VP8, VP9 and AV1.
Further, the feature extraction module in step S4 includes K residual blocks, where each residual block includes two convolution layers and one activation layer, and the residual blocks are connected via skip connections.
Further, the feature alignment module in step S5 is used to propagate and aggregate the frame image feature information obtained in step S4, with propagation performed through a high-order grid propagation scheme. The process is divided into three stages: in the first stage, the features are propagated forward in increasing time order; in the second stage, the features are propagated backward in decreasing time order; in the third stage, the target frame information is connected with the feature information of the preceding and following frames. The feature alignment module includes a residual module, an optical-flow-guided feature alignment module and a spatio-temporal feature fusion module. The residual module includes K residual blocks, each comprising two convolution layers and one activation layer; the optical-flow-guided feature alignment module introduces optical flow information to assist deformable convolution in the feature alignment operation; and the spatio-temporal feature fusion module includes three convolution layers, a temporal attention mechanism and a spatial attention mechanism.
Further, the calculation process of the feature alignment module is as follows.
For backward propagation, the extracted feature information is first refined by a residual module:
$\hat{g}_i = R\left(C(f_i,\, g_i^{f})\right)$ (1)
where $R(\cdot)$ denotes the residual module, $C(\cdot)$ denotes concatenation along the channel dimension, $f_i$ denotes the feature information extracted by the feature extraction module, and $g_i^{f}$ denotes the features computed at time $i$ during forward propagation. In addition, the two frames before and the two frames after the target frame are aligned:
$\bar{g}_i = A\left(\hat{g}_i,\, g_{i-2}^{b},\, g_{i-1}^{b},\, g_{i+1}^{b},\, g_{i+2}^{b}\right)$ (2)
where $\bar{g}_i$ denotes the aligned features, $g_{i-2}^{b}$, $g_{i-1}^{b}$, $g_{i+1}^{b}$ and $g_{i+2}^{b}$ denote the backward-propagation features at times $i-2$, $i-1$, $i+1$ and $i+2$, and $A(\cdot)$ denotes the optical-flow-guided deformable feature alignment operation. Finally, the final feature map is obtained through the spatio-temporal feature fusion module:
$g_i^{b} = \mathrm{STF}\left(\hat{g}_i,\, \bar{g}_i\right)$ (3)
where $g_i^{b}$ denotes the final feature map, $\hat{g}_i$ denotes the deeper feature information extracted by the residual module, and $\mathrm{STF}(\cdot)$ denotes the spatio-temporal feature fusion module.
Further, in the optical-flow-guided deformable feature alignment module, an optical flow estimation network is first used to compute optical flow maps, where $s_{i-2\to i}$, $s_{i-1\to i}$, $s_{i+1\to i}$ and $s_{i+2\to i}$ denote the optical flow mappings from frames $i-2$, $i-1$, $i+1$ and $i+2$ to frame $i$; the two frames before and the two frames after the target frame $i$ are then warped:
$\tilde{g}_{i-2} = W\left(g_{i-2}^{b},\, s_{i-2\to i}\right)$ (4)
$\tilde{g}_{i-1} = W\left(g_{i-1}^{b},\, s_{i-1\to i}\right)$ (5)
$\tilde{g}_{i+1} = W\left(g_{i+1}^{b},\, s_{i+1\to i}\right)$ (6)
$\tilde{g}_{i+2} = W\left(g_{i+2}^{b},\, s_{i+2\to i}\right)$ (7)
where $\tilde{g}_{i-2}$, $\tilde{g}_{i-1}$, $\tilde{g}_{i+1}$ and $\tilde{g}_{i+2}$ denote the spatially warped features at times $i-2$, $i-1$, $i+1$ and $i+2$, and $W(\cdot)$ denotes the spatial warping operation. The pre-aligned features are then used to compute the optical flow residual $o_i$ and the deformable convolution mask $m_i$:
$o_i = \mathrm{Conv}_o\left(C(\hat{g}_i,\, \tilde{g}_{i-2},\, \tilde{g}_{i-1},\, \tilde{g}_{i+1},\, \tilde{g}_{i+2})\right)$ (8)
$m_i = \sigma\left(\mathrm{Conv}_m\left(C(\hat{g}_i,\, \tilde{g}_{i-2},\, \tilde{g}_{i-1},\, \tilde{g}_{i+1},\, \tilde{g}_{i+2})\right)\right)$ (9)
where $C(\cdot)$ denotes channel concatenation, $\mathrm{Conv}_o$ and $\mathrm{Conv}_m$ denote convolution computations, and $\sigma(\cdot)$ denotes the activation function. Finally, the aligned features are obtained by deformable convolution:
$\bar{g}_i = \mathrm{DCN}\left(C(g_{i-2}^{b},\, g_{i-1}^{b},\, g_{i+1}^{b},\, g_{i+2}^{b});\; s + o_i,\; m_i\right)$ (10)
where $\mathrm{DCN}(\cdot)$ denotes the deformable convolution operation, whose sampling offsets are the optical flows plus the residual $o_i$ and whose modulation mask is $m_i$.
Further, in the spatio-temporal feature fusion module, the deep feature information $\hat{g}_i$ extracted from the current frame and the aligned feature map $\bar{g}_i$ obtained from the neighbouring frames are first embedded by convolution, and the embedded similarity distance is computed:
$d_i = \sigma\left(\mathrm{Conv}(\hat{g}_i)\cdot \mathrm{Conv}(\bar{g}_i)\right)$ (11)
where $\sigma(\cdot)$ denotes the activation function, $\mathrm{Conv}$ denotes a convolution computation, and $\cdot$ denotes the dot product. The similarity distance is then processed by the temporal attention mechanism; the temporal attention is spatially varying, i.e. specific to each spatial position:
$M_i^{t} = \mathrm{TA}(d_i)$ (12)
where $M_i^{t}$ denotes the temporal attention map and $\mathrm{TA}(\cdot)$ denotes the temporal attention processing. The aligned feature map is multiplied pixel by pixel with the temporal attention map, and the temporally fused feature map is obtained through a convolution layer:
$F_i^{t} = \mathrm{Conv}_f\left(\bar{g}_i \odot M_i^{t}\right)$ (13)
where $\odot$ denotes the pixel-wise product and $\mathrm{Conv}_f$ denotes the fusion convolution. Finally, the fused features are processed by the spatial attention mechanism to strengthen texture information, giving the final feature map:
$g_i^{b} = \mathrm{SA}(F_i^{t}) \oplus F_i^{t}$ (14)
where $\mathrm{SA}(\cdot)$ denotes the spatial attention operation and $\oplus$ denotes element-wise addition.
Further, the pixel reorganization module in step S6 includes one reconstruction layer, four convolution layers, two pixel shuffle layers and three activation layers. The reconstruction layer is composed of K residual blocks with the same structure, each residual block including two convolution layers and one activation layer; the two pixel shuffle layers adopt a pixel shuffle strategy; the first and second activation layers both adopt the Leaky ReLU activation function, and the third activation layer adopts the ReLU activation function.
Further, the upsampling operation in step S7 employs one of nearest neighbor interpolation, bilinear interpolation, or bicubic interpolation.
The invention also provides a satellite video super-resolution reconstruction system for multiple degradation processes, comprising:
a processor and a memory, wherein the memory is used for storing program instructions, and the processor is used for calling the instructions stored in the memory to execute the satellite video super-resolution reconstruction method for multiple degradation processes according to the above technical scheme.
According to the above technical scheme, the invention provides a satellite video super-resolution reconstruction method and system for multiple degradation processes. A satellite video random degradation model is used to generate corresponding low-resolution data; the low-resolution data are passed through the feature extraction layer; the extracted features are propagated with the high-order grid bidirectional feature propagation scheme; and finally the spatio-temporal fusion feature alignment module aligns the features of the sequence images. This reduces the error accumulation of long image sequences, effectively recovers the high fidelity of the satellite video, significantly improves the reconstruction quality of the satellite video, and provides a certain denoising capability.
Drawings
Fig. 1 is a schematic diagram of the overall framework of the satellite video super-resolution reconstruction method for multiple degradation processes.
Fig. 2 is a schematic structural diagram of a satellite video random degradation model constructed in the invention.
Fig. 3 is a schematic structural diagram of a feature extraction module in the present invention.
Fig. 4 is a schematic diagram of a feature alignment module constructed in the present invention.
FIG. 5 is a schematic diagram of a deformable feature alignment module based on optical flow guidance constructed in the present invention.
FIG. 6 is a schematic diagram of a space-time feature fusion module constructed in the present invention.
Fig. 7 is a schematic diagram of a pixel reorganization module constructed in the present invention.
Fig. 8 shows the visual results of 2x super-resolution reconstruction on the three satellite video datasets according to the present invention.
Fig. 9 shows the visual results of 3x super-resolution reconstruction on the three satellite video datasets according to the present invention.
Fig. 10 shows the visual results of 4x super-resolution reconstruction on the three satellite video datasets according to the present invention.
Detailed Description
In order to make the objects and technical solutions of the present invention clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments, so that researchers in the related field can more easily understand its features and performance and the protection scope of the invention is defined more clearly.
The embodiment of the invention discloses a satellite video super-resolution reconstruction method for multiple degradation processes, comprising the following steps:
Step 1, a satellite video data set is constructed. The data set is cropped from 100 videos captured by a video satellite and covers a wide range of ground features, including cities, wharves, airports, suburbs, deserts and the like, and the videos contain dynamic scenes such as moving cars, airplanes and ships. The videos are cropped into 284 short clips, each containing 180 consecutive frames, with 240 clips used as the training set and 44 clips as the validation set; the frame size of each video is 1280x720.
Step 2, a high-resolution optical satellite video super-resolution reconstruction model for multiple degradation processes, as shown in FIG. 1, is constructed; it mainly comprises the satellite video random degradation module shown in FIG. 2, the feature extraction module shown in FIG. 3, the feature alignment module shown in FIG. 4, and the pixel reorganization module shown in FIG. 7.
Step 3, the data set obtained in step 1 is passed through the satellite video random degradation module to obtain the corresponding low-resolution video frame data set, and data enhancement operations such as rotation, translation and scaling are performed on the data to expand the sample library.
Specifically, the satellite video random degradation model includes a downsampling module, four random degradation modules, and a video compression module, as shown in FIG. 2. The original high-resolution frame image is first passed through the downsampling module; each time the image passes through this module, one of three downsampling modes (nearest-neighbor interpolation, bilinear interpolation, or bicubic interpolation) is randomly selected to downsample the image. The downsampled image is then passed through the four random degradation modules, each of which contains four degradation modes (random blur, motion blur, random noise, and sensor tremor); one mode is randomly selected each time the image passes through a random degradation module. The degraded image is finally passed through the video compression module, which comprises two parts, intra-frame compression and inter-frame compression: the intra-frame compression is JPEG compression, and the inter-frame compression comprises three modes, namely H.264, H.265 and MPEG-4, one of which is randomly selected each time the image undergoes video compression. This yields the final low-resolution frame image, on which data enhancement operations such as rotation, translation and scaling are performed to expand the sample library.
In a specific embodiment, the downsampling modes are nearest-neighbor interpolation, bilinear interpolation and bicubic interpolation, and the probability of each downsampling mode being selected is 1/3. The random blur employs an isotropic blur kernel, an anisotropic blur kernel, a generalized isotropic blur kernel, a generalized anisotropic blur kernel, a plateau isotropic blur kernel, a plateau anisotropic blur kernel and a sinc-function blur kernel, and the probabilities of the blur kernel types being selected are set to 0.405, 0.225, 0.108, 0.027 and 0.1, respectively. The standard deviation of the Gaussian blur in the x and y directions ranges over [0.2, 3]; a value is randomly selected between 0.2 and 3 to control the strength of the blur. The rotation angle of the blur kernel ranges over [-pi, pi]; the kernel can rotate randomly within this range to simulate blur in different directions. The generalized Gaussian blur kernel parameter ranges over [0.5, 4], where larger values mean the distribution is closer to uniform and smaller values give sharper distributions. Motion blur mainly describes the motion direction and intensity of a moving object: the motion directions include horizontal, vertical and diagonal motion blur, with selection probabilities of 0.4, 0.4 and 0.2, respectively; the step size of the motion-direction change is 1 degree, so that the direction of the motion blur can be controlled finely; and the motion intensity is randomly drawn from [2, 10] with a step size of 1, making the motion-blur variation smoother and more accurate. The random noise is mainly Gaussian noise and Poisson noise, each selected with probability 50%; the standard deviation of the Gaussian noise is randomly selected from [1, 30] with a step size of 0.1, and the scaling factor of the Poisson noise is randomly selected from [0.05, 3] with a step size of 0.005. Sensor tremor models a high-frequency vibration and can generally be expressed as a function of time; the probability of sensor tremor is randomly selected in the range [0.05, 0.6]; the tremor includes horizontal, vertical and random-direction tremor, with random probability ranges of [0, 0.3], [0, 0.3] and [0, 0.4], respectively, and the tremor intensity is randomly selected in the range [0.1, 2]. The intra-frame JPEG compression quality ranges over [30, 95], and the quality value is randomly selected within this interval. There are three alternative encoders for inter-frame compression, H.264, H.265 and MPEG-4, each selected with equal probability 1/3.
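For illustration, a minimal sketch of how such a random degradation pipeline can be composed is given below. It assumes NumPy and OpenCV, keeps only a subset of the degradations (isotropic Gaussian blur, Gaussian or Poisson noise, JPEG intra-frame compression) and simplified probabilities, and the function names are hypothetical; it is not the patented implementation.

import cv2
import numpy as np

def random_downsample(img, scale=4):
    # Randomly pick one of three interpolation modes with equal probability (1/3 each).
    mode = np.random.choice([cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC])
    h, w = img.shape[:2]
    return cv2.resize(img, (w // scale, h // scale), interpolation=mode)

def random_blur(img):
    # Isotropic Gaussian blur with a standard deviation drawn from [0.2, 3].
    sigma = np.random.uniform(0.2, 3.0)
    return cv2.GaussianBlur(img, ksize=(0, 0), sigmaX=sigma, sigmaY=sigma)

def random_noise(img):
    # Gaussian or Poisson noise, each chosen with probability 0.5.
    if np.random.rand() < 0.5:
        std = np.random.uniform(1, 30)
        noisy = img.astype(np.float32) + np.random.normal(0, std, img.shape)
    else:
        scale = np.random.uniform(0.05, 3.0)
        noisy = np.random.poisson(img.astype(np.float32) * scale) / scale
    return np.clip(noisy, 0, 255).astype(np.uint8)

def jpeg_compress(img):
    # Intra-frame compression: JPEG with a quality factor drawn from [30, 95].
    quality = int(np.random.uniform(30, 95))
    ok, buf = cv2.imencode('.jpg', img, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

def degrade_frame(hr_frame, scale=4):
    # Downsample -> random blur -> random noise -> compression, mirroring the module
    # order in FIG. 2 (motion blur, sensor tremor and inter-frame compression omitted).
    lr = random_downsample(hr_frame, scale)
    lr = random_blur(lr)
    lr = random_noise(lr)
    return jpeg_compress(lr)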
Step 4, the low-resolution frame image obtained in step 3 is passed through the feature extraction module to obtain its features. The feature extraction module is composed of five residual blocks with the same structure, as shown in FIG. 3; each residual block comprises two convolution layers and one activation layer, and the residual blocks are connected via skip connections.
In a specific embodiment, the convolution kernels of the two convolution layers are 3×3, the step size is 2, the padding is 1, the number of input channels is 3, the number of intermediate features is 64, the number of output channels is 64, the active layers adopt a ReLU activation function, and the slope is set to 0.1.
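A minimal PyTorch sketch of such a residual-block feature extractor is shown below. It is illustrative only: stride 1 is used inside the residual blocks so that the skip connection can be added directly, and the class and parameter names are assumptions rather than the exact implementation.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Two 3x3 convolutions and one activation, with a skip connection around them.
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

class FeatureExtractor(nn.Module):
    # Maps a 3-channel low-resolution frame to 64-channel features through K residual blocks.
    def __init__(self, in_channels=3, channels=64, num_blocks=5):
        super().__init__()
        self.head = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.body = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])

    def forward(self, frame):
        return self.body(self.head(frame))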
Step 5, information propagation and aggregation are performed on the features obtained in step 4, mainly through a high-order grid propagation scheme. The process is divided into three stages: in the first stage, the features are propagated forward in increasing time order; in the second stage, the features are propagated backward in decreasing time order; in the third stage, the target frame information is connected with the feature information of the preceding and following frames, which gathers feature information from different positions and improves the effectiveness of the model in occluded areas, as shown in the dotted part of FIG. 1. During feature transfer, the input feature information needs to be aligned; as shown in FIG. 4, the feature alignment mainly comprises a residual module, an optical-flow-guided deformable feature alignment module and a spatio-temporal feature fusion module.
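The propagation schedule can be sketched as follows. This is a simplified second-order bidirectional loop, with a generic align_and_fuse callable standing in for the flow-guided alignment and spatio-temporal fusion described later; the exact grid schedule of the invention may differ.

import torch

def bidirectional_propagate(frame_feats, align_and_fuse):
    # frame_feats: list of per-frame feature maps [f_0, ..., f_{T-1}], each of shape (C, H, W).
    T = len(frame_feats)
    zeros = torch.zeros_like(frame_feats[0])

    # Stage 1: forward pass, each frame receives features propagated from earlier frames.
    forward = []
    for i in range(T):
        prev1 = forward[i - 1] if i >= 1 else zeros
        prev2 = forward[i - 2] if i >= 2 else zeros
        forward.append(align_and_fuse(frame_feats[i], prev1, prev2))

    # Stage 2: backward pass over the forward features, in decreasing time order.
    backward = [None] * T
    for i in reversed(range(T)):
        nxt1 = backward[i + 1] if i + 1 < T else zeros
        nxt2 = backward[i + 2] if i + 2 < T else zeros
        backward[i] = align_and_fuse(forward[i], nxt1, nxt2)

    # Stage 3: connect the target frame with the features gathered in both passes.
    return [torch.cat([frame_feats[i], forward[i], backward[i]], dim=0) for i in range(T)]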
Specifically, taking backward propagation as an example, the extracted features are first refined by a residual module:
$\hat{g}_i = R\left(C(f_i,\, g_i^{f})\right)$ (1)
where $R(\cdot)$ denotes the residual module, $C(\cdot)$ denotes concatenation along the channel dimension, $f_i$ denotes the feature information extracted by the feature extraction module, and $g_i^{f}$ denotes the features computed at time $i$ during forward propagation. In addition, the two frames before and the two frames after the target frame are aligned:
$\bar{g}_i = A\left(\hat{g}_i,\, g_{i-2}^{b},\, g_{i-1}^{b},\, g_{i+1}^{b},\, g_{i+2}^{b}\right)$ (2)
where $\bar{g}_i$ denotes the aligned features, $g_{i-2}^{b}$, $g_{i-1}^{b}$, $g_{i+1}^{b}$ and $g_{i+2}^{b}$ denote the backward-propagation features at times $i-2$, $i-1$, $i+1$ and $i+2$, and $A(\cdot)$ denotes the optical-flow-guided deformable feature alignment operation. Finally, the final feature map is obtained through the spatio-temporal feature fusion module:
$g_i^{b} = \mathrm{STF}\left(\hat{g}_i,\, \bar{g}_i\right)$ (3)
where $g_i^{b}$ denotes the final feature map, $\hat{g}_i$ denotes the deeper feature information extracted by the residual module, and $\mathrm{STF}(\cdot)$ denotes the spatio-temporal feature fusion operation.
In the optical-flow-guided deformable feature alignment module, as shown in FIG. 5, an optical flow estimation network is first employed to compute optical flow maps, where $s_{i-2\to i}$, $s_{i-1\to i}$, $s_{i+1\to i}$ and $s_{i+2\to i}$ denote the optical flow mappings from frames $i-2$, $i-1$, $i+1$ and $i+2$ to frame $i$; the two frames before and the two frames after the target frame $i$ are then warped:
$\tilde{g}_{i-2} = W\left(g_{i-2}^{b},\, s_{i-2\to i}\right)$ (4)
$\tilde{g}_{i-1} = W\left(g_{i-1}^{b},\, s_{i-1\to i}\right)$ (5)
$\tilde{g}_{i+1} = W\left(g_{i+1}^{b},\, s_{i+1\to i}\right)$ (6)
$\tilde{g}_{i+2} = W\left(g_{i+2}^{b},\, s_{i+2\to i}\right)$ (7)
where $\tilde{g}_{i-2}$, $\tilde{g}_{i-1}$, $\tilde{g}_{i+1}$ and $\tilde{g}_{i+2}$ denote the spatially warped features at times $i-2$, $i-1$, $i+1$ and $i+2$, and $W(\cdot)$ denotes the spatial warping operation. The pre-aligned features are then used to compute the optical flow residual $o_i$ and the deformable convolution mask $m_i$:
$o_i = \mathrm{Conv}_o\left(C(\hat{g}_i,\, \tilde{g}_{i-2},\, \tilde{g}_{i-1},\, \tilde{g}_{i+1},\, \tilde{g}_{i+2})\right)$ (8)
$m_i = \sigma\left(\mathrm{Conv}_m\left(C(\hat{g}_i,\, \tilde{g}_{i-2},\, \tilde{g}_{i-1},\, \tilde{g}_{i+1},\, \tilde{g}_{i+2})\right)\right)$ (9)
where $C(\cdot)$ denotes channel concatenation, $\mathrm{Conv}_o$ and $\mathrm{Conv}_m$ denote convolution computations, and $\sigma(\cdot)$ denotes the activation function. Finally, the aligned features are obtained by deformable convolution:
$\bar{g}_i = \mathrm{DCN}\left(C(g_{i-2}^{b},\, g_{i-1}^{b},\, g_{i+1}^{b},\, g_{i+2}^{b});\; s + o_i,\; m_i\right)$ (10)
where $\mathrm{DCN}(\cdot)$ denotes the deformable convolution operation, whose sampling offsets are the optical flows plus the residual $o_i$ and whose modulation mask is $m_i$.
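A sketch of such flow-guided deformable alignment for a single neighbouring frame, using torchvision's deform_conv2d, is given below. The offset and mask heads, channel sizes and the (dy, dx) offset ordering are illustrative assumptions, not the exact network of the embodiment.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.ops as ops

def flow_warp(feat, flow):
    # Bilinear spatial warping of features according to the optical flow.
    n, _, h, w = feat.shape
    gy, gx = torch.meshgrid(torch.arange(h, device=feat.device, dtype=feat.dtype),
                            torch.arange(w, device=feat.device, dtype=feat.dtype), indexing='ij')
    grid_x = (gx.unsqueeze(0) + flow[:, 0]) / max(w - 1, 1) * 2 - 1    # normalise to [-1, 1]
    grid_y = (gy.unsqueeze(0) + flow[:, 1]) / max(h - 1, 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)                       # (N, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

class FlowGuidedDeformAlign(nn.Module):
    # Sampling offsets = optical flow + learned residual, plus a modulation mask,
    # fed into a modulated deformable convolution (deformable convolution v2).
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        n = kernel_size * kernel_size
        self.offset_head = nn.Conv2d(channels * 2 + 2, 2 * n, 3, padding=1)
        self.mask_head = nn.Conv2d(channels * 2 + 2, n, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(channels, channels, kernel_size, kernel_size) * 0.01)

    def forward(self, target_feat, neighbour_feat, flow):
        # flow: (N, 2, H, W) optical flow from the neighbouring frame to the target frame.
        warped = flow_warp(neighbour_feat, flow)                       # pre-alignment by warping
        x = torch.cat([target_feat, warped, flow], dim=1)
        residual = self.offset_head(x)                                 # learned flow residual
        offset = flow.flip(1).repeat(1, self.kernel_size ** 2, 1, 1) + residual  # (dy, dx) per kernel point
        mask = torch.sigmoid(self.mask_head(x))                        # deformable convolution mask
        return ops.deform_conv2d(neighbour_feat, offset, self.weight, padding=1, mask=mask)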
In the spatio-temporal feature fusion module, a temporal attention mechanism and a spatial attention mechanism are adopted; by redistributing weights, the attention mechanisms help the model select important information from adjacent frames, reduce erroneous information, and make more effective use of longer sequences, as shown in FIG. 6. The deep feature information $\hat{g}_i$ extracted from the current frame and the aligned feature map $\bar{g}_i$ obtained from the neighbouring frames are first embedded by convolution, and the embedded similarity distance is computed:
$d_i = \sigma\left(\mathrm{Conv}(\hat{g}_i)\cdot \mathrm{Conv}(\bar{g}_i)\right)$ (11)
where $\sigma(\cdot)$ denotes the activation function, $\mathrm{Conv}$ denotes a convolution computation, and $\cdot$ denotes the dot product. The similarity distance is then processed by the temporal attention mechanism; the temporal attention is spatially varying, i.e. specific to each spatial position:
$M_i^{t} = \mathrm{TA}(d_i)$ (12)
where $M_i^{t}$ denotes the temporal attention map and $\mathrm{TA}(\cdot)$ denotes the temporal attention processing. The aligned feature map is then multiplied pixel by pixel with the temporal attention map, and the temporally fused feature map is obtained through a convolution layer:
$F_i^{t} = \mathrm{Conv}_f\left(\bar{g}_i \odot M_i^{t}\right)$ (13)
where $\odot$ denotes the pixel-wise product and $\mathrm{Conv}_f$ denotes the fusion convolution. Finally, the fused features are processed by the spatial attention mechanism to strengthen texture information and the like, giving the final feature map:
$g_i^{b} = \mathrm{SA}(F_i^{t}) \oplus F_i^{t}$ (14)
where $\mathrm{SA}(\cdot)$ denotes the spatial attention operation and $\oplus$ denotes element-wise addition.
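A compact sketch of the spatio-temporal fusion described by equations (11) to (14) is given below; the module and parameter names are assumptions, and the spatial attention is implemented here as a simple sigmoid gate followed by a residual addition.

import torch
import torch.nn as nn

class SpatioTemporalFusion(nn.Module):
    # Temporal attention: per-pixel similarity between the target features and each aligned
    # neighbour, used to re-weight the neighbours before a fusion convolution.
    # Spatial attention: a gate that emphasises textured regions, added back residually.
    def __init__(self, channels=64, num_neighbours=4):
        super().__init__()
        self.embed_target = nn.Conv2d(channels, channels, 3, padding=1)
        self.embed_neigh = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(channels * num_neighbours, channels, 1)
        self.spatial_att = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, target_feat, aligned_neighbours):
        # aligned_neighbours: list of (N, C, H, W) feature maps aligned to the target frame.
        q = self.embed_target(target_feat)
        weighted = []
        for neigh in aligned_neighbours:
            k = self.embed_neigh(neigh)
            att = torch.sigmoid(torch.sum(q * k, dim=1, keepdim=True))  # temporal attention map, eq. (11)-(12)
            weighted.append(neigh * att)                                # pixel-wise re-weighting
        fused = self.fuse(torch.cat(weighted, dim=1))                   # fusion convolution, eq. (13)
        return fused + fused * self.spatial_att(fused)                  # spatial attention and residual add, eq. (14)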
In a specific embodiment, the residual module used in feature alignment is composed of three residual blocks with the same structure; each residual block comprises two convolution layers and one activation layer, and the residual blocks are connected via skip connections. The convolution kernels of the two convolution layers are 3x3 with a step size of 2 and padding of 1; the number of input channels is 64, the number of intermediate features is 64, and the number of output channels is 128. The activation layer adopts a ReLU activation function with the slope set to 0.1.
Step 6, the aligned features obtained in step 5 are passed through the pixel reorganization module to obtain the residual features of the high-resolution image.
Specifically, the pixel reorganization module includes one reconstruction layer, four convolution layers, two pixel shuffle layers and three activation layers, as shown in FIG. 7. The feature map obtained in step 5 first undergoes further feature extraction in the reconstruction layer, which is composed of five residual blocks with the same structure; its structure is the same as that of the feature extraction module in step 4, only the numbers of input and output channels differ. The number of channels and the image size are then increased through two convolutions and two pixel shuffles, and finally two more convolutions produce the residual features of the high-resolution frame image with 3 channels.
In a specific embodiment, the number of input channels of the reconstruction layer is 320 and the number of output channels is 64.
The first convolution layer has 64 input channels, 256 output channels, a 3x3 convolution kernel, a step size of 1 and padding of 1; this layer amplifies the number of channels by a factor of 4 for use with the subsequent pixel shuffling operation. The second convolution layer has 64 input channels, 256 output channels, a 3x3 convolution kernel, a step size of 1 and padding of 1, and is used to further amplify the number of channels. The third convolution layer has 64 input channels, 64 output channels, a 3x3 convolution kernel, a step size of 1 and padding of 1. The fourth convolution layer has 64 input channels, 3 output channels, a 3x3 convolution kernel, a step size of 1 and padding of 1. The first and second activation layers both use the Leaky ReLU activation function, the third activation layer uses the ReLU activation function, and the slope of the activation function is set to 0.1. The pixel shuffling layers adopt a pixel shuffle pack strategy: the pixel shuffle pack is an enhanced pixel shuffle module; whereas standard pixel shuffling simply redistributes channel data to the spatial dimension, the pixel shuffle pack also applies a convolution before shuffling to improve the expressive power of the features and the upsampling quality. The first pixel shuffle layer has 256 input channels, 64 output channels, a scale factor of 2 and an upsampling convolution kernel size of 3; the second pixel shuffle layer has 256 input channels, 64 output channels, a scale factor of 2 and an upsampling convolution kernel size of 3.
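A PyTorch sketch of this pixel reorganization head is shown below. It reuses the ResidualBlock class from the feature-extraction sketch above, and the layer layout follows the description only approximately (a separate channel-reduction convolution is placed in front of the reconstruction blocks); the names and defaults are assumptions.

import torch
import torch.nn as nn

class PixelReorganization(nn.Module):
    # Reconstruction residual blocks, two conv + 2x pixel-shuffle stages (4x in total),
    # and a final pair of convolutions producing a 3-channel residual image.
    def __init__(self, in_channels=320, channels=64, num_blocks=5):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.recon = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])  # from earlier sketch
        self.up1 = nn.Conv2d(channels, channels * 4, 3, padding=1)   # 64 -> 256, feeds the first shuffle
        self.up2 = nn.Conv2d(channels, channels * 4, 3, padding=1)   # 64 -> 256, feeds the second shuffle
        self.shuffle = nn.PixelShuffle(2)
        self.conv_hr = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_out = nn.Conv2d(channels, 3, 3, padding=1)
        self.lrelu = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, aligned_feat):
        x = self.recon(self.reduce(aligned_feat))
        x = self.lrelu(self.shuffle(self.up1(x)))   # first 2x upsampling
        x = self.lrelu(self.shuffle(self.up2(x)))   # second 2x upsampling (4x total)
        x = torch.relu(self.conv_hr(x))
        return self.conv_out(x)                     # 3-channel residual features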
Step 7, the input low-resolution frame image is interpolated and added to the residual features of the high-resolution frame image obtained in step 6 to obtain the final reconstructed image.
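This final step amounts to a bicubic upsampling plus a residual addition, for example (a sketch assuming an (N, C, H, W) tensor layout and a 4x overall scale):

import torch.nn.functional as F

def reconstruct(lr_frame, residual, scale=4):
    # Step 7: bicubic upsampling of the low-resolution frame plus the learned residual.
    upsampled = F.interpolate(lr_frame, scale_factor=scale, mode='bicubic', align_corners=False)
    return upsampled + residual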
In a specific embodiment, bicubic interpolation is used to interpolate the low resolution frame image, and PSNR and SSIM are used as objective evaluation indexes, where the evaluation indexes are obtained by performing average calculation on all frames of each tested satellite video, and testing is performed in three satellite video data sets, the objective evaluation results are shown in table 1, and the visualization results are shown in fig. 8, 9 and 10, respectively.
Table 1. Objective evaluation results for different super-resolution magnifications on the three satellite video datasets
Fig. 8 shows the visual results of 2x super-resolution reconstruction on the three satellite video datasets. In FIG. 8, A is the original frame image of scene 033 of dataset 1, A1 shows an enlarged detail of the original frame image, and A2 shows the 2x super-resolution reconstruction result of the corresponding position; B is the original frame image of scene 000 of dataset 2, B1 shows an enlarged detail of the original frame image, and B2 shows the 2x super-resolution reconstruction result of the corresponding position; C is the original frame image of scene 001 of dataset 3, C1 shows an enlarged detail of the original frame image, and C2 shows the 2x super-resolution reconstruction result of the corresponding position.
Fig. 9 shows the visual results of 3x super-resolution reconstruction on the three satellite video datasets. In FIG. 9, A is the original frame image of scene 026 of dataset 1, A1 shows an enlarged detail of the original frame image, and A2 shows the 3x super-resolution reconstruction result of the corresponding position; B is the original frame image of scene 016 of dataset 2, B1 shows an enlarged detail of the original frame image, and B2 shows the 3x super-resolution reconstruction result of the corresponding position; C is the original frame image of scene 001 of dataset 3, C1 shows an enlarged detail of the original frame image, and C2 shows the 3x super-resolution reconstruction result of the corresponding position.
Fig. 10 shows the visual results of 4x super-resolution reconstruction on the three satellite video datasets. In FIG. 10, A is the original frame image of scene 030 of dataset 1, A1 shows an enlarged detail of the original frame image, and A2 shows the 4x super-resolution reconstruction result of the corresponding position; B is the original frame image of scene 001 of dataset 2, B1 shows an enlarged detail of the original frame image, and B2 shows the 4x super-resolution reconstruction result of the corresponding position; C is the original frame image of scene 005 of dataset 3, C1 shows an enlarged detail of the original frame image, and C2 shows the 4x super-resolution reconstruction result of the corresponding position.
In summary, the invention addresses the poor reconstruction quality caused by multi-factor degradation of satellite video data in real scenes by modelling various noises in the satellite video random degradation module. The high-order grid bidirectional feature propagation scheme propagates feature information alternately forward and backward in time, so that information from different frames can be revisited and the features refined, improving their expressive power. Temporal features are fused during feature alignment, which helps the model select important information from adjacent frames and reduces erroneous information, thereby using longer image sequences more effectively. Finally, experiments on three satellite video datasets demonstrate the effectiveness of the method both qualitatively and quantitatively.
On the other hand, the embodiment of the invention also provides a satellite video super-resolution reconstruction system for multiple degradation processes, comprising:
a processor and a memory, wherein the memory is used for storing program instructions, and the processor is used for calling the instructions stored in the memory to execute the satellite video super-resolution reconstruction method for multiple degradation processes according to the above technical scheme.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1.一种面向多重退化过程的卫星视频超分重建方法,其特征在于,包括如下步骤:1. A satellite video super-resolution reconstruction method for multiple degradation processes, characterized in that it comprises the following steps: 步骤S1,获取卫星视频数据集;Step S1, obtaining a satellite video data set; 步骤S2,构建面向多重退化过程的卫星视频超分重建模型,包括卫星视频随机退化模块、特征提取模块、特征对齐模块和像素重组模块;Step S2, constructing a satellite video super-resolution reconstruction model for multiple degradation processes, including a satellite video random degradation module, a feature extraction module, a feature alignment module and a pixel reconstruction module; 步骤S3,卫星视频数据集中的卫星视频数据通过卫星视频随机退化模块,得到低分辨率帧图像,并进行数据增强操作,得到低分辨率帧数据;Step S3, the satellite video data in the satellite video data set is subjected to a satellite video random degradation module to obtain a low-resolution frame image, and a data enhancement operation is performed to obtain low-resolution frame data; 步骤S4,将步骤S3得到的低分辨率帧数据经过特征提取模块,得到帧图像特征信息;Step S4, passing the low-resolution frame data obtained in step S3 through a feature extraction module to obtain frame image feature information; 步骤S5,将步骤S4得到的帧图像特征信息输入特征对齐模块,对输入的特征进行传播、聚合以及对齐操作,得到对齐后的特征;Step S5, inputting the frame image feature information obtained in step S4 into a feature alignment module, performing propagation, aggregation and alignment operations on the input features to obtain aligned features; 步骤S6,将步骤S5得到的对齐后的特征通过像素重组模块,得到高分辨率图像的残差特征;Step S6, passing the aligned features obtained in step S5 through a pixel reorganization module to obtain residual features of the high-resolution image; 步骤S7,将步骤S3得到的低分辨率帧图像进行上采样操作,再与步骤S6得到的高分辨率图像的残差特征进行相加,得到最终的高分辨率重建图像。Step S7, upsampling the low-resolution frame image obtained in step S3, and then adding the residual features of the high-resolution image obtained in step S6 to obtain a final high-resolution reconstructed image. 2.根据权利要求1所述的一种面向多重退化过程的卫星视频超分重建方法,其特征在于:步骤S3中所述的卫星视频随机退化模块包括一个下采样模块、四个随机退化模块和一个视频压缩模块;下采样模块采用最邻近插值、双线性插值或双三次插值中的一种下采样方式对图像进行下采样处理;四个随机退化模块中,每个随机退化模块采用随机模糊、运动模糊、随机噪声或传感器震颤中的一种退化模式对图像进行退化处理;视频压缩模块采用帧内压缩或帧间压缩实现,其中帧内压缩包括JPEG压缩、JPEG 2000压缩、PNG压缩、WEBP压缩、BMP压缩、TIFF压缩、BPG压缩和FLIF压缩,帧间压缩包括H.263、H.264、H.265、MJPEG、MJPEG-2、MPEG-4、VP8、VP9和AV1。2. A satellite video super-resolution reconstruction method for multiple degradation processes according to claim 1, characterized in that: the satellite video random degradation module described in step S3 includes a downsampling module, four random degradation modules and a video compression module; the downsampling module uses a downsampling method selected from nearest neighbor interpolation, bilinear interpolation or bicubic interpolation to downsample the image; each of the four random degradation modules uses a degradation mode selected from random blur, motion blur, random noise or sensor tremor to degrade the image; the video compression module is implemented by intra-frame compression or inter-frame compression, wherein the intra-frame compression includes JPEG compression, JPEG 2000 compression, PNG compression, WEBP compression, BMP compression, TIFF compression, BPG compression and FLIF compression, and the inter-frame compression includes H.263, H.264, H.265, MJPEG, MJPEG-2, MPEG-4, VP8, VP9 and AV1. 3.根据权利要求1所述的一种面向多重退化过程的卫星视频超分重建方法,其特征在于:步骤S4中所述的特征提取模块包括K个残差块,每个残差块包括两个卷积层和一个激活层,残差块之间通过跳跃连接的方式进行连接。3. 
The satellite video super-resolution reconstruction method for multiple degradation processes according to claim 1 is characterized in that: the feature extraction module described in step S4 includes K residual blocks, each residual block includes two convolutional layers and one activation layer, and the residual blocks are connected by skip connections. 4.根据权利要求1所述的一种面向多重退化过程的卫星视频超分重建方法,其特征在于:步骤S5中所述的特征对齐模块用于对步骤S4得到的帧图像特征信息进行信息的传播和聚合,通过高阶格网传播方案进行传播,这个过程分为三个阶段:第一个阶段将特征沿时间递增进行前向传播,第二个阶段是将特征沿时间递减进行后向传播,第三个阶段是将目标帧信息与前后两帧的特征信息进行连接;所述特征对齐模块包括一个残差模块,一个基于光流引导的特征对齐模块和一个时空特征融合模块;残差模块包含K个残差块,每个残差块包括两个卷积层和一个激活层;基于光流引导的特征对齐模块是引入光流信息来辅助可变形卷积进行特征对齐操作;时空特征融合模块包括三个卷积层,一个时间注意力机制和一个空间注意力机制。4. According to claim 1, a satellite video super-resolution reconstruction method for multiple degradation processes is characterized in that: the feature alignment module described in step S5 is used to propagate and aggregate the frame image feature information obtained in step S4, and propagate it through a high-order grid propagation scheme. This process is divided into three stages: the first stage propagates the features forward along the time increment, the second stage propagates the features backward along the time decrement, and the third stage connects the target frame information with the feature information of the previous and next two frames; the feature alignment module includes a residual module, a feature alignment module based on optical flow guidance and a spatiotemporal feature fusion module; the residual module includes K residual blocks, each residual block includes two convolutional layers and an activation layer; the feature alignment module based on optical flow guidance introduces optical flow information to assist deformable convolution in feature alignment operation; the spatiotemporal feature fusion module includes three convolutional layers, a temporal attention mechanism and a spatial attention mechanism. 5.根据权利要求4所述的一种面向多重退化过程的卫星视频超分重建方法,其特征在于:特征对齐模块的计算过程如下:5. The satellite video super-resolution reconstruction method for multiple degradation processes according to claim 4 is characterized in that the calculation process of the feature alignment module is as follows: 针对后向传播,首先用一个残差模块对提取的特征信息进一步进行提取:For backward propagation, a residual module is first used to further extract the extracted feature information: (1) (1) 其中,表示残差模块,表示通道维度的级联操作,表示经过特征提取模块所提取的特征信息,表示前向传播在第i个时间时计算的特征;此外,将目标帧的前后两帧进行对齐:in, represents the residual module, represents the cascade operation of the channel dimension, Represents the feature information extracted by the feature extraction module. Represents the features calculated by the forward propagation at the i-th time; in addition, the two frames before and after the target frame are aligned: (2) (2) 其中,表示对齐后的特征,分别表示后向传播的第i-2,i-1,i+1和i+2时刻的特征,表示基于光流引导的可变形特征对齐模块;最后通过一个时空特征融合模块,得到最终的特征图:in, represents the aligned features, , , and Respectively represent the features of the i-2th, i-1th, i+1th and i+2th moments of the backward propagation, It represents the deformable feature alignment module guided by optical flow; finally, a spatiotemporal feature fusion module is used to obtain the final feature map: (3) (3) 其中,表示最终的特征图,表示经过残差模块提取到的更深层的特征信息,表示时空特征融合模块。in, represents the final feature map, Represents the deeper feature information extracted by the residual module. Represents the spatiotemporal feature fusion module. 6.根据权利要求5所述的一种面向多重退化过程的卫星视频超分辨率重建方法,其特征在于:在基于光流引导的可变形特征对齐模块中,首先采用光流估计网络来计算光流图,用分别表示光流从第帧到第帧的映射,对目标帧第i帧的前后两帧图像均进行扭曲:6. 
The satellite video super-resolution reconstruction method for multiple degradation processes according to claim 5 is characterized in that: in the deformable feature alignment module based on optical flow guidance, an optical flow estimation network is first used to calculate the optical flow map, and then , , , Respectively represent the optical flow from , , , Frame to Frame mapping, distorting the two frames before and after the target frame i: (4) (4) (5) (5) (6) (6) (7) (7) 其中,分别表示第时刻的光流信息进行空间扭曲后的特征,表示空间扭曲操作,然后使用预对齐的特征计算光流残差和可变形卷积掩膜in, , , , Respectively represent , , , The characteristics of the optical flow information at each moment after spatial distortion, Represents the spatial warping operation, and then uses the pre-aligned features to calculate the optical flow residual and deformable convolutional masks : (8) (8) (9) (9) 其中,表示通道连接,表示卷积计算,表示激活函数,最后通过可变形卷积得到对齐后的特征in, Indicates channel connection, and represents the convolution calculation, Represents the activation function, and finally the aligned features are obtained through deformable convolution : (10) (10) 其中,DCN表示可变形卷积操作。Among them, DCN represents deformable convolution operation. 7.根据权利要求6所述的一种面向多重退化过程的卫星视频超分辨率重建方法,其特征在于:在时空特征融合模块中,首先对当前帧提取的深层特征信息和后续帧对齐后得到的特征图进行卷积运算,计算嵌入后的相似距离7. The satellite video super-resolution reconstruction method for multiple degradation processes according to claim 6 is characterized in that: in the spatiotemporal feature fusion module, the deep feature information extracted from the current frame is firstly Feature map obtained after alignment with subsequent frames Perform convolution operation to calculate the similarity distance after embedding : (11) (11) 其中,表示激活函数,conv表示卷积计算,表示点积;然后对相似距离进行时间注意力机制处理,对于每个空间位置,时间注意力都是具有空间异性的;in, represents the activation function, conv represents the convolution calculation, represents the dot product; then the similar distance is processed by the temporal attention mechanism, and for each spatial position, the temporal attention is spatially heterogeneous; (12) (12) 其中,表示时间注意力图,表示时间注意力处理;再将对齐后的特征图与时间注意力图逐像素相乘,通过卷积层得到时间注意力融合处理的特征图in, represents the temporal attention map, Represents the temporal attention processing; then the aligned feature map is multiplied pixel by pixel with the temporal attention map, and the feature map of the temporal attention fusion processing is obtained through the convolution layer : (13) (13) 其中,表示逐像素点积运算,表示融合卷积运算;最后再对融合后的特征进行空间注意力机制处理,以增强纹理信息,得到最终的特征图in, represents pixel-by-pixel dot product operation, Represents the fused convolution operation; finally, the fused features are processed by the spatial attention mechanism to enhance the texture information and obtain the final feature map : (14) (14) 其中,SA表示空间注意力操作,表示逐元素相加。Among them, SA represents the spatial attention operation, Represents element-by-element addition. 8.根据权利要求1所述的一种面向多重退化过程的卫星视频超分辨率重建方法,其特征在于:步骤S6中所述的像素重组模块包括一个重建层、四个卷积层、两个像素洗牌层和三个激活层;重建层由K个具有相同结构的残差块组成,每个残差块包括两个卷积层和一个激活层;两个像素洗牌层采用像素洗牌策略;第一个和第二个激活层均采用Leaky ReLU激活函数,第三个激活层采用ReLU激活函数。8. 
8. The satellite video super-resolution reconstruction method for multiple degradation processes according to claim 1, characterized in that: the pixel reorganization module in step S6 comprises one reconstruction layer, four convolutional layers, two pixel shuffle layers and three activation layers; the reconstruction layer consists of K residual blocks of identical structure, each residual block comprising two convolutional layers and one activation layer; the two pixel shuffle layers adopt the pixel shuffle strategy; the first and second activation layers both use the Leaky ReLU activation function, and the third activation layer uses the ReLU activation function.

9. The satellite video super-resolution reconstruction method for multiple degradation processes according to claim 1, characterized in that: the upsampling operation in step S7 uses one of nearest-neighbor interpolation, bilinear interpolation and bicubic interpolation.

10. A satellite video super-resolution reconstruction system for multiple degradation processes, characterized in that it comprises: a processor and a memory, wherein the memory is used to store program instructions and the processor is used to call the instructions stored in the memory to execute the satellite video super-resolution reconstruction method for multiple degradation processes according to any one of claims 1 to 9.
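As an illustration of claims 8 and 9, the following sketch shows a 4x pixel-shuffle reconstruction head with one reconstruction layer, four convolutional layers, two pixel shuffle layers and three activations (LeakyReLU, LeakyReLU, ReLU), followed by a bicubic-upsampled skip over the input frame. PyTorch is assumed; the exact layer ordering, the channel widths and the use of the upsampled frame as a global skip connection are assumptions of the sketch, and all names are illustrative.

```python
# Hypothetical sketch of the pixel reorganization module of claim 8 and the
# upsampling of claim 9. PyTorch is assumed; layer ordering, channel widths and
# the global bicubic skip are illustrative assumptions, not the patent's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Residual block: two convolutions, one activation, skip connection."""
    def __init__(self, num_feat=64):
        super().__init__()
        self.conv1 = nn.Conv2d(num_feat, num_feat, 3, 1, 1)
        self.conv2 = nn.Conv2d(num_feat, num_feat, 3, 1, 1)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

class PixelReorganization(nn.Module):
    """Reconstruction layer (K residual blocks), then 4x upscaling by two
    pixel-shuffle stages, matching the layer counts recited in claim 8."""
    def __init__(self, num_feat=64, num_blocks=5, out_channels=3):  # K = num_blocks
        super().__init__()
        self.recon = nn.Sequential(*[ResBlock(num_feat) for _ in range(num_blocks)])
        self.conv_up1 = nn.Conv2d(num_feat, num_feat * 4, 3, 1, 1)   # convolution 1
        self.conv_up2 = nn.Conv2d(num_feat, num_feat * 4, 3, 1, 1)   # convolution 2
        self.conv_hr = nn.Conv2d(num_feat, num_feat, 3, 1, 1)        # convolution 3
        self.conv_last = nn.Conv2d(num_feat, out_channels, 3, 1, 1)  # convolution 4
        self.shuffle = nn.PixelShuffle(2)                            # used twice -> 4x
        self.lrelu = nn.LeakyReLU(0.1, inplace=True)                 # activations 1 and 2
        self.relu = nn.ReLU(inplace=True)                            # activation 3

    def forward(self, feat, lr_frame):
        x = self.recon(feat)
        x = self.lrelu(self.shuffle(self.conv_up1(x)))
        x = self.lrelu(self.shuffle(self.conv_up2(x)))
        x = self.relu(self.conv_hr(x))
        x = self.conv_last(x)
        # Claim 9 allows nearest, bilinear or bicubic upsampling in step S7;
        # adding it as a global skip connection is an assumption of this sketch.
        up = F.interpolate(lr_frame, scale_factor=4, mode="bicubic", align_corners=False)
        return x + up
```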
CN202510161817.5A 2025-02-14 2025-02-14 Satellite video super-resolution reconstruction method and system for multiple degradation process Active CN119624782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510161817.5A CN119624782B (en) 2025-02-14 2025-02-14 Satellite video super-resolution reconstruction method and system for multiple degradation process

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510161817.5A CN119624782B (en) 2025-02-14 2025-02-14 Satellite video super-resolution reconstruction method and system for multiple degradation process

Publications (2)

Publication Number Publication Date
CN119624782A true CN119624782A (en) 2025-03-14
CN119624782B CN119624782B (en) 2025-05-02

Family

ID=94908877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510161817.5A Active CN119624782B (en) 2025-02-14 2025-02-14 Satellite video super-resolution reconstruction method and system for multiple degradation process

Country Status (1)

Country Link
CN (1) CN119624782B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468605A (en) * 2023-04-12 2023-07-21 西安电子科技大学 Video super-resolution reconstruction method based on time-space layered mask attention fusion
CN116527833A (en) * 2023-07-03 2023-08-01 清华大学 High-definition video generation method and system based on superdivision model
CN117689541A (en) * 2023-12-07 2024-03-12 重庆邮电大学 Multi-region classification video super-resolution reconstruction method with temporal redundancy optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
卜丽静; 郑新杰; 肖一鸣; 张正鹏: "Super-resolution reconstruction of satellite video images with an improved SA method" (改进SA方法的卫星视频图像超分辨率重建), 测绘科学技术学报 (Journal of Geomatics Science and Technology), no. 01, 3 May 2017 (2017-05-03) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120510485A (en) * 2025-07-22 2025-08-19 清华大学 Multimodal video fusion method, device, equipment and medium for wide-area visual network

Also Published As

Publication number Publication date
CN119624782B (en) 2025-05-02

Similar Documents

Publication Publication Date Title
Yin et al. AFBNet: A Lightweight Adaptive Feature Fusion Module for Super-Resolution Algorithms.
CN111539879A (en) Video blind denoising method and device based on deep learning
CN111028150A (en) A fast spatiotemporal residual attention video super-resolution reconstruction method
CN106127688B (en) A super-resolution image reconstruction method and system thereof
Tang et al. Deep inception-residual Laplacian pyramid networks for accurate single-image super-resolution
Dai et al. FreqFormer: Frequency-aware transformer for lightweight image super-resolution
KR102221225B1 (en) Method and Apparatus for Improving Image Quality
CN115496663B (en) Video super-resolution reconstruction method based on D3D convolutional intra-group fusion network
CN112419150A (en) A Super-resolution Reconstruction Method for Images with Arbitrary Multiples Based on Bilateral Upsampling Network
CN113469884A (en) Video super-resolution method, system, equipment and storage medium based on data simulation
CN110246084A (en) A kind of super-resolution image reconstruction method and its system, device, storage medium
CN111767679B (en) Method and device for processing time-varying vector field data
Chen et al. Single-image super-resolution using multihypothesis prediction
CN118333860B (en) Residual enhancement type frequency space mutual learning face super-resolution method
CN117575915A (en) An image super-resolution reconstruction method, terminal equipment and storage medium
CN110689509A (en) Video super-resolution reconstruction method based on cyclic multi-column 3D convolutional network
CN112435165B (en) Two-stage video super-resolution reconstruction method based on generation countermeasure network
Zhang et al. Non‐local feature back‐projection for image super‐resolution
CN117253126A (en) Mixed architecture image reconstruction method for global fusion cross self-attention network
Sun et al. A rapid and accurate infrared image super-resolution method based on zoom mechanism
CN119151787B (en) Transformer single image super-resolution reconstruction method based on trans-scale token interaction
Zhang et al. Iterative multi‐scale residual network for deblurring
CN118710542A (en) A video deblurring method based on spectral attention and feature shifting
Dong et al. Dynamic scene reconstruction for color spike camera via zero-shot learning
CN119624782B (en) Satellite video super-resolution reconstruction method and system for multiple degradation process

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant