Background
As entertainment options grow richer, people are no longer satisfied with flat image experiences and instead pursue stereoscopic images that offer a more immersive and realistic sense of presence. Compared with a planar (2D) image, a stereoscopic (3D) image conveys depth while displaying image content, creating an immersive atmosphere, and is therefore widely used in fields such as 3D movies, 3D games, and virtual reality. However, during acquisition, compression, and transmission, stereoscopic images may be distorted by unavoidable factors, degrading the viewer's visual experience. It is therefore important to construct a stereoscopic image quality perception model that is highly consistent with human evaluation criteria. In addition, stereoscopic image quality assessment (SIQA) can serve as a component in an image processing system, providing optimization guidance for fields such as stereoscopic image super-resolution and segmentation.
Generally, IQA methods fall into two types: subjective and objective. In a subjective IQA method, human observers evaluate and score image quality under a controlled experimental environment based on their subjective impressions of various image attributes. Since the final receiver of an image is a human, subjective evaluation based on observers' opinion scores reflects image quality most faithfully, has a low technical threshold, and is considered the most reliable approach. Although subjective evaluation is the most accurate, it has many drawbacks: its results are easily affected by the psychological state, cognitive level, and test environment of the subjects, so its practical application scenarios are very limited. Objective IQA methods instead build evaluation models that simulate the human visual system (HVS) to assess image quality, overcoming the limitations of subjective evaluation and better matching practical needs. Objective methods are further classified, according to the availability of the reference image, into full-reference (FR), reduced-reference (RR), and no-reference (NR) methods. Since an ideal reference image is unavailable in many application scenarios, no-reference objective quality evaluation, which requires no reference image, is widely applied.
Currently, research in the SIQA field focuses mostly on NR methods, and numerous NR-SIQA methods have been proposed. Early FR-SIQA methods were adapted from classical FR-2DIQA methods such as the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). However, such methods cannot reflect human visual perception characteristics, do not consider the influence of binocular parallax, and thus cannot be directly applied to stereoscopic image quality evaluation. In addition, some NR-SIQA methods rely on traditional hand-crafted feature extraction based on the human visual system (HVS) or natural scene statistics (NSS). These methods have a high design threshold, require designers with substantial expertise and experience, generalize poorly, and have difficulty adapting to practical situations with diverse distortions.
In recent years, driven by deep learning, many binocular vision models for stereoscopic images based on convolutional neural networks (CNNs) have been developed, further advancing NR-SIQA. However, existing CNN-based SIQA methods pay little attention to the influence of an image's spatial frequency on perceived quality. Most methods consider neither the interaction between high- and low-frequency information nor the different contributions of high and low frequencies to quality perception. According to biological and visual-psychology knowledge, the human visual system processes the high and low spatial frequencies of the same image differently, so extracting features over the entire frequency range may suffer from the non-uniform distribution of information in the spatial domain. In fact, visual information processing first extracts the overall contour of a scene from low-spatial-frequency information and then uses it to guide the extraction of detailed texture features corresponding to high-frequency information. Conversely, strong edges tend to draw more attention and can enhance the perception of surrounding smooth areas. These mechanisms demonstrate the complex interactions between high- and low-frequency information.
Further, since SIQA evaluates stereoscopic images, binocular vision characteristics must generally be considered. Most methods, however, simply concatenate left- and right-view features before quality regression, so the two views undergo no interaction or fusion. Although some methods use the difference and sum of left/right view features, or Siamese networks, to realize left-right interaction, they clearly cannot simulate the relatively complex binocular interaction and fusion process of the human visual system. According to biological and visual-psychology knowledge, the human visual system usually fuses the matched features of the left and right views and then applies a competition mechanism to select and suppress the unmatched features. Matched features between the two views are the key factor stabilizing binocular perception, and aligning binocular features facilitates the integration of binocular information into a unified, coherent visual percept. By simulating this process, stable and accurate binocular visual features can be extracted effectively.
Based on the above, the invention provides a stereoscopic image quality evaluation method based on dual-frequency interaction enhancement and binocular matching.
Disclosure of Invention
The invention aims to provide a stereoscopic image quality evaluation method based on dual-frequency interaction enhancement and binocular matching, so as to solve the problem that prior-art stereoscopic image quality evaluation ignores the differing sensitivity of human eyes to high and low frequencies in images with different distortion types and different semantic content.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A stereoscopic image quality evaluation method based on dual-frequency interaction enhancement and binocular matching comprises the following steps:
cutting the original left-view and right-view images into non-overlapping image blocks of size 32 × 32, converting the left- and right-view image blocks into the frequency domain using the FFT, decomposing them into high- and low-frequency components, and restoring them with the IFFT to obtain high-frequency and low-frequency image blocks of the left and right views;
training a convolutional neural network based on dual-frequency interaction enhancement and binocular matching, the network comprising three main sub-networks: a left-view dual-frequency interaction feature enhancement sub-network, a right-view dual-frequency interaction feature enhancement sub-network, and a binocular matching fusion sub-network, specifically:
since the left- and right-view sub-networks are symmetric, only the left-view dual-frequency interaction feature enhancement sub-network is described here. In this sub-network, a primary feature map is extracted from the high- and low-frequency image blocks of the left view using a group of convolution and pooling operations; a dual-frequency interaction attention module computes the attention weights of the high- and low-frequency signals and interactively weights the high- and low-frequency features; the features are then further extracted by three groups of convolutions; and finally a dual-frequency recombination module produces the high- and low-frequency weighted feature maps;
in the binocular matching fusion sub-network, full-frequency primary feature maps of the left- and right-view image blocks are extracted using a group of convolution and pooling operations; the full-frequency primary feature maps are enhanced with the high/low-frequency information from the dual-frequency interaction attention module to obtain full-frequency enhanced feature maps; and low-frequency binocular registration and high-frequency binocular registration are then performed in sequence on the enhanced full-frequency features of the left and right views;
the binocular fusion feature map is concatenated with the high/low-frequency weighted feature maps of the left view and the right view, and an objective prediction score is obtained through a fully connected layer; during training, the network uses an L1 loss function to measure the difference between the network-predicted image quality score and the true image quality score, guiding model optimization.
And thirdly, performing image quality evaluation with the trained network: the stereoscopic image to be evaluated is sliced into blocks and input into the network, and the quality scores output by the network for all image blocks of each stereoscopic image are averaged to obtain the quality score of the whole image.
Preferably, the processing flow of the dual-frequency interaction attention module in the second step specifically includes the following:
The low-frequency primary features are passed in parallel through three convolution layers with 1 × 1 kernels to obtain three feature maps Q, K, and V; Q and K undergo dimension transformation followed by matrix multiplication, and a SoftMax layer then yields a low-frequency attention map. The same operations on the high-frequency features yield a high-frequency attention map. The low-frequency feature map V is then matrix-multiplied with the high-frequency attention map to obtain a low-frequency feature map enhanced by high-frequency information, and the same operation on the high-frequency primary features yields a high-frequency feature map enhanced by low-frequency information.
Preferably, the processing flow of the dual-frequency reorganization module in the second step specifically includes the following contents:
The low-frequency and high-frequency advanced features are differenced and summed, the results are concatenated and fed into a residual module (ResBlock) consisting of two convolution layers with one residual connection for further feature extraction; the features then pass through a global average pooling (GAP) layer and a fully connected layer to obtain adaptive weights; finally, the weights are normalized and used to weight the low- and high-frequency advanced features, yielding the high- and low-frequency weighted features.
Preferably, the binocular matching fusion module in the second step consists of a binocular progressive registration module and a binocular competition selection module, specifically as follows:
① The binocular progressive registration module consists of a low-frequency binocular registration unit and a high-frequency binocular registration unit;
In the low-frequency binocular registration unit, the low-frequency primary features of the left and right views are concatenated with their difference features and fed into a convolution layer with a 1 × 1 kernel to obtain a low-frequency offset; the full-frequency primary features of the left and right views, together with the low-frequency offset, are fed into a deformable convolution to obtain left- and right-view features registered at the low-frequency scale;
in the high-frequency binocular registration unit, the high-frequency primary features of the left and right views are concatenated with their difference features and fed into a convolution layer with a 1 × 1 kernel to obtain a high-frequency offset; the left- and right-view features registered at the low-frequency scale, together with the high-frequency offset, are fed into a deformable convolution to obtain left- and right-view features registered at both the high- and low-frequency scales;
② The binocular competition selection module consists of a spatial dimension selection block and a channel dimension selection block; the registration features obtained by the binocular progressive registration module are added to the full-frequency primary features to form the module's input;
In the spatial dimension selection block, the input left-view and right-view features first pass through a convolution layer with a 1 × 1 kernel to obtain two groups of feature maps; after dimension transformation, the two groups are matrix-multiplied and passed through a SoftMax layer to obtain parallax attention maps, which are then matrix-multiplied with the left-view and right-view features respectively, weighting them in the spatial dimension.
In the channel dimension selection block, the spatially weighted features are fed into a global average pooling layer to obtain left- and right-view channel attention maps; the channel attention maps are normalized and point-multiplied with the spatially weighted features of the left and right views respectively; finally, the channel-weighted features of the two views are added to obtain the binocular fusion features.
Compared with the prior art, the stereoscopic image quality evaluation method based on dual-frequency interaction enhancement and binocular matching provided by the invention has the following beneficial effects:
The invention fully exploits information at different spatial frequencies of the image for feature enhancement, simulating the frequency-dependent sensitivity of human visual quality perception; by simulating the binocular fusion and binocular competition mechanisms of the human visual system, with different-spatial-frequency information assisting the binocular fusion, it effectively extracts stereoscopic visual features consistent with human visual quality perception. The evaluation results of the method are highly consistent with subjective human evaluation and are of significant value.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
Example 1
The embodiment provides a stereoscopic image quality evaluation method based on dual-frequency interaction enhancement and binocular matching, as shown in fig. 1, the method comprises the following steps:
S101, preprocessing and data augmentation of the original stereoscopic images for training
The original left- and right-view images are cut into 220 non-overlapping image blocks of size 32 × 32, and the quality score of each image block is set to the true quality score of the original image. The left- and right-view image blocks are converted into the frequency domain using the FFT, decomposed into high- and low-frequency components, and restored with the IFFT to obtain high- and low-frequency image blocks for the left and right views;
S102, training convolutional neural network based on dual-frequency interaction enhancement and binocular matching
A convolutional neural network based on dual-frequency interaction enhancement and binocular matching is constructed; the network consists of a left-view dual-frequency interaction feature enhancement sub-network, a right-view dual-frequency interaction feature enhancement sub-network, and a binocular matching fusion sub-network, specifically as follows:
Since the left- and right-view sub-networks are symmetric, only the left-view dual-frequency interaction feature enhancement sub-network is described here. In this sub-network, a primary feature map is extracted from the high- and low-frequency image blocks of the left view using a group of convolution (3 × 3 kernels) and pooling operations; the dual-frequency interaction attention module computes the attention weights of the high- and low-frequency signals and interactively weights the high- and low-frequency features; the features are then further extracted by three groups of convolutions with 3 × 3 kernels; and finally the dual-frequency recombination module produces the high- and low-frequency weighted feature maps;
in the binocular matching fusion sub-network, full-frequency primary feature maps of the left- and right-view image blocks are extracted using a group of convolution and pooling operations; the full-frequency primary feature maps are enhanced with the high/low-frequency information from the dual-frequency interaction attention module to obtain full-frequency enhanced feature maps; and low-frequency binocular registration and high-frequency binocular registration are then performed in sequence on the enhanced full-frequency features of the left and right views;
the binocular fusion feature map is concatenated with the high/low-frequency weighted feature maps of the left view and the right view, and an objective prediction score is obtained through a fully connected layer; during training, the network uses an L1 loss function to measure the difference between the network-predicted image quality score and the true image quality score, guiding model optimization.
S103, using the trained network to evaluate the quality of the stereoscopic image
The stereoscopic image to be evaluated is sliced into blocks and input into the network, and the quality scores output by the network for all image blocks of each stereoscopic image are averaged to obtain the quality score of the whole image.
Example 2:
This embodiment builds on embodiment 1, with the difference that the scheme of embodiment 1 is further described below in connection with specific calculation formulas and example data. Since the left- and right-view sub-networks are symmetric, only one sub-network implementation is described here. The details are as follows:
Step 1: preprocessing the original stereoscopic images for training
The original left- and right-view images are cut into 220 non-overlapping image blocks of size 32 × 32, and the quality score of each image block is set to the true quality score of the original image. Each image block I is converted into the frequency domain for high/low-frequency decomposition using the FFT (fast Fourier transform), and the high- and low-frequency image blocks are recovered with the IFFT (inverse fast Fourier transform):

$$z = \mathrm{FFT}(I), \quad z_{lf},\, z_{hf} = t(z; r), \quad I_{lf} = \mathrm{IFFT}(z_{lf}), \quad I_{hf} = \mathrm{IFFT}(z_{hf})$$

where z represents the frequency components of the sample and t(·; r) is a threshold function that separates the low- and high-frequency components of z according to the hyper-parameter radius r.
To simplify the description, we first consider decomposing a single-channel image into high and low frequencies; for the three-channel images used herein, the operation is performed independently on each channel. A single-channel image is defined as $I \in \mathbb{N}^{n\times n}$, and its frequency domain representation as $z \in \mathbb{C}^{n\times n}$, where $\mathbb{C}$ denotes the complex numbers. The threshold function $z_{lf}, z_{hf} = t(z; r)$ is implemented as:

$$z_{lf}(i,j) = \begin{cases} z(i,j), & d\big((i,j),(c_i,c_j)\big) \le r \\ 0, & \text{otherwise} \end{cases}, \qquad z_{hf} = z - z_{lf}$$

where z(i, j) is the value at position (i, j), $(c_i, c_j)$ is the centroid of the spectrum, and d(·, ·) is the Euclidean distance.
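As an illustration, the decomposition above can be sketched in a few lines of Python; the function name, the fftshift-centered spectrum layout, and the example radius are assumptions for exposition, not part of the disclosure.

```python
# Minimal NumPy sketch of t(z; r): keep frequencies within radius r of the
# spectrum centroid as the low-frequency part; the remainder is high frequency.
import numpy as np

def decompose_hf_lf(img: np.ndarray, r: float):
    """Split one (n, n) single-channel image into low/high-frequency images."""
    n = img.shape[0]
    z = np.fft.fftshift(np.fft.fft2(img))            # spectrum with DC at center
    ci, cj = n // 2, n // 2                          # centroid (c_i, c_j)
    ii, jj = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    dist = np.sqrt((ii - ci) ** 2 + (jj - cj) ** 2)  # Euclidean distance d
    z_lf = np.where(dist <= r, z, 0)                 # t(z; r): keep inside radius
    z_hf = z - z_lf
    i_lf = np.real(np.fft.ifft2(np.fft.ifftshift(z_lf)))  # IFFT restoration
    i_hf = np.real(np.fft.ifft2(np.fft.ifftshift(z_hf)))
    return i_lf, i_hf

# For a three-channel 32 x 32 block, apply the decomposition per channel.
block = np.random.rand(32, 32, 3)
lf = np.stack([decompose_hf_lf(block[..., k], r=8)[0] for k in range(3)], axis=-1)
```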
Step 2, training convolutional neural network based on dual-frequency interaction enhancement and binocular matching
This embodiment constructs a convolutional neural network based on dual-frequency interaction enhancement and binocular matching; the network consists of a left-view dual-frequency interaction feature enhancement sub-network, a right-view dual-frequency interaction feature enhancement sub-network, and a binocular matching fusion sub-network, as follows:
Since the left- and right-view sub-networks are symmetric, only the left-view dual-frequency interaction feature enhancement sub-network is described here.
In the left-view dual-frequency interaction feature enhancement sub-network, the high- and low-frequency image blocks are fed into a group of convolutions with 3 × 3 kernels to extract primary features. The dual-frequency interaction attention module DFIA computes the attention weights of the high- and low-frequency signals and interactively weights the high- and low-frequency features; the features are then further extracted by three groups of convolutions with 3 × 3 kernels; and finally the dual-frequency recombination module DFR produces the high- and low-frequency weighted feature maps;
As shown in fig. 1, the dual-frequency interaction attention module DFIA first passes the primary features $F_{lf}, F_{hf} \in \mathbb{R}^{C\times H\times W}$ through three convolution layers with 1 × 1 kernels each, obtaining two groups of feature maps $\{Q_{lf}, K_{lf}, V_{lf}\}$ and $\{Q_{hf}, K_{hf}, V_{hf}\}$. The feature maps K and Q of each group are then reshaped to $\mathbb{R}^{C\times N}$, where N = H × W. Matrix multiplication is performed between the transpose of Q and K, and a SoftMax layer computes the spatial attention maps, yielding $\mathrm{Att}_{lf}$ and $\mathrm{Att}_{hf}$ from $\{Q_{lf}, K_{lf}\}$ and $\{Q_{hf}, K_{hf}\}$ respectively. Finally, the spatial attention map $\mathrm{Att}_{lf}$ ($\mathrm{Att}_{hf}$) is matrix-multiplied with the feature map $V_{hf}$ ($V_{lf}$) to realize the interactive weighting of the high- and low-frequency features. The whole process can be expressed as:

$$\mathrm{Att}_{lf} = \mathrm{SoftMax}\!\left(Q_{lf}^{T} \otimes K_{lf}\right), \qquad \mathrm{Att}_{hf} = \mathrm{SoftMax}\!\left(Q_{hf}^{T} \otimes K_{hf}\right)$$

$$IF_{hf} = V_{hf} \otimes \mathrm{Att}_{lf}, \qquad IF_{lf} = V_{lf} \otimes \mathrm{Att}_{hf}$$

where ⊗ denotes matrix multiplication and $\{IF_{hf}, IF_{lf}\}$ are the high- and low-frequency features after interaction enhancement.
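A minimal PyTorch sketch of the DFIA computation follows; the class name, the per-branch 1 × 1 projections, and the channel count c are illustrative assumptions consistent with the formulas above, not the exact implementation.

```python
import torch
import torch.nn as nn

class DFIA(nn.Module):
    """Dual-frequency interaction attention (sketch)."""
    def __init__(self, c):
        super().__init__()
        self.q_lf, self.k_lf, self.v_lf = (nn.Conv2d(c, c, 1) for _ in range(3))
        self.q_hf, self.k_hf, self.v_hf = (nn.Conv2d(c, c, 1) for _ in range(3))

    @staticmethod
    def spatial_attention(q, k):
        b, c, h, w = q.shape
        q = q.view(b, c, h * w)                       # reshape to C x N, N = H*W
        k = k.view(b, c, h * w)
        return torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # N x N attention map

    def forward(self, f_lf, f_hf):
        b, c, h, w = f_lf.shape
        att_lf = self.spatial_attention(self.q_lf(f_lf), self.k_lf(f_lf))
        att_hf = self.spatial_attention(self.q_hf(f_hf), self.k_hf(f_hf))
        v_lf = self.v_lf(f_lf).view(b, c, h * w)
        v_hf = self.v_hf(f_hf).view(b, c, h * w)
        if_lf = (v_lf @ att_hf).view(b, c, h, w)  # IF_lf: V_lf weighted by Att_hf
        if_hf = (v_hf @ att_lf).view(b, c, h, w)  # IF_hf: V_hf weighted by Att_lf
        return if_lf, if_hf
```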
As shown in fig. 1, the dual-frequency recombination module DFR consists of a ResBlock, a GAP layer, and a fully connected layer, where the ResBlock contains two convolution layers and one residual connection. The difference and sum of the interactively weighted low- and high-frequency advanced features are concatenated and fed into the DFR module to obtain the weights $W_{hf}$ and $W_{lf}$.
Finally, the obtained weights are normalized and used to weight the low- and high-frequency advanced features. The feature weighting is formulated as follows:

$$W_{hf},\, W_{lf} = \mathrm{FC}\!\left(\mathrm{GAP}\!\left(\mathrm{ResBlock}\!\left([IF_{hf} - IF_{lf},\; IF_{hf} + IF_{lf}]\right)\right)\right)$$

$$\tilde{F}_{hf} = \overline{W}_{hf} \cdot IF_{hf}, \qquad \tilde{F}_{lf} = \overline{W}_{lf} \cdot IF_{lf}$$

where $\{IF_{hf}, IF_{lf}\}$ are the input features of the DFR, $\overline{W}$ denotes the normalized weights, and $\{\tilde{F}_{hf}, \tilde{F}_{lf}\}$ are the final single-view frequency recombination features.
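A PyTorch sketch of the DFR module under the same assumptions follows: ResBlock on the cascaded difference and sum, GAP, a fully connected layer producing two weights, SoftMax as the (assumed) normalization, then branch-wise weighting.

```python
import torch
import torch.nn as nn

class DFR(nn.Module):
    """Dual-frequency recombination (sketch)."""
    def __init__(self, c):
        super().__init__()
        self.res = nn.Sequential(                    # ResBlock: two convs + residual
            nn.Conv2d(2 * c, 2 * c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(2 * c, 2 * c, 3, padding=1))
        self.fc = nn.Linear(2 * c, 2)                # adaptive weights W_hf, W_lf

    def forward(self, if_lf, if_hf):
        x = torch.cat([if_hf - if_lf, if_hf + if_lf], dim=1)  # cascade diff and sum
        x = x + self.res(x)                          # residual connection
        w = torch.softmax(self.fc(x.mean(dim=(2, 3))), dim=1) # GAP -> FC -> normalize
        w_hf = w[:, 0, None, None, None]
        w_lf = w[:, 1, None, None, None]
        return w_hf * if_hf, w_lf * if_lf            # weighted high/low-frequency features
```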
In the binocular matching fusion sub-network, a group of convolution and pooling operations extracts the full-frequency primary feature maps of the left- and right-view image blocks. To highlight contours and details for better registration, the full-frequency primary feature map is enhanced with the high/low-frequency information from the dual-frequency interaction attention module, i.e., the full-frequency enhanced feature map is obtained by matrix-multiplying $S_{lf}$ and $S_{hf}$ with $V_{ff}$ respectively. The process is represented by the following formula:

$$EF_{ff} = V_{ff} \otimes S_{lf} + V_{ff} \otimes S_{hf}$$
Low-frequency binocular registration and high-frequency binocular registration are then performed in sequence on the enhanced left- and right-view full-frequency features. Finally, a binocular competition selection module performs feature screening in the spatial and channel dimensions to obtain the binocular fusion feature map, and features are further extracted through two groups of convolutions. The binocular matching fusion module consists of a binocular progressive registration module and a binocular competition selection module, implemented as follows:
① As shown in fig. 2, the binocular progressive registration module is composed of a low-frequency binocular registration unit and a high-frequency binocular registration unit;
In the low-frequency binocular registration unit, the low-frequency primary features of the left and right views are concatenated with their difference features and fed into a convolution layer with a 1 × 1 kernel to obtain the low-frequency offset; the full-frequency primary features of the left and right views, together with the low-frequency offset, are input into a deformable convolution to obtain the left- and right-view features registered at the low-frequency scale;
in the high-frequency binocular registration unit, the high-frequency primary features of the left and right views are concatenated with their difference features and fed into a convolution layer with a 1 × 1 kernel to obtain the high-frequency offset; the left- and right-view features registered at the low-frequency scale, together with the high-frequency offset, are input into a deformable convolution to obtain the left- and right-view features registered at both the high- and low-frequency scales;
The process is expressed as the following formulas:

$$O_{lf} = \mathrm{Conv}\!\left([F^{lf}_{left},\, F^{lf}_{right},\, F^{lf}_{left} - F^{lf}_{right}]\right), \qquad AF^{lf}_{v} = \mathrm{DeConv}\!\left(F^{ff}_{v},\, O_{lf}\right),\; v \in \{left, right\}$$

$$O_{hf} = \mathrm{Conv}\!\left([F^{hf}_{left},\, F^{hf}_{right},\, F^{hf}_{left} - F^{hf}_{right}]\right), \qquad AF_{v} = \mathrm{DeConv}\!\left(AF^{lf}_{v},\, O_{hf}\right)$$

where Conv is a convolution with a 1 × 1 kernel, $O_{lf}$ and $O_{hf}$ are the deformable convolution offsets, DeConv is the deformable convolution operation, and $AF_{left}$ and $AF_{right}$ are the aligned features.
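A minimal PyTorch sketch of the progressive registration using torchvision's deformable convolution follows; the class name, the channel count c, and a single offset field shared by both views are assumptions, since the disclosure leaves these details open.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ProgressiveRegistration(nn.Module):
    """Binocular progressive registration (sketch)."""
    def __init__(self, c):
        super().__init__()
        # 1x1 convs map [left, right, left - right] (3c channels) to the
        # 2*3*3 = 18 offset channels required by a 3x3 deformable conv.
        self.off_lf = nn.Conv2d(3 * c, 18, kernel_size=1)
        self.off_hf = nn.Conv2d(3 * c, 18, kernel_size=1)
        self.dcn_lf = DeformConv2d(c, c, kernel_size=3, padding=1)
        self.dcn_hf = DeformConv2d(c, c, kernel_size=3, padding=1)

    def forward(self, lf_l, lf_r, hf_l, hf_r, ff_l, ff_r):
        # Low-frequency offset O_lf from the cascaded low-frequency features.
        o_lf = self.off_lf(torch.cat([lf_l, lf_r, lf_l - lf_r], dim=1))
        a_l = self.dcn_lf(ff_l, o_lf)   # register full-frequency features
        a_r = self.dcn_lf(ff_r, o_lf)   # at the low-frequency scale
        # High-frequency offset O_hf refines the low-scale registration.
        o_hf = self.off_hf(torch.cat([hf_l, hf_r, hf_l - hf_r], dim=1))
        return self.dcn_hf(a_l, o_hf), self.dcn_hf(a_r, o_hf)  # AF_left, AF_right
```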
② As shown in fig. 3, the binocular competition selection module consists of a spatial dimension selection block and a channel dimension selection block; the aligned features $\{AF_{left}, AF_{right}\}$ obtained by the binocular progressive registration module are added to the full-frequency features to form the input $\{MF_{left}, MF_{right}\}$;
In the spatial dimension selection block, the binocular features $\{MF_{left}, MF_{right}\}$ are first passed through a convolution layer with a 1 × 1 kernel to obtain $F_l$ and $F_r$; the transpose of $F_l$ is matrix-multiplied with $F_r$, and a SoftMax layer yields the parallax attention maps $\{AM_{R\to L}, AM_{L\to R}\}$, which are then matrix-multiplied with the module inputs $MF_{left}$ and $MF_{right}$ respectively:

$$AM_{R\to L} = \mathrm{SoftMax}\!\left(F_l^{T} \otimes F_r\right), \qquad AM_{L\to R} = \mathrm{SoftMax}\!\left(F_r^{T} \otimes F_l\right)$$

$$SF_{left} = MF_{left} \otimes AM_{R\to L}, \qquad SF_{right} = MF_{right} \otimes AM_{L\to R}$$

where ⊗ is matrix multiplication, T denotes the transpose, and $SF_{left}$ and $SF_{right}$ are the features weighted by the parallax attention.
In the channel dimension selection block, $SF_{left}$ and $SF_{right}$ are fed into a global average pooling layer to obtain the channel attention maps $\{AM_{left}, AM_{right}\}$, which are normalized at the pixel level:

$$\overline{AM}_{left} = \frac{AM_{left}}{AM_{left} + AM_{right}}, \qquad \overline{AM}_{right} = \frac{AM_{right}}{AM_{left} + AM_{right}}$$

The normalized channel attention maps are then point-multiplied with $SF_{left}$ and $SF_{right}$, and the results are added to yield the binocular fusion feature $F_b$:

$$F_b = \overline{AM}_{left} \odot SF_{left} + \overline{AM}_{right} \odot SF_{right}$$
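A PyTorch sketch of the whole competition selection (parallax attention in the spatial dimension, then channel attention) follows; the class name and the mutual normalization of the two channel attention maps are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CompetitionSelection(nn.Module):
    """Binocular competition selection (sketch)."""
    def __init__(self, c):
        super().__init__()
        self.proj_l = nn.Conv2d(c, c, kernel_size=1)
        self.proj_r = nn.Conv2d(c, c, kernel_size=1)

    def forward(self, mf_l, mf_r):
        b, c, h, w = mf_l.shape
        f_l = self.proj_l(mf_l).view(b, c, h * w)
        f_r = self.proj_r(mf_r).view(b, c, h * w)
        # Parallax attention maps AM_{R->L} and AM_{L->R}.
        am_rl = torch.softmax(f_l.transpose(1, 2) @ f_r, dim=-1)
        am_lr = torch.softmax(f_r.transpose(1, 2) @ f_l, dim=-1)
        # Spatial-dimension weighting: SF_left, SF_right.
        sf_l = (mf_l.view(b, c, -1) @ am_rl).view(b, c, h, w)
        sf_r = (mf_r.view(b, c, -1) @ am_lr).view(b, c, h, w)
        # Channel attention via global average pooling, then normalization.
        am_l = sf_l.mean(dim=(2, 3), keepdim=True)
        am_r = sf_r.mean(dim=(2, 3), keepdim=True)
        s = am_l + am_r + 1e-8
        return (am_l / s) * sf_l + (am_r / s) * sf_r  # fused binocular feature F_b
```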
Further, $F_b$ is fed into two groups of convolution and pooling layers to extract the final binocular fusion feature $F_{bino}$.
Finally, the binocular fusion feature $F_{bino}$ is concatenated with the left- and right-view high/low-frequency weighted features $\{\tilde{F}^{left}_{hf}, \tilde{F}^{left}_{lf}\}$ and $\{\tilde{F}^{right}_{hf}, \tilde{F}^{right}_{lf}\}$, and the objective prediction score is obtained through a fully connected layer. During training, the network model adopts an L1 loss function to measure the difference between the network-predicted image quality score and the true quality score, guiding model optimization.
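A sketch of the regression head and training loss under stated assumptions: only the concatenation, the fully connected layer, and the L1 loss follow the text; the flattened feature vectors and the dimension parameter are illustrative.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Quality regression (sketch): splice features, then predict a score."""
    def __init__(self, feat_dim):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)  # feat_dim = combined length of all features

    def forward(self, f_bino, f_left, f_right):
        x = torch.cat([f_bino, f_left, f_right], dim=1)  # splice the features
        return self.fc(x).squeeze(1)                     # objective prediction score

criterion = nn.L1Loss()  # loss = criterion(predicted_scores, true_scores)
```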
Step 3: image quality assessment using the trained network
The stereoscopic image to be evaluated is sliced into blocks and input into the network, and the quality scores output by the network for all image blocks of each stereoscopic image are averaged to obtain the quality score of the whole image.
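A minimal inference sketch; the model interface (one score per 32 × 32 patch pair) is an assumption for illustration.

```python
import torch

@torch.no_grad()
def predict_image_score(model, left_patches, right_patches):
    """left_patches, right_patches: (N, 3, 32, 32) tensors holding all
    non-overlapping blocks of one stereoscopic image pair."""
    patch_scores = model(left_patches, right_patches)  # (N,) per-patch scores
    return patch_scores.mean().item()                  # whole-image quality score
```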
Example 3:
This embodiment builds on embodiments 1-2, with the difference that the schemes of embodiments 1 and 2 are validated through specific experiments, as follows:
To verify the performance of the proposed method, we tested it on four datasets (LIVE I, LIVE II, WIVC I, and WIVC II). Two widely used indices were chosen: the Spearman rank-order correlation coefficient (SROCC/SRCC) and the Pearson linear correlation coefficient (PLCC). SROCC evaluates the monotonicity of the method's predictions, while PLCC describes the linear correlation between predicted values and true values. For both PLCC and SROCC, larger values indicate better performance.
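The two indices can be computed with SciPy as follows; `pred` and `mos` denote illustrative arrays of predicted scores and subjective ground-truth scores for one test set.

```python
from scipy.stats import spearmanr, pearsonr

def evaluate(pred, mos):
    srocc = spearmanr(pred, mos).correlation  # monotonicity of predictions
    plcc, _ = pearsonr(pred, mos)             # linear correlation with ground truth
    return srocc, plcc
```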
To verify the performance of the present invention, seven mainstream no-reference stereoscopic image quality assessment methods were selected for comparison on the LIVE 3D and Waterloo IVC 3D databases. These are CNN-based evaluation methods (Fang, Liu, Yang, Zhou, Sim, Si, and Chang et al.). The results are shown in Tables 1 and 2, where the two best-performing results in each column are shown in bold.
Table 1. Comparison of method performance on the LIVE 3D image database
Table 2. Comparison of method performance on the Waterloo IVC 3D image database
As can be seen from Tables 1 and 2, the performance indicators SROCC and PLCC of the present invention lead all compared methods on both the LIVE 3D and Waterloo IVC 3D image databases. The invention achieves excellent results with good generalization, and the model's predictions are highly consistent with human subjective scores.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise forms disclosed; any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope.