Background
As entertainment options grow richer, people are no longer satisfied with flat image experiences and instead pursue stereoscopic images that offer a more immersive and realistic sense of presence. Compared with a planar (2D) image, a stereoscopic (3D) image conveys depth while displaying image content, creating an immersive atmosphere, and is therefore widely used in fields such as 3D movies, 3D games, and virtual reality. However, during acquisition, compression, and transmission, stereoscopic images may be distorted by unavoidable factors, degrading the viewer's visual experience. It is therefore important to construct a stereoscopic image quality perception model that is highly consistent with human evaluation criteria. In addition, stereoscopic image quality assessment (SIQA) can serve as a component in an image processing system, providing optimization guidance for fields such as stereoscopic image super-resolution and segmentation.
Generally, IQA methods fall into two types: subjective and objective. In a subjective IQA method, human observers evaluate and score image quality under a controlled experimental environment based on their subjective impressions of various image attributes. Since the final receiver of an image is a human, subjective evaluation based on observers' opinion scores reflects image quality most faithfully, has a low technical threshold, and is considered the most reliable approach. Although subjective evaluation is the most accurate, it has many drawbacks: its results are easily affected by the psychological state, cognitive level, and test environment of the subjects, so its practical application scenarios are very limited. Objective IQA methods instead build evaluation models that simulate the human visual system (HVS) to assess image quality, overcoming the limitations of subjective evaluation and better matching practical needs. Objective methods are further classified, according to the availability of the reference image, into full-reference (FR), reduced-reference (RR), and no-reference (NR) methods. Since an ideal reference image is unavailable in many application scenarios, no-reference objective quality evaluation, which requires no reference image, is widely applied.
Currently, research in the SIQA field focuses mostly on NR methods, and numerous NR-SIQA methods have been proposed. Early FR-SIQA methods were adapted from classical FR-2DIQA methods such as the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). However, such methods cannot reflect human visual perception characteristics, do not consider the influence of binocular parallax, and thus cannot be directly applied to stereoscopic image quality evaluation. In addition, some NR-SIQA methods rely on traditional hand-crafted feature extraction based on the human visual system (HVS) or natural scene statistics (NSS). These methods have a high design threshold, require designers with substantial expertise and experience, generalize poorly, and have difficulty adapting to practical situations with diverse distortions.
In recent years, driven by deep learning, many binocular vision models for stereoscopic images based on convolutional neural networks (CNNs) have been developed, further advancing NR-SIQA. However, existing CNN-based SIQA methods pay little attention to the influence of an image's spatial frequency on perceived quality. Most methods consider neither the interaction between high- and low-frequency information nor the different contributions of high and low frequencies to quality perception. According to biological and visual-psychology knowledge, the human visual system processes the high and low spatial frequencies of the same image differently, so extracting features over the entire frequency range may suffer from the non-uniform distribution of information in the spatial domain. In fact, visual information processing first extracts the overall contour of a scene from low-spatial-frequency information and then uses it to guide the extraction of detailed texture features corresponding to high-frequency information. Conversely, strong edges tend to draw more attention and can enhance the perception of surrounding smooth areas. These mechanisms demonstrate the complex interactions between high- and low-frequency information.
Further, since SIQA evaluates stereoscopic images, binocular vision characteristics must generally be considered. Most methods, however, simply concatenate left- and right-view features before quality regression, so the two views undergo no interaction or fusion. Although some methods use the difference and sum of left/right view features, or Siamese networks, to realize left-right interaction, they clearly cannot simulate the relatively complex binocular interaction and fusion process of the human visual system. According to biological and visual-psychology knowledge, the human visual system usually fuses the matched features of the left and right views and then applies a competition mechanism to select and suppress the unmatched features. Matched features between the two views are the key factor stabilizing binocular perception, and aligning binocular features facilitates the integration of binocular information into a unified, coherent visual percept. By simulating this process, stable and accurate binocular visual features can be extracted effectively.
Based on the above, the invention provides a stereoscopic image quality evaluation method based on dual-frequency interaction enhancement and binocular matching.
Disclosure of Invention
The invention aims to provide a stereoscopic image quality evaluation method based on dual-frequency interaction enhancement and binocular matching, so as to solve the problem that prior-art stereoscopic image quality evaluation ignores the differing sensitivity of human eyes to high and low frequencies in images with different distortion types and different semantic content.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A stereoscopic image quality evaluation method based on dual-frequency interaction enhancement and binocular matching comprises the following steps:
cutting the original left-view and right-view images into non-overlapping image blocks of size 32 × 32, converting the left- and right-view image blocks into the frequency domain using the FFT, decomposing them into high- and low-frequency components, and restoring them with the IFFT to obtain high-frequency and low-frequency image blocks of the left and right views;
training a convolutional neural network based on dual-frequency interaction enhancement and binocular matching, the network comprising three main sub-networks: a left-view dual-frequency interaction feature enhancement sub-network, a right-view dual-frequency interaction feature enhancement sub-network, and a binocular matching fusion sub-network, specifically:
since the left- and right-view sub-networks are symmetric, only the left-view dual-frequency interaction feature enhancement sub-network is described here. In this sub-network, a primary feature map is extracted from the high- and low-frequency image blocks of the left view using a group of convolution and pooling operations; a dual-frequency interaction attention module computes the attention weights of the high- and low-frequency signals and interactively weights the high- and low-frequency features; the features are then further extracted by three groups of convolutions; and finally a dual-frequency recombination module produces the high- and low-frequency weighted feature maps;
in the binocular matching fusion sub-network, full-frequency primary feature maps of the left- and right-view image blocks are extracted using a group of convolution and pooling operations; the full-frequency primary feature maps are enhanced with the high/low-frequency information from the dual-frequency interaction attention module to obtain full-frequency enhanced feature maps; and low-frequency binocular registration and high-frequency binocular registration are then performed in sequence on the enhanced full-frequency features of the left and right views;
the binocular fusion feature map is concatenated with the high/low-frequency weighted feature maps of the left view and the right view, and an objective prediction score is obtained through a fully connected layer; during training, the network uses an L1 loss function to measure the difference between the network-predicted image quality score and the true image quality score, guiding model optimization.
And thirdly, performing image quality evaluation with the trained network: the stereoscopic image to be evaluated is sliced into blocks and input into the network, and the quality scores output by the network for all image blocks of each stereoscopic image are averaged to obtain the quality score of the whole image.
Preferably, the processing flow of the dual-frequency interaction attention module in the second step specifically includes the following:
The low-frequency primary features are passed in parallel through three convolution layers with 1 × 1 kernels to obtain three feature maps Q, K, and V; Q and K undergo dimension transformation followed by matrix multiplication, and a SoftMax layer then yields a low-frequency attention map. The same operations on the high-frequency features yield a high-frequency attention map. The low-frequency feature map V is then matrix-multiplied with the high-frequency attention map to obtain a low-frequency feature map enhanced by high-frequency information, and the same operation on the high-frequency primary features yields a high-frequency feature map enhanced by low-frequency information.
Preferably, the processing flow of the dual-frequency reorganization module in the second step specifically includes the following contents:
The low-frequency and high-frequency advanced features are differenced and summed, the results are concatenated and fed into a residual module (ResBlock) consisting of two convolution layers with one residual connection for further feature extraction; the features then pass through a global average pooling (GAP) layer and a fully connected layer to obtain adaptive weights; finally, the weights are normalized and used to weight the low- and high-frequency advanced features, yielding the high- and low-frequency weighted features.
Preferably, the binocular matching fusion module in the second step consists of a binocular progressive registration module and a binocular competition selection module, specifically as follows:
① The binocular progressive registration module consists of a low-frequency binocular registration unit and a high-frequency binocular registration unit;
In the low-frequency binocular registration unit, the low-frequency primary features of the left and right views are concatenated with their difference features and fed into a convolution layer with a 1 × 1 kernel to obtain a low-frequency offset; the full-frequency primary features of the left and right views, together with the low-frequency offset, are fed into a deformable convolution to obtain left- and right-view features registered at the low-frequency scale;
in the high-frequency binocular registration unit, the high-frequency primary features of the left and right views are concatenated with their difference features and fed into a convolution layer with a 1 × 1 kernel to obtain a high-frequency offset; the left- and right-view features registered at the low-frequency scale, together with the high-frequency offset, are fed into a deformable convolution to obtain left- and right-view features registered at both the high- and low-frequency scales;
② The binocular competition selection module consists of a spatial dimension selection block and a channel dimension selection block; the registration features obtained by the binocular progressive registration module are added to the full-frequency primary features to form the module's input;
In the spatial dimension selection block, the input left-view and right-view features first pass through a convolution layer with a 1 × 1 kernel to obtain two groups of feature maps; after dimension transformation, the two groups are matrix-multiplied and passed through a SoftMax layer to obtain parallax attention maps, which are then matrix-multiplied with the left-view and right-view features respectively, weighting them in the spatial dimension.
In the channel dimension selection block, the spatially weighted features are fed into a global average pooling layer to obtain left- and right-view channel attention maps; the channel attention maps are normalized and point-multiplied with the spatially weighted features of the left and right views respectively; finally, the channel-weighted features of the two views are added to obtain the binocular fusion features.
Compared with the prior art, the stereoscopic image quality evaluation method based on dual-frequency interaction enhancement and binocular matching provided by the invention has the following beneficial effects:
The invention fully exploits information at different spatial frequencies of the image for feature enhancement, simulating the frequency-dependent sensitivity of human visual quality perception; by simulating the binocular fusion and binocular competition mechanisms of the human visual system, with different-spatial-frequency information assisting the binocular fusion, it effectively extracts stereoscopic visual features consistent with human visual quality perception. The evaluation results of the method are highly consistent with subjective human evaluation and are of significant value.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
Example 1
The embodiment provides a stereoscopic image quality evaluation method based on dual-frequency interaction enhancement and binocular matching, as shown in fig. 1, the method comprises the following steps:
S101, preprocessing and data augmentation of the original stereoscopic images for training
The original left- and right-view images are cut into 220 non-overlapping image blocks of size 32 × 32, and the quality score of each image block is set to the true quality score of the original image. The left- and right-view image blocks are converted into the frequency domain using the FFT, decomposed into high- and low-frequency components, and restored with the IFFT to obtain high- and low-frequency image blocks for the left and right views;
S102, training convolutional neural network based on dual-frequency interaction enhancement and binocular matching
A convolutional neural network based on dual-frequency interaction enhancement and binocular matching is constructed; the network consists of a left-view dual-frequency interaction feature enhancement sub-network, a right-view dual-frequency interaction feature enhancement sub-network, and a binocular matching fusion sub-network, specifically as follows:
Since the left- and right-view sub-networks are symmetric, only the left-view dual-frequency interaction feature enhancement sub-network is described here. In this sub-network, a primary feature map is extracted from the high- and low-frequency image blocks of the left view using a group of convolution (3 × 3 kernels) and pooling operations; the dual-frequency interaction attention module computes the attention weights of the high- and low-frequency signals and interactively weights the high- and low-frequency features; the features are then further extracted by three groups of convolutions with 3 × 3 kernels; and finally the dual-frequency recombination module produces the high- and low-frequency weighted feature maps;
in the binocular matching fusion sub-network, full-frequency primary feature maps of the left- and right-view image blocks are extracted using a group of convolution and pooling operations; the full-frequency primary feature maps are enhanced with the high/low-frequency information from the dual-frequency interaction attention module to obtain full-frequency enhanced feature maps; and low-frequency binocular registration and high-frequency binocular registration are then performed in sequence on the enhanced full-frequency features of the left and right views;
the binocular fusion feature map is concatenated with the high/low-frequency weighted feature maps of the left view and the right view, and an objective prediction score is obtained through a fully connected layer; during training, the network uses an L1 loss function to measure the difference between the network-predicted image quality score and the true image quality score, guiding model optimization.
S103, using the trained network to evaluate the quality of the stereoscopic image
The stereoscopic image to be evaluated is sliced into blocks and input into the network, and the quality scores output by the network for all image blocks of each stereoscopic image are averaged to obtain the quality score of the whole image.
Example 2:
This embodiment builds on embodiment 1, with the difference that the scheme of embodiment 1 is further described below in connection with specific calculation formulas and example data. Since the left- and right-view sub-networks are symmetric, only one sub-network implementation is described here. The details are as follows:
Step 1: preprocessing the original stereoscopic images for training
The original left- and right-view images are cut into 220 non-overlapping image blocks of size 32 × 32, and the quality score of each image block is set to the true quality score of the original image. Each image block I is converted into the frequency domain for high/low-frequency decomposition using the FFT (fast Fourier transform), and the high- and low-frequency image blocks are recovered with the IFFT (inverse fast Fourier transform):

$$z = \mathrm{FFT}(I), \quad z_{lf},\, z_{hf} = t(z; r), \quad I_{lf} = \mathrm{IFFT}(z_{lf}), \quad I_{hf} = \mathrm{IFFT}(z_{hf})$$

where z represents the frequency components of the sample and t(·; r) is a threshold function that separates the low- and high-frequency components of z according to the hyper-parameter radius r.
To simplify the description, we first consider decomposing a single-channel image into high and low frequencies; for the three-channel images used herein, the operation is performed independently on each channel. A single-channel image is defined as $I \in \mathbb{N}^{n\times n}$, and its frequency domain representation as $z \in \mathbb{C}^{n\times n}$, where $\mathbb{C}$ denotes the complex numbers. The threshold function $z_{lf}, z_{hf} = t(z; r)$ is implemented as:

$$z_{lf}(i,j) = \begin{cases} z(i,j), & d\big((i,j),(c_i,c_j)\big) \le r \\ 0, & \text{otherwise} \end{cases}, \qquad z_{hf} = z - z_{lf}$$

where z(i, j) is the value at position (i, j), $(c_i, c_j)$ is the centroid of the spectrum, and d(·, ·) is the Euclidean distance.
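As an illustration, the decomposition above can be sketched in a few lines of Python; the function name, the fftshift-centered spectrum layout, and the example radius are assumptions for exposition, not part of the disclosure.

```python
# Minimal NumPy sketch of t(z; r): keep frequencies within radius r of the
# spectrum centroid as the low-frequency part; the remainder is high frequency.
import numpy as np

def decompose_hf_lf(img: np.ndarray, r: float):
    """Split one (n, n) single-channel image into low/high-frequency images."""
    n = img.shape[0]
    z = np.fft.fftshift(np.fft.fft2(img))            # spectrum with DC at center
    ci, cj = n // 2, n // 2                          # centroid (c_i, c_j)
    ii, jj = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    dist = np.sqrt((ii - ci) ** 2 + (jj - cj) ** 2)  # Euclidean distance d
    z_lf = np.where(dist <= r, z, 0)                 # t(z; r): keep inside radius
    z_hf = z - z_lf
    i_lf = np.real(np.fft.ifft2(np.fft.ifftshift(z_lf)))  # IFFT restoration
    i_hf = np.real(np.fft.ifft2(np.fft.ifftshift(z_hf)))
    return i_lf, i_hf

# For a three-channel 32 x 32 block, apply the decomposition per channel.
block = np.random.rand(32, 32, 3)
lf = np.stack([decompose_hf_lf(block[..., k], r=8)[0] for k in range(3)], axis=-1)
```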
Step 2, training convolutional neural network based on dual-frequency interaction enhancement and binocular matching
This embodiment constructs a convolutional neural network based on dual-frequency interaction enhancement and binocular matching; the network consists of a left-view dual-frequency interaction feature enhancement sub-network, a right-view dual-frequency interaction feature enhancement sub-network, and a binocular matching fusion sub-network, as follows:
Since the left- and right-view sub-networks are symmetric, only the left-view dual-frequency interaction feature enhancement sub-network is described here.
In the left-view dual-frequency interaction feature enhancement sub-network, the high- and low-frequency image blocks are fed into a group of convolutions with 3 × 3 kernels to extract primary features. The dual-frequency interaction attention module DFIA computes the attention weights of the high- and low-frequency signals and interactively weights the high- and low-frequency features; the features are then further extracted by three groups of convolutions with 3 × 3 kernels; and finally the dual-frequency recombination module DFR produces the high- and low-frequency weighted feature maps;
As shown in fig. 1, the dual-frequency interaction attention module DFIA first passes the primary features $F_{lf}, F_{hf} \in \mathbb{R}^{C\times H\times W}$ through three convolution layers with 1 × 1 kernels each, obtaining two groups of feature maps $\{Q_{lf}, K_{lf}, V_{lf}\}$ and $\{Q_{hf}, K_{hf}, V_{hf}\}$. The feature maps K and Q of each group are then reshaped to $\mathbb{R}^{C\times N}$, where N = H × W. Matrix multiplication is performed between the transpose of Q and K, and a SoftMax layer computes the spatial attention maps, yielding $\mathrm{Att}_{lf}$ and $\mathrm{Att}_{hf}$ from $\{Q_{lf}, K_{lf}\}$ and $\{Q_{hf}, K_{hf}\}$ respectively. Finally, the spatial attention map $\mathrm{Att}_{lf}$ ($\mathrm{Att}_{hf}$) is matrix-multiplied with the feature map $V_{hf}$ ($V_{lf}$) to realize the interactive weighting of the high- and low-frequency features. The whole process can be expressed as:

$$\mathrm{Att}_{lf} = \mathrm{SoftMax}\!\left(Q_{lf}^{T} \otimes K_{lf}\right), \qquad \mathrm{Att}_{hf} = \mathrm{SoftMax}\!\left(Q_{hf}^{T} \otimes K_{hf}\right)$$

$$IF_{hf} = V_{hf} \otimes \mathrm{Att}_{lf}, \qquad IF_{lf} = V_{lf} \otimes \mathrm{Att}_{hf}$$

where ⊗ denotes matrix multiplication and $\{IF_{hf}, IF_{lf}\}$ are the high- and low-frequency features after interaction enhancement.
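A minimal PyTorch sketch of the DFIA computation follows; the class name, the per-branch 1 × 1 projections, and the channel count c are illustrative assumptions consistent with the formulas above, not the exact implementation.

```python
import torch
import torch.nn as nn

class DFIA(nn.Module):
    """Dual-frequency interaction attention (sketch)."""
    def __init__(self, c):
        super().__init__()
        self.q_lf, self.k_lf, self.v_lf = (nn.Conv2d(c, c, 1) for _ in range(3))
        self.q_hf, self.k_hf, self.v_hf = (nn.Conv2d(c, c, 1) for _ in range(3))

    @staticmethod
    def spatial_attention(q, k):
        b, c, h, w = q.shape
        q = q.view(b, c, h * w)                       # reshape to C x N, N = H*W
        k = k.view(b, c, h * w)
        return torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # N x N attention map

    def forward(self, f_lf, f_hf):
        b, c, h, w = f_lf.shape
        att_lf = self.spatial_attention(self.q_lf(f_lf), self.k_lf(f_lf))
        att_hf = self.spatial_attention(self.q_hf(f_hf), self.k_hf(f_hf))
        v_lf = self.v_lf(f_lf).view(b, c, h * w)
        v_hf = self.v_hf(f_hf).view(b, c, h * w)
        if_lf = (v_lf @ att_hf).view(b, c, h, w)  # IF_lf: V_lf weighted by Att_hf
        if_hf = (v_hf @ att_lf).view(b, c, h, w)  # IF_hf: V_hf weighted by Att_lf
        return if_lf, if_hf
```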
As shown in fig. 1, the dual-frequency recombination module DFR consists of a ResBlock, a GAP layer, and a fully connected layer, where the ResBlock contains two convolution layers and one residual connection. The difference and sum of the interactively weighted low- and high-frequency advanced features are concatenated and fed into the DFR module to obtain the weights $W_{hf}$ and $W_{lf}$.
Finally, the obtained weights are normalized and used to weight the low- and high-frequency advanced features. The feature weighting is formulated as follows:

$$W_{hf},\, W_{lf} = \mathrm{FC}\!\left(\mathrm{GAP}\!\left(\mathrm{ResBlock}\!\left([IF_{hf} - IF_{lf},\; IF_{hf} + IF_{lf}]\right)\right)\right)$$

$$\tilde{F}_{hf} = \overline{W}_{hf} \cdot IF_{hf}, \qquad \tilde{F}_{lf} = \overline{W}_{lf} \cdot IF_{lf}$$

where $\{IF_{hf}, IF_{lf}\}$ are the input features of the DFR, $\overline{W}$ denotes the normalized weights, and $\{\tilde{F}_{hf}, \tilde{F}_{lf}\}$ are the final single-view frequency recombination features.
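A PyTorch sketch of the DFR module under the same assumptions follows: ResBlock on the cascaded difference and sum, GAP, a fully connected layer producing two weights, SoftMax as the (assumed) normalization, then branch-wise weighting.

```python
import torch
import torch.nn as nn

class DFR(nn.Module):
    """Dual-frequency recombination (sketch)."""
    def __init__(self, c):
        super().__init__()
        self.res = nn.Sequential(                    # ResBlock: two convs + residual
            nn.Conv2d(2 * c, 2 * c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(2 * c, 2 * c, 3, padding=1))
        self.fc = nn.Linear(2 * c, 2)                # adaptive weights W_hf, W_lf

    def forward(self, if_lf, if_hf):
        x = torch.cat([if_hf - if_lf, if_hf + if_lf], dim=1)  # cascade diff and sum
        x = x + self.res(x)                          # residual connection
        w = torch.softmax(self.fc(x.mean(dim=(2, 3))), dim=1) # GAP -> FC -> normalize
        w_hf = w[:, 0, None, None, None]
        w_lf = w[:, 1, None, None, None]
        return w_hf * if_hf, w_lf * if_lf            # weighted high/low-frequency features
```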
In the binocular matching fusion sub-network, a group of convolution and pooling operations extracts the full-frequency primary feature maps of the left- and right-view image blocks. To highlight contours and details for better registration, the full-frequency primary feature map is enhanced with the high/low-frequency information from the dual-frequency interaction attention module, i.e., the full-frequency enhanced feature map is obtained by matrix-multiplying $S_{lf}$ and $S_{hf}$ with $V_{ff}$ respectively. The process is represented by the following formula:

$$EF_{ff} = V_{ff} \otimes S_{lf} + V_{ff} \otimes S_{hf}$$
Low-frequency binocular registration and high-frequency binocular registration are then performed in sequence on the enhanced left- and right-view full-frequency features. Finally, a binocular competition selection module performs feature screening in the spatial and channel dimensions to obtain the binocular fusion feature map, and features are further extracted through two groups of convolutions. The binocular matching fusion module consists of a binocular progressive registration module and a binocular competition selection module, implemented as follows:
① As shown in fig. 2, the binocular progressive registration module is composed of a low-frequency binocular registration unit and a high-frequency binocular registration unit;
In the low-frequency binocular registration unit, the low-frequency primary features of the left and right views are concatenated with their difference features and fed into a convolution layer with a 1 × 1 kernel to obtain the low-frequency offset; the full-frequency primary features of the left and right views, together with the low-frequency offset, are input into a deformable convolution to obtain the left- and right-view features registered at the low-frequency scale;
in the high-frequency binocular registration unit, the high-frequency primary features of the left and right views are concatenated with their difference features and fed into a convolution layer with a 1 × 1 kernel to obtain the high-frequency offset; the left- and right-view features registered at the low-frequency scale, together with the high-frequency offset, are input into a deformable convolution to obtain the left- and right-view features registered at both the high- and low-frequency scales;
The process is expressed as the following formulas:

$$O_{lf} = \mathrm{Conv}\!\left([F^{lf}_{left},\, F^{lf}_{right},\, F^{lf}_{left} - F^{lf}_{right}]\right), \qquad AF^{lf}_{v} = \mathrm{DeConv}\!\left(F^{ff}_{v},\, O_{lf}\right),\; v \in \{left, right\}$$

$$O_{hf} = \mathrm{Conv}\!\left([F^{hf}_{left},\, F^{hf}_{right},\, F^{hf}_{left} - F^{hf}_{right}]\right), \qquad AF_{v} = \mathrm{DeConv}\!\left(AF^{lf}_{v},\, O_{hf}\right)$$

where Conv is a convolution with a 1 × 1 kernel, $O_{lf}$ and $O_{hf}$ are the deformable convolution offsets, DeConv is the deformable convolution operation, and $AF_{left}$ and $AF_{right}$ are the aligned features.
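A minimal PyTorch sketch of the progressive registration using torchvision's deformable convolution follows; the class name, the channel count c, and a single offset field shared by both views are assumptions, since the disclosure leaves these details open.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ProgressiveRegistration(nn.Module):
    """Binocular progressive registration (sketch)."""
    def __init__(self, c):
        super().__init__()
        # 1x1 convs map [left, right, left - right] (3c channels) to the
        # 2*3*3 = 18 offset channels required by a 3x3 deformable conv.
        self.off_lf = nn.Conv2d(3 * c, 18, kernel_size=1)
        self.off_hf = nn.Conv2d(3 * c, 18, kernel_size=1)
        self.dcn_lf = DeformConv2d(c, c, kernel_size=3, padding=1)
        self.dcn_hf = DeformConv2d(c, c, kernel_size=3, padding=1)

    def forward(self, lf_l, lf_r, hf_l, hf_r, ff_l, ff_r):
        # Low-frequency offset O_lf from the cascaded low-frequency features.
        o_lf = self.off_lf(torch.cat([lf_l, lf_r, lf_l - lf_r], dim=1))
        a_l = self.dcn_lf(ff_l, o_lf)   # register full-frequency features
        a_r = self.dcn_lf(ff_r, o_lf)   # at the low-frequency scale
        # High-frequency offset O_hf refines the low-scale registration.
        o_hf = self.off_hf(torch.cat([hf_l, hf_r, hf_l - hf_r], dim=1))
        return self.dcn_hf(a_l, o_hf), self.dcn_hf(a_r, o_hf)  # AF_left, AF_right
```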
② As shown in fig. 3, the binocular competition selection module consists of a spatial dimension selection block and a channel dimension selection block; the aligned features $\{AF_{left}, AF_{right}\}$ obtained by the binocular progressive registration module are added to the full-frequency features to form the input $\{MF_{left}, MF_{right}\}$;
In the spatial dimension selection block, the binocular features $\{MF_{left}, MF_{right}\}$ are first passed through a convolution layer with a 1 × 1 kernel to obtain $F_l$ and $F_r$; the transpose of $F_l$ is matrix-multiplied with $F_r$, and a SoftMax layer yields the parallax attention maps $\{AM_{R\to L}, AM_{L\to R}\}$, which are then matrix-multiplied with the module inputs $MF_{left}$ and $MF_{right}$ respectively:

$$AM_{R\to L} = \mathrm{SoftMax}\!\left(F_l^{T} \otimes F_r\right), \qquad AM_{L\to R} = \mathrm{SoftMax}\!\left(F_r^{T} \otimes F_l\right)$$

$$SF_{left} = MF_{left} \otimes AM_{R\to L}, \qquad SF_{right} = MF_{right} \otimes AM_{L\to R}$$

where ⊗ is matrix multiplication, T denotes the transpose, and $SF_{left}$ and $SF_{right}$ are the features weighted by the parallax attention.
In the channel dimension selection block, $SF_{left}$ and $SF_{right}$ are fed into a global average pooling layer to obtain the channel attention maps $\{AM_{left}, AM_{right}\}$, which are normalized at the pixel level:

$$\overline{AM}_{left} = \frac{AM_{left}}{AM_{left} + AM_{right}}, \qquad \overline{AM}_{right} = \frac{AM_{right}}{AM_{left} + AM_{right}}$$

The normalized channel attention maps are then point-multiplied with $SF_{left}$ and $SF_{right}$, and the results are added to yield the binocular fusion feature $F_b$:

$$F_b = \overline{AM}_{left} \odot SF_{left} + \overline{AM}_{right} \odot SF_{right}$$
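A PyTorch sketch of the whole competition selection (parallax attention in the spatial dimension, then channel attention) follows; the class name and the mutual normalization of the two channel attention maps are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CompetitionSelection(nn.Module):
    """Binocular competition selection (sketch)."""
    def __init__(self, c):
        super().__init__()
        self.proj_l = nn.Conv2d(c, c, kernel_size=1)
        self.proj_r = nn.Conv2d(c, c, kernel_size=1)

    def forward(self, mf_l, mf_r):
        b, c, h, w = mf_l.shape
        f_l = self.proj_l(mf_l).view(b, c, h * w)
        f_r = self.proj_r(mf_r).view(b, c, h * w)
        # Parallax attention maps AM_{R->L} and AM_{L->R}.
        am_rl = torch.softmax(f_l.transpose(1, 2) @ f_r, dim=-1)
        am_lr = torch.softmax(f_r.transpose(1, 2) @ f_l, dim=-1)
        # Spatial-dimension weighting: SF_left, SF_right.
        sf_l = (mf_l.view(b, c, -1) @ am_rl).view(b, c, h, w)
        sf_r = (mf_r.view(b, c, -1) @ am_lr).view(b, c, h, w)
        # Channel attention via global average pooling, then normalization.
        am_l = sf_l.mean(dim=(2, 3), keepdim=True)
        am_r = sf_r.mean(dim=(2, 3), keepdim=True)
        s = am_l + am_r + 1e-8
        return (am_l / s) * sf_l + (am_r / s) * sf_r  # fused binocular feature F_b
```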
Further, $F_b$ is fed into two groups of convolution and pooling layers to extract the final binocular fusion feature $F_{bino}$.
Finally, the binocular fusion feature $F_{bino}$ is concatenated with the left- and right-view high/low-frequency weighted features $\{\tilde{F}^{left}_{hf}, \tilde{F}^{left}_{lf}\}$ and $\{\tilde{F}^{right}_{hf}, \tilde{F}^{right}_{lf}\}$, and the objective prediction score is obtained through a fully connected layer. During training, the network model adopts an L1 loss function to measure the difference between the network-predicted image quality score and the true quality score, guiding model optimization.
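A sketch of the regression head and training loss under stated assumptions: only the concatenation, the fully connected layer, and the L1 loss follow the text; the flattened feature vectors and the dimension parameter are illustrative.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Quality regression (sketch): splice features, then predict a score."""
    def __init__(self, feat_dim):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)  # feat_dim = combined length of all features

    def forward(self, f_bino, f_left, f_right):
        x = torch.cat([f_bino, f_left, f_right], dim=1)  # splice the features
        return self.fc(x).squeeze(1)                     # objective prediction score

criterion = nn.L1Loss()  # loss = criterion(predicted_scores, true_scores)
```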
Step 3: image quality assessment using the trained network
The stereoscopic image to be evaluated is sliced into blocks and input into the network, and the quality scores output by the network for all image blocks of each stereoscopic image are averaged to obtain the quality score of the whole image.
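A minimal inference sketch; the model interface (one score per 32 × 32 patch pair) is an assumption for illustration.

```python
import torch

@torch.no_grad()
def predict_image_score(model, left_patches, right_patches):
    """left_patches, right_patches: (N, 3, 32, 32) tensors holding all
    non-overlapping blocks of one stereoscopic image pair."""
    patch_scores = model(left_patches, right_patches)  # (N,) per-patch scores
    return patch_scores.mean().item()                  # whole-image quality score
```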
Example 3:
This embodiment builds on embodiments 1-2, with the difference that the schemes of embodiments 1 and 2 are validated through specific experiments, as follows:
To verify the performance of the proposed method, we tested it on four datasets (LIVE I, LIVE II, WIVC I, and WIVC II). Two widely used indices were chosen: the Spearman rank-order correlation coefficient (SROCC/SRCC) and the Pearson linear correlation coefficient (PLCC). SROCC evaluates the monotonicity of the method's predictions, while PLCC describes the linear correlation between predicted values and true values. For both PLCC and SROCC, larger values indicate better performance.
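The two indices can be computed with SciPy as follows; `pred` and `mos` denote illustrative arrays of predicted scores and subjective ground-truth scores for one test set.

```python
from scipy.stats import spearmanr, pearsonr

def evaluate(pred, mos):
    srocc = spearmanr(pred, mos).correlation  # monotonicity of predictions
    plcc, _ = pearsonr(pred, mos)             # linear correlation with ground truth
    return srocc, plcc
```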
To verify the performance of the present invention, seven mainstream no-reference stereoscopic image quality assessment methods were selected for comparison on the LIVE 3D and Waterloo IVC 3D databases. These are CNN-based evaluation methods (Fang, Liu, Yang, Zhou, Sim, Si, and Chang et al.). The results are shown in Tables 1 and 2, where the two best-performing results in each column are shown in bold.
Table 1. Comparison of method performance on the LIVE 3D image database
Table 2. Comparison of method performance on the Waterloo IVC 3D image database
As can be seen from Tables 1 and 2, the performance indicators SROCC and PLCC of the present invention lead all compared methods on both the LIVE 3D and Waterloo IVC 3D image databases. The invention achieves excellent results with good generalization, and the model's predictions are highly consistent with human subjective scores.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise forms disclosed; any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope.