Disclosure of Invention
The invention provides a fixation point detection method based on quaternion wavelet transform and deep visual perception. A quaternion wavelet transform is applied to an image to generate 12 detail sub-band maps reflecting the detail information of the image. A deep convolutional neural network is used to learn the feature information that characterizes the fixation point. Because the 12 detail sub-band maps contain a large amount of data, a network built from 1 × 1 convolution kernels is first used to reduce their dimensionality and extract low-dimensional feature maps to be trained, which improves the training efficiency of the deep convolutional neural network. The deep convolutional neural network is then trained on the low-dimensional feature maps, and the trained network structure is used to extract the fixation point information of the image and detect the fixation point, yielding the fixation map.
The purpose of the invention is realized by the following technical scheme.
A fixation point detection method based on quaternion wavelet transform and deep visual perception comprises the following steps:
step 1, performing a one-level quaternion wavelet decomposition on a natural scene image: a low-pass filter and a high-pass filter are applied to the row pixels and column pixels of the image in different combinations, yielding 4 channels, namely low-pass/low-pass, low-pass/high-pass, high-pass/low-pass and high-pass/high-pass, and producing 12 detail sub-band maps in three directions (horizontal, vertical and diagonal) together with 4 approximation maps over the 4 channels;
step 2, using a dimensionality-reduction convolution network built from 1 × 1 convolution kernels to reduce the dimensionality of the 12 detail sub-band maps, extracting from them 3 detail feature maps that better represent the image detail information, for training the deep convolutional neural network structure that extracts the image fixation point;
step 3, training a deep convolutional neural network on the detail feature maps extracted by the dimensionality-reduction convolution network, establishing a mapping network between the detail feature maps and the image fixation point, and detecting the fixation point with the trained dimensionality-reduction convolution network and the trained deep convolutional neural network.
Preferably, step 1 further comprises: the quaternion wavelet transform in the invention refers to the dual-tree quaternion two-dimensional discrete wavelet transform. The quaternion wavelet transform of the image is formed from a real wavelet transform and two-dimensional Hilbert transforms; the orthonormal bases of the quaternion wavelet transform are constructed by the two-dimensional Hilbert transform, and applying the quaternion wavelet transform to the image yields the wavelet coefficients of the four channels, namely 12 detail sub-band maps and 4 approximation maps. This is realized by the following steps:
1) Let φ and ψ denote the wavelet scale function and the wavelet basis function of the quaternion wavelet transform, respectively. The Hilbert transforms along the horizontal direction x, the vertical direction y and the diagonal direction xy can then be expressed as equation (1), where H denotes the Hilbert transform; the results of the transformation in equation (1) together form one set of orthonormal bases.
2) Analogously to step 1), Hilbert transforms are applied separately to the remaining separable products of the scale function and the wavelet function, including ψ_h(x)ψ_h(y), to construct the four sets of orthonormal bases contained in the quaternion wavelet transform, which can be expressed as the matrix G:
3) a one-level quaternion wavelet decomposition is applied to the image, yielding the wavelet decomposition coefficients on the four channels, represented by the matrix F:
where LL denotes the low-pass/low-pass channel, LH the low-pass/high-pass channel, HL the high-pass/low-pass channel and HH the high-pass/high-pass channel. In the matrix F, the first row represents the coefficient matrices of the approximation part, i.e. the 4 approximation maps; the second, third and fourth rows represent the detail coefficient matrices in the horizontal, vertical and diagonal directions respectively, i.e. the 12 detail sub-band maps.
Preferably, step 2 further comprises: the 12 detail sub-band maps obtained in the step 1 contain a large data volume, and if the 12 detail sub-band maps are used for directly training the deep convolutional neural network, a long training time is needed, and in order to improve the training efficiency, the dimension reduction operation is performed on the 12 detail sub-band maps by adopting the convolutional neural network constructed by 1 × 1 convolutional kernel.
Preferably, step 2 further comprises: the convolutional neural network for reducing the dimensionality of the data to be trained comprises 1 input layer, 3 convolutional layers and 1 output layer, and the connection mode is as follows: input layer → convolutional layer 1 → convolutional layer 2 → convolutional layer 3 → output layer, the output of each convolutional layer is input to the next adjacent layer after a batch normalization (BatchNorm) and activation function 1 (ReLU). And the input layer inputs the 12 detailed sub-band diagrams into the convolution neural network with reduced dimensionality, and the characteristic diagram to be trained with low dimensionality can be obtained after multilayer convolution processing. The representation of the ReLU function is as follows:
f(x)=max(0,x) (4)
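As an illustration of the 1 × 1 dimensionality reduction described above, the following sketch (PyTorch is an assumed framework; the patent does not name an implementation, and the tensor sizes are illustrative) shows how a single 1 × 1 convolution maps the 12 detail sub-band maps to a smaller number of channels without changing the spatial resolution:

```python
import torch
import torch.nn as nn

# The 12 detail sub-band maps stacked as channels of one tensor
# (batch, channels, height, width); the 256 x 256 size is illustrative.
subbands = torch.randn(1, 12, 256, 256)

# A 1 x 1 convolution mixes channels pixel by pixel, so it reduces the
# channel dimension (12 -> 3 here) while leaving height and width unchanged.
reduce = nn.Conv2d(in_channels=12, out_channels=3, kernel_size=1)

features = reduce(subbands)
print(features.shape)  # torch.Size([1, 3, 256, 256])
```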
Preferably, step 3 further comprises: the dimensionality-reduced feature maps to be trained are input into a deep convolutional neural network, the network is trained, and the trained network structure is used to detect the image fixation point. The deep convolutional neural network is further divided into a network for extracting fixation point features and a network for detecting the fixation point; the specific network structure and implementation steps are as follows:
1) The network for extracting fixation point features is constructed; it comprises 1 input layer, 5 convolution stages and 1 output layer. The first two convolution stages each comprise 2 convolutional layers and 1 pooling layer, the next two convolution stages each comprise 3 convolutional layers and 1 pooling layer, and the last convolution stage comprises only 3 convolutional layers. The specific connection is: input layer → convolution stage 1 (convolutional layer 1_1 → convolutional layer 1_2 → pooling layer 1) → convolution stage 2 (convolutional layer 2_1 → convolutional layer 2_2 → pooling layer 2) → convolution stage 3 (convolutional layer 3_1 → convolutional layer 3_2 → convolutional layer 3_3 → pooling layer 3) → convolution stage 4 (convolutional layer 4_1 → convolutional layer 4_2 → convolutional layer 4_3 → pooling layer 4) → convolution stage 5 (convolutional layer 5_1 → convolutional layer 5_2 → convolutional layer 5_3) → output layer, and the output of each convolutional layer passes through activation function 1 (ReLU) before being input to the next layer.
Each convolutional layer adopts a small-scale convolution kernel; compared with a large-scale kernel, convolution with small-scale kernels reduces the number of parameters of the network structure, as the worked comparison below illustrates. The input layer feeds the 3 detail feature maps to be trained into the feature-extraction network, and the fixation point feature information is output after the 5 convolution stages.
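To make the parameter argument concrete, the short Python sketch below (the channel count C is purely illustrative and not taken from the patent) compares the weights needed by two stacked 3 × 3 convolutions, which cover the same 5 × 5 receptive field as a single 5 × 5 convolution:

```python
# Parameter count of one conv layer (ignoring biases): k * k * C_in * C_out.
def conv_params(kernel, c_in, c_out):
    return kernel * kernel * c_in * c_out

C = 64  # illustrative channel count

# Two stacked 3 x 3 layers cover a 5 x 5 receptive field.
stacked_3x3 = 2 * conv_params(3, C, C)   # 2 * 9 * C^2 = 73,728
single_5x5 = conv_params(5, C, C)        # 25 * C^2   = 102,400

print(stacked_3x3, single_5x5)
```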
2) The network structure for detecting the fixation map is constructed; it comprises 3 deconvolution layers, 1 convolutional layer and 1 output layer, connected as follows: the feature information output by convolutional layer 3_3, convolutional layer 4_3 and convolutional layer 5_3 of the feature-extraction network is fed into different deconvolution layers; each result then undergoes one cropping (Crop) operation to obtain 3 feature maps whose size is consistent with that of the original image; these pass through 1 convolutional layer to output 1 feature map representing the image fixation point information, which passes through activation function 2 (Sigmoid) to output the fixation map of the image.
The feature information output by each convolutional layer is different, and to improve the detection effect the feature information output by convolutional layers 3_3, 4_3 and 5_3 is fused for fixation point detection. Because the fixation point features are extracted through multiple convolutional and pooling layers, the feature maps output by different convolutional layers have different sizes, so the feature maps output by convolutional layers 3_3, 4_3 and 5_3 are deconvolved before being fused. The fused fixation point feature information is passed through the Sigmoid function to compute the saliency value of each pixel, thereby obtaining the detected fixation map. The Sigmoid function is expressed as follows:
f(x)=1/(1+e^(-x))
Detailed Description
The present invention will be further described with reference to the accompanying drawings and detailed description, but such embodiments are described by way of illustration only, and are not intended to limit the scope of the invention.
The invention relates to a method for detecting a fixation point based on quaternion wavelet transform and deep visual perception; fig. 1 is an overall flow block diagram of the method. The specific implementation steps are as follows:
1. Acquisition of the sub-band maps
The quaternion wavelet transform applies the dual-tree quaternion two-dimensional discrete wavelet transform to the image; after the quaternion wavelet transform the image yields 12 detail sub-band maps and 4 approximation maps, and the image features used for fixation point detection are extracted from the 12 detail sub-band maps. The specific steps of the quaternion wavelet transform are as follows:
1) Constructing the wavelet functions: φ and ψ denote the wavelet scale function and the wavelet basis function of the quaternion wavelet transform, respectively, and the Hilbert transforms in the x direction, the y direction and the xy direction can be expressed as in equation (1).
2) Constructing the orthonormal bases of the quaternion wavelet transform using the Hilbert transform: analogously to step 1), Hilbert transforms are applied separately to the separable products of the scale function and the wavelet function, constructing the four sets of orthonormal bases contained in the quaternion wavelet transform, expressed as the matrix G.
3) Applying the quaternion wavelet transform to the image: the wavelet decomposition coefficients on the four channels are obtained, represented by the matrix F.
In the matrix F, the first row represents the coefficient matrices of the approximation part, i.e. the 4 approximation maps; the second, third and fourth rows represent the detail coefficient matrices in the horizontal, vertical and diagonal directions respectively, i.e. the 12 detail sub-band maps, as shown in fig. 1(a). The wavelet coefficients (detail sub-band maps) selected by the invention are the second, third and fourth rows of the matrix F.
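As a minimal sketch of this selection, assume the one-level coefficients of the four channels (LL, LH, HL, HH) are already available as 2-D arrays; the NumPy code below (the array names and sizes are hypothetical) arranges them in the layout of the matrix F and keeps only the second to fourth rows, i.e. the 12 detail sub-band maps:

```python
import numpy as np

# Hypothetical inputs: for each channel (LL, LH, HL, HH of the dual-tree
# filtering) the one-level decomposition gives an approximation map plus
# horizontal, vertical and diagonal detail maps (illustrative size H x W).
H, W = 128, 128
channels = ["LL", "LH", "HL", "HH"]
coeffs = {c: {"approx": np.zeros((H, W)),
              "horizontal": np.zeros((H, W)),
              "vertical": np.zeros((H, W)),
              "diagonal": np.zeros((H, W))} for c in channels}

# Rows of the matrix F: approximation, horizontal, vertical, diagonal;
# columns: the four channels.
F = [[coeffs[c][part] for c in channels]
     for part in ("approx", "horizontal", "vertical", "diagonal")]

# The invention keeps rows 2-4 only: 3 directions x 4 channels = 12 maps.
detail_subbands = np.stack([m for row in F[1:] for m in row])
print(detail_subbands.shape)  # (12, 128, 128)
```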
2. Acquisition of the feature maps to be trained
The invention uses 1 × 1 convolution kernels to construct a dimensionality-reduction convolution network, as shown in fig. 1(b), which extracts the low-dimensional feature maps to be trained from the 12 detail sub-band maps. The interlayer connection of the dimensionality-reduction convolution network is shown in fig. 2: input layer → convolutional layer 1 → convolutional layer 2 → convolutional layer 3 → output layer, where the output of each convolutional layer passes through one batch normalization (BatchNorm) step and activation function 1 (ReLU) before being fed to the next layer. Convolutional layer 1 uses 1 × 1 × 16 convolution kernels, so the 12 input detail sub-band maps produce 16 feature maps after the first convolution; convolutional layer 2 uses 1 × 1 × 8 kernels and outputs 8 feature maps after the second convolution; convolutional layer 3 uses 1 × 1 × 3 kernels and outputs 3 feature maps after the third convolution. All convolutional layers use a stride of 1. The data output by the dimensionality-reduction convolution network are the feature maps to be trained and are input into the deep convolutional neural network, as shown in fig. 1(c). The ReLU activation function takes the following form:
f(x)=max(0,x) (4)
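A minimal sketch of this dimensionality-reduction network, assuming PyTorch as the framework (the patent does not specify one) and using the 1 × 1 × 16, 1 × 1 × 8 and 1 × 1 × 3 kernels with stride 1 described above:

```python
import torch
import torch.nn as nn

# Dimensionality-reduction network: three 1 x 1 convolutions with stride 1,
# each followed by BatchNorm and ReLU, mapping 12 sub-band maps to 3 feature maps.
reduction_net = nn.Sequential(
    nn.Conv2d(12, 16, kernel_size=1, stride=1), nn.BatchNorm2d(16), nn.ReLU(inplace=True),
    nn.Conv2d(16, 8, kernel_size=1, stride=1), nn.BatchNorm2d(8), nn.ReLU(inplace=True),
    nn.Conv2d(8, 3, kernel_size=1, stride=1), nn.BatchNorm2d(3), nn.ReLU(inplace=True),
)

subbands = torch.randn(1, 12, 224, 224)   # 12 detail sub-band maps (illustrative size)
to_train = reduction_net(subbands)        # 3 feature maps to be trained
print(to_train.shape)                     # torch.Size([1, 3, 224, 224])
```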
3. Acquisition of the fixation map
The dimensionality-reduced feature maps to be trained are input into the deep convolutional neural network, the network is trained, and the trained network structure is used to detect the image fixation point and thereby obtain the fixation map. The interlayer connection of the deep convolutional neural network is shown in fig. 3; it is further divided into a network for extracting fixation point features and a network for detecting the fixation point, with the specific network structure and implementation steps as follows:
1) The network for extracting fixation point features is constructed; it comprises 1 input layer, 5 convolution stages and 1 output layer. The first two convolution stages each comprise 2 convolutional layers and 1 pooling layer, the next two convolution stages each comprise 3 convolutional layers and 1 pooling layer, and the last convolution stage comprises only 3 convolutional layers. The specific connection is: input layer → convolution stage 1 (convolutional layer 1_1 → convolutional layer 1_2 → pooling layer 1) → convolution stage 2 (convolutional layer 2_1 → convolutional layer 2_2 → pooling layer 2) → convolution stage 3 (convolutional layer 3_1 → convolutional layer 3_2 → convolutional layer 3_3 → pooling layer 3) → convolution stage 4 (convolutional layer 4_1 → convolutional layer 4_2 → convolutional layer 4_3 → pooling layer 4) → convolution stage 5 (convolutional layer 5_1 → convolutional layer 5_2 → convolutional layer 5_3) → output layer, and the output of each convolutional layer passes through activation function 1 (ReLU) before being input to the next layer.
In the network for extracting fixation point features, every convolutional layer uses a 3 × 3 convolution kernel; compared with large-scale kernels, this reduces the number of parameters of the network structure during the convolution operations. All convolution kernels in the first convolution stage are 3 × 3 × 64, in the second stage 3 × 3 × 128, in the third stage 3 × 3 × 256, in the fourth stage 3 × 3 × 512 and in the fifth stage 3 × 3 × 512; all pooling layers perform 2 × 2 max pooling. The input layer feeds the 3 detail feature maps to be trained, output by the dimensionality-reduction convolution network, into the feature-extraction network, and the fixation point feature information is output after the 5 convolution stages.
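A condensed PyTorch sketch of this five-stage feature-extraction network follows (same framework assumption as above; a padding of 1 is assumed so that only the pooling layers change the spatial size, which the patent does not state explicitly):

```python
import torch
import torch.nn as nn

def convs(in_ch, out_ch, n):
    # n stacked 3 x 3 convolutions, each followed by ReLU.
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

pool = nn.MaxPool2d(2)  # 2 x 2 max pooling used after stages 1-4

stage1 = convs(3, 64, 2)
stage2 = convs(64, 128, 2)
stage3 = convs(128, 256, 3)
stage4 = convs(256, 512, 3)
stage5 = convs(512, 512, 3)

x = torch.randn(1, 3, 224, 224)              # 3 detail feature maps to be trained
f3 = stage3(pool(stage2(pool(stage1(x)))))   # features of convolutional layer 3_3
f4 = stage4(pool(f3))                        # features of convolutional layer 4_3
f5 = stage5(pool(f4))                        # features of convolutional layer 5_3
print(f3.shape, f4.shape, f5.shape)          # 1/4, 1/8 and 1/16 of the input size
```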
2) The network structure for detecting the fixation map is constructed; it comprises 3 deconvolution layers, 1 convolutional layer and 1 output layer, connected as follows: the feature information output by convolutional layer 3_3, convolutional layer 4_3 and convolutional layer 5_3 of the feature-extraction network is fed into deconvolution layer 1, deconvolution layer 2 and deconvolution layer 3, respectively; each result then undergoes one cropping (Crop) operation, giving 3 feature maps whose size is consistent with that of the original image; these pass through convolutional layer 6 to output 1 feature map representing the image fixation point information, which passes through activation function 2 (Sigmoid) to output the fixation map of the image.
In order to keep the size of the output feature maps consistent with that of the original image, the invention uses the 3 deconvolution layers to enlarge the feature maps output by convolutional layers 3_3, 4_3 and 5_3, respectively; the enlarged feature maps are cropped to the size of the original image, giving 3 feature maps of that size. The 3 feature maps are then fused into 1 feature map representing the image fixation point information by a convolution with a 1 × 1 × 1 kernel, and the fused fixation point feature map is passed through the Sigmoid function to compute the saliency value of each pixel, giving the detected fixation map. The Sigmoid function takes the following form:
f(x)=1/(1+e^(-x))
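A minimal PyTorch sketch of this detection branch, under the same framework assumption; the deconvolution kernel sizes and strides and the center cropping are illustrative choices, since the patent only states that each branch is upsampled and cropped back to the original image size:

```python
import torch
import torch.nn as nn

def center_crop(t, h, w):
    # Crop (Crop layer) a feature map back to the original image size h x w.
    dh, dw = (t.shape[2] - h) // 2, (t.shape[3] - w) // 2
    return t[:, :, dh:dh + h, dw:dw + w]

H = W = 224  # original image size (illustrative)

# One deconvolution (transposed convolution) per tapped layer; the strides
# undo the 4x / 8x / 16x downsampling at convolutional layers 3_3, 4_3, 5_3.
deconv3 = nn.ConvTranspose2d(256, 1, kernel_size=8,  stride=4,  padding=2)
deconv4 = nn.ConvTranspose2d(512, 1, kernel_size=16, stride=8,  padding=4)
deconv5 = nn.ConvTranspose2d(512, 1, kernel_size=32, stride=16, padding=8)

fuse = nn.Conv2d(3, 1, kernel_size=1)  # the 1 x 1 x 1 fusion convolution

# f3, f4, f5 stand for the feature maps tapped from the extraction network.
f3 = torch.randn(1, 256, 56, 56)
f4 = torch.randn(1, 512, 28, 28)
f5 = torch.randn(1, 512, 14, 14)

u3 = center_crop(deconv3(f3), H, W)
u4 = center_crop(deconv4(f4), H, W)
u5 = center_crop(deconv5(f5), H, W)

saliency = torch.sigmoid(fuse(torch.cat([u3, u4, u5], dim=1)))  # fixation map in [0, 1]
print(saliency.shape)  # torch.Size([1, 1, 224, 224])
```

With the kernel sizes chosen here the upsampled maps already land exactly on 224 × 224, so the crop is a no-op; for other image sizes the crop removes the deconvolution overshoot, which is the role the Crop operation plays in the patent.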