Disclosure of Invention
The invention aims to provide a high-precision gaze estimation method for low-light environments. An innovative adaptive local-global fusion (ALGCF) module is designed and introduced into a low-light image enhancement network to address the importance of eye features, and their dynamic variation, in the gaze estimation task. By combining a local feature extractor and a global context information extractor through an adaptive fusion gating mechanism, the ALGCF module provides an effective multi-scale fusion strategy and greatly improves accuracy. In addition, building on the low-light image enhancement technique, the invention designs a gaze estimation model that integrates a feature purification module and an attention mechanism. This design ensures that ocular features are extracted effectively even in low-light conditions, allowing accurate gaze estimation. The method effectively addresses the marked drop in gaze estimation accuracy in low-light environments, thereby improving the practicability and accuracy of the system. To achieve the above purpose, the present invention adopts the following technical scheme:
In order to solve at least one of the above problems, according to an aspect of the present invention, a high-precision line-of-sight estimation method in a low-light environment specifically includes the following steps:
S1, preprocessing the data set to simulate low-light environments;
S2, enhancing the low-light image to obtain an enhanced image I_enhanced;
S3, calibrating the enhanced image I_enhanced to obtain a calibrated image I_calibrated;
S4, extracting features from the calibrated image using an improved residual network model ResNet, and outputting a feature vector;
S5, mapping the feature vector into a three-dimensional output vector O through a fully connected layer;
S6, applying a hyperbolic tangent transformation to the first two elements of the three-dimensional output vector O to obtain the predicted gaze direction;
S7, transforming the third element of the three-dimensional output vector O through a sigmoid function to obtain the uncertainty of the gaze prediction;
S8, measuring the error between the prediction and the ground truth with an MSELoss loss function, and updating the network parameters through back propagation.
Further, in S1, to simulate visual effects in different low light environments, preprocessing is performed on the Gaze360 data set, and a new Gaze360 data set reflecting various low light conditions is constructed;
categorizing low light environments, including darker scenes, extremely dark scenes, low light environments simulated using gamma correction, and dark scenes with unknown light source locations;
S1 comprises the following specific steps:
S101, for darker scenes and extremely dark scenes, the brightness and contrast of the images are adjusted to increase the realism of night vision; specifically, brightness adjustment is realized by shifting the dark and bright intervals of the image, and contrast adjustment is realized by adjusting the distribution range of colors in the image, so that the colors are more concentrated and the distinction between dark and bright parts is enhanced;
S102, for the low-illumination environment image set simulated using gamma correction, the gamma-corrected output image O is obtained through O = I^(1/G) × 255, wherein I is the input image and G is the gamma value;
S103, for the dark scene image set with unknown light source positions, the image is darkened and a local light source effect is added: first, the overall brightness of the image is markedly reduced by adjusting the brightness, enhancing the appearance of dark parts in a night environment; then, a gradual light source effect is introduced at a random position of the image and Gaussian blur is applied, simulating the illumination of a specific light source and ensuring that the illumination effect blends naturally into the scene.
In step S2, the low-light image is enhanced by an enhancement network module, which extracts multi-scale features from the input image and combines them with global context information to achieve detail enhancement and illumination balance;
S2 comprises the following specific steps:
S201, extracting initial features, wherein basic features are extracted from the input image I through an initial convolution layer:
F_0 = ReLU(Conv_0(I))
The ReLU activation function is used for nonlinear processing, so that the feature map F_0 retains basic illumination and detail information and provides reliable initial features for subsequent fusion and enhancement;
S202, adaptive local-global fusion (ALGCF);
Extracting detail information of the eye region with the local feature extractor:
F_local = Conv_1(F_0)
Conv_1 is the convolution layer of the local feature extractor, used to extract eye details from the initial feature map;
Acquiring the overall illumination and structure information of the face through the global context information extractor:
F_global = In(Conv_2(AAP(F_0)), size = F_local)
AAP is an adaptive average pooling layer that extracts illumination and structure information from the global context; Conv_2 uses a 1 × 1 convolution kernel to generate the global feature map, which is then resized to the same size as the local features through the interpolation operation In for fusion;
Combining the local and global features, generating fusion weights, and carrying out adaptive feature fusion through the fusion weights:
G = σ(Conv_3(Concat(F_local, F_global)))
F_ALGCF = G ⊙ F_local + (1 − G) ⊙ F_global
Concat combines the local and global features into a fused feature map, Conv_3 is the convolution layer of the gating mechanism, and σ denotes the Sigmoid function that generates the fusion weight G; the fused feature F_ALGCF adaptively combines the local and global features according to the weights G and 1 − G, ensuring that the facial feature map has global illumination consistency while preserving local eye details;
S203, further extracting features from the fused feature map F_ALGCF through a plurality of convolution blocks:
F_(i+1) = F_i + ReLU(BN(Conv_i(F_i)))
Conv_i is the convolution layer of each convolution block, F_i is the i-th layer feature map, BN (BatchNorm) is used for feature normalization, and the ReLU activation function ensures nonlinear processing of the features;
The output convolution layer converts the fused feature map into an enhanced image:
F_output = σ(Conv_4(F_(i+1)))
Conv_4 is the output convolution layer; it converts the feature map into the final enhanced image through convolution and normalizes the pixel values to the range 0-1 through a Sigmoid activation function;
S204, finally, adding the enhancement feature map to the input image to obtain the final enhanced image:
I_enhanced = Clamp(F_output + I, 0, 1)
The Clamp function ensures that the pixel values of the enhanced image remain within a reasonable range. By enhancing details and fusing them with the original image, the final enhanced image has higher brightness and contrast.
Further, in step S3, the enhanced image I_enhanced passes through an initial convolution layer to obtain a calibration feature map:
F_calib(0) = ReLU(BN(Conv_5(I_enhanced)))
Conv_5 is the input convolution layer of the calibration network; it extracts the calibration feature map, which is normalized through BatchNorm, while the ReLU activation function is used for nonlinear processing; F_calib(0) denotes the feature map obtained after the enhanced image is processed by the initial convolution layer, i.e., the calibration feature map;
The calibration feature map F_calib(0) is subjected to illumination and detail adjustment through a plurality of convolution blocks:
F_calib(i) = F_calib(i−1) + ReLU(BN(Conv_i(F_calib(i−1))))
Wherein F_calib(i−1) denotes the output feature map of layer i−1, which is also the input of the i-th convolution block; it provides the input data for the current layer and contains the accumulated feature information of the preceding layers; F_calib(i) is the output feature map of the i-th convolution block, combining the original input feature F_calib(i−1) with the new features obtained after convolution, batch normalization, and ReLU; the residual connection (+) helps prevent the vanishing-gradient problem in deep networks and ensures that sufficient original information is retained even in deep layers;
The final calibration feature map passes through the output convolution layer to generate a difference image:
ΔI = σ(Conv_out(F_calib(i)))
Conv_out is the output convolution layer; it generates the final difference image through convolution, and the Sigmoid function ensures that the output values remain within a reasonable range;
Finally, the difference image is subtracted from the enhanced input image to obtain the calibrated final image:
I_calibrated = I_enhanced − ΔI
Through this subtraction, the calibration module eliminates illumination non-uniformity and artifacts in the enhanced image, so that the final calibrated image is closer to the ideal state.
Further, in step S4, feature extraction is performed on the calibrated image using the improved residual network model ResNet; attention mechanism modules are introduced in the second and third stages, and a feature purification module is introduced in the fourth stage.
According to an aspect of the present invention, there is provided a storage medium having instructions stored therein, which when read by a computer, cause the computer to execute the high-precision line-of-sight estimation method in any one of the above-described low-light environments.
According to another aspect of the present invention, there is provided an electronic device comprising a processor and the storage medium described above, the processor executing instructions in the storage medium.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention introduces a low-light image enhancement technique that focuses on improving the clarity of eye details in low-light environments through multi-scale feature fusion and global information extraction. The synergy of the enhancement and calibration modules significantly improves the visibility and quality of the image and provides high-quality input data for gaze estimation, thereby enhancing the accuracy and reliability of the model.
2. The present invention proposes a gaze estimation model that combines feature purification and attention mechanisms. The feature purification module effectively extracts eye details through adaptive weights and a region attention mechanism, while the attention mechanism achieves more accurate gaze estimation by integrating spatial and channel weights with feature correlation analysis. In extensive tests, the angle errors of the method on the Gaze360 data sets under the four low-light environments dark_comp, dark_super, dark_gamma, and dark_light are 12.75 degrees, 13.27 degrees, 12.43 degrees, and 11.57 degrees, respectively, which is superior to existing advanced network models.
3. The invention not only opens up a new research direction for improving gaze estimation in low-light environments by means of low-light image enhancement, but also provides a new approach and solution for gaze estimation technology. In addition, the technique provides important technical support for application fields such as human-machine interaction and driver fatigue monitoring in low-light environments, widening its range of application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, the high-precision gaze estimation method in low-light environments uses an improved residual network model ResNet, which significantly optimizes the performance of the model in various low-light environments by introducing attention mechanism modules in its second and third stages and a feature purification module in its fourth stage. In addition, the introduced low-light image enhancement technique combines an enhancement module and a calibration module, effectively improving the detail visibility and quality of the image. Specifically, the enhancement module enhances details of the eye region through multi-scale feature fusion and global context information extraction, and the calibration module further refines the image to eliminate artifacts that may be introduced during enhancement.
Fig. 2 shows the overall network framework of the FWF-Gaze model proposed by the present invention, which is specifically designed for gaze estimation in low-light environments. The image is processed by the image enhancement network and then passed to the gaze estimation network; this end-to-end integration enables the FWF-Gaze model to perform excellently on the gaze estimation problem under complex illumination.
In order to meet the requirements, the technical scheme adopted by the invention is as follows:
A high-precision sight line estimation method in a low-light environment specifically comprises the following steps:
S1, preprocessing the Gaze360 data set to simulate four low-light environments and prepare the data for subsequent image processing and analysis.
S2, inputting the low-light image into the enhancement network and processing it to obtain the enhanced image I_enhanced, so as to improve the visibility and detail clarity of the image under low-light conditions.
S3, the enhanced image I_enhanced is further processed by the calibration network, which outputs the final calibrated image I_calibrated, so as to eliminate noise and artifacts that may have been introduced during enhancement.
S4, performing feature extraction on the calibrated image using the improved residual network model ResNet and outputting a feature vector.
S5, mapping the feature vector into a three-dimensional output vector O through a fully connected layer, where the vector contains the predicted gaze direction (horizontal and vertical angles) and its prediction uncertainty (angle error).
S6, applying a hyperbolic tangent transformation to the first two elements of the three-dimensional output vector O to obtain the predicted gaze direction.
S7, transforming the third element of the three-dimensional output vector O through a sigmoid function to obtain the uncertainty of the gaze prediction.
S8, measuring the error between the prediction and the ground truth with an MSELoss loss function, and updating the network parameters through back propagation to optimize the performance and accuracy of the model.
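As a concrete illustration of steps S5-S7, the following PyTorch sketch shows one way the prediction head could be organized; the class name, the feature dimension, and the omission of any angle-range scaling are assumptions made for illustration rather than details taken from the original implementation.

```python
import torch
import torch.nn as nn

class GazeHead(nn.Module):
    """Illustrative head for S5-S7: maps the backbone feature vector to a
    3-D output O, then splits it into gaze angles and an uncertainty value."""

    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(feature_dim, 3)      # S5: feature vector -> 3-D output O

    def forward(self, features: torch.Tensor):
        o = self.fc(features)                    # O = [o1, o2, o3]
        gaze = torch.tanh(o[:, :2])              # S6: values in (-1, 1); scaling to a
                                                 # physical angle range is dataset-dependent
        uncertainty = torch.sigmoid(o[:, 2:])    # S7: prediction uncertainty in (0, 1)
        return gaze, uncertainty
```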
Preferably, in step S1, in order to simulate the visual effect in different low-light environments, the present invention applies four specific preprocessing techniques to the Gaze360 dataset: darker scenes (dark_comp), extremely dark scenes (dark_super), low-light environments simulated using gamma correction (dark_gamma), and dark scenes with unknown light source positions (dark_light). With these methods, the present invention constructs new Gaze360 datasets reflecting a variety of low-light conditions.
S101, for darker scenes (dark_comp) and extremely dark scenes (dark_super), the invention increases the realism of night vision by adjusting the brightness and contrast of the image. Specifically, brightness adjustment is realized by shifting the dark and bright intervals of the image, and contrast adjustment is realized by adjusting the distribution range of colors in the image, so that the colors are more concentrated and the distinction between dark and bright parts is enhanced.
S102, for the low-illumination image set simulated using gamma correction (dark_gamma), the present invention obtains the gamma-corrected output image O through O = I^(1/G) × 255, where I is the input image and G is the gamma value. When the gamma value is greater than 1, the brightness of the image increases accordingly; likewise, when the gamma value is less than 1, the image becomes darker. The closer the gamma value is to 0, the harder the image becomes for the eye to recognize. The invention therefore selects a gamma value of 0.7 to simulate the low-illumination environment by correcting the gamma value of the image.
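For reference, a minimal Python sketch of this gamma-based darkening is given below; it assumes an 8-bit input that is normalized to [0, 1] before the power operation, and the function name is illustrative.

```python
import numpy as np

def gamma_darken(image: np.ndarray, gamma: float = 0.7) -> np.ndarray:
    """Simulate low illumination via O = I^(1/G) * 255 (step S102).

    `image` is assumed to be an 8-bit array; gamma < 1 darkens the image
    because the exponent 1/G is then greater than 1.
    """
    normalized = image.astype(np.float32) / 255.0    # I in [0, 1]
    corrected = np.power(normalized, 1.0 / gamma)    # I^(1/G)
    return np.clip(corrected * 255.0, 0, 255).astype(np.uint8)
```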
S103, to further simulate real low-light conditions, such as museums or night-driving scenes, the invention develops specific image processing steps comprising image darkening and the addition of a local light source effect (dark_light). First, the overall brightness of the image is markedly reduced by adjusting the brightness, enhancing the appearance of dark parts in a night environment. Then, a gradual light source effect is introduced at a random position of the image and Gaussian blur is applied, simulating the illumination of a specific light source, such as a street lamp, and ensuring that the illumination effect blends naturally into the scene, enhancing its realism and visual focus.
Preferably, in step S2, the enhancement network module is configured to extract multi-scale features from the input image and combine them with global context information to achieve detail enhancement and illumination balancing. It consists of an input convolution layer, an ALGCF module, a plurality of convolution blocks, and an output convolution layer; the image enhancement network model is shown in fig. 3.
The specific method comprises the following steps:
S201, first, initial features are extracted: basic features are extracted from the input image I through the initial convolution layer:
F_0 = ReLU(Conv_0(I))
Conv_0 is the initial convolution layer through which features of the input image are extracted. The ReLU activation function is used for nonlinear processing, so that the feature map F_0 retains basic illumination and detail information, providing reliable initial features for subsequent fusion and enhancement.
S202, the adaptive local-global fusion (ALGCF) module is then applied; it is responsible for fusing local and global information.
The local feature extractor is used for extracting detail information of the eye region:
F_local = Conv_1(F_0)
Where Conv_1 is the convolution layer of the local feature extractor, which focuses on extracting ocular details from the initial feature map. The local feature extractor captures important detail features through its convolution kernels.
The global context information extractor is used to obtain the whole illumination and structure information of the face:
F_global = In(Conv_2(AAP(F_0)), size = F_local)
Wherein AAP is an adaptive average pooling layer capable of extracting illumination and structural information from the global context. Conv_2 uses a 1 × 1 convolution kernel to generate the global feature map. The global features are then resized by the interpolation operation In to the same size as the local features for fusion.
Combining the local and global features, generating fusion weights, and carrying out adaptive feature fusion through the fusion weights:
G = σ(Conv_3(Concat(F_local, F_global)))
F_ALGCF = G ⊙ F_local + (1 − G) ⊙ F_global
Wherein Concat combines the local and global features into a fused feature map. Conv_3 is the convolution layer of the gating mechanism and σ denotes the Sigmoid function (likewise below); the fusion weight G generated through the Sigmoid function ensures a reasonable combination of information from different sources. The fused feature F_ALGCF adaptively combines the local and global features according to the weights G and 1 − G, ensuring that the facial feature map has global illumination consistency while preserving local eye details.
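A compact PyTorch sketch of this fusion step is given below to make the data flow concrete; the channel count, kernel sizes, and interpolation mode are assumptions, and only the structure described above (Conv_1, AAP + Conv_2 + In, and the Conv_3 gate) is reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ALGCF(nn.Module):
    """Sketch of the adaptive local-global fusion of step S202."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.local_conv = nn.Conv2d(channels, channels, 3, padding=1)     # Conv_1: local eye details
        self.global_pool = nn.AdaptiveAvgPool2d(1)                        # AAP: global context
        self.global_conv = nn.Conv2d(channels, channels, 1)               # Conv_2: 1x1 global features
        self.gate_conv = nn.Conv2d(channels * 2, channels, 3, padding=1)  # Conv_3: gating mechanism

    def forward(self, f0: torch.Tensor) -> torch.Tensor:
        f_local = self.local_conv(f0)                                     # F_local = Conv_1(F_0)
        f_global = self.global_conv(self.global_pool(f0))                 # Conv_2(AAP(F_0))
        f_global = F.interpolate(f_global, size=f_local.shape[2:])        # In(..., size = F_local)
        gate = torch.sigmoid(self.gate_conv(torch.cat([f_local, f_global], dim=1)))  # G
        return gate * f_local + (1.0 - gate) * f_global                   # F_ALGCF
```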
S203, features are further extracted from the fused feature map F_ALGCF through a plurality of convolution blocks:
F_(i+1) = F_i + ReLU(BN(Conv_i(F_i)))
Conv_i is the convolution layer of each convolution block, F_i is the i-th layer feature map, BatchNorm normalizes the features, and the ReLU activation function ensures nonlinear processing. Through the residual connection (+), each convolution block preserves the integrity of the input features while extracting new ones.
The output convolution layer converts the fused feature map into an enhanced image:
F_output = σ(Conv_4(F_(i+1)))
Conv_4 is the output convolution layer; it converts the feature map into the final enhanced image through convolution and normalizes the pixel values to the range 0-1 through a Sigmoid activation function.
S204, finally, the enhancement feature map is added to the input image to obtain the final enhanced image:
I_enhanced = Clamp(F_output + I, 0, 1)
The Clamp function ensures that the pixel values of the enhanced image remain within a reasonable range. By enhancing details and fusing them with the original image, the final enhanced image has higher brightness and contrast.
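Putting S201-S204 together, the following sketch (reusing the ALGCF class and imports from the previous block) shows one plausible arrangement of the enhancement network; the number of convolution blocks and the channel width are assumptions.

```python
class EnhancementNet(nn.Module):
    """Sketch of the enhancement network: initial conv, ALGCF fusion,
    residual convolution blocks, output conv, and a clamped residual add."""

    def __init__(self, channels: int = 32, num_blocks: int = 3):
        super().__init__()
        self.conv0 = nn.Conv2d(3, channels, 3, padding=1)             # Conv_0
        self.algcf = ALGCF(channels)                                  # S202
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels), nn.ReLU())
            for _ in range(num_blocks)])                              # Conv_i blocks (S203)
        self.conv4 = nn.Conv2d(channels, 3, 3, padding=1)             # Conv_4: output layer

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        f = torch.relu(self.conv0(image))                             # F_0
        f = self.algcf(f)                                             # F_ALGCF
        for block in self.blocks:
            f = f + block(f)                                          # F_(i+1) = F_i + ReLU(BN(Conv_i(F_i)))
        f_output = torch.sigmoid(self.conv4(f))                       # F_output
        return torch.clamp(f_output + image, 0.0, 1.0)                # I_enhanced (S204)
```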
Preferably, in step S3, the calibration network model is shown in fig. 4. The enhanced image I_enhanced passes through an initial convolution layer to obtain the calibration feature map:
F_calib(0) = ReLU(BN(Conv_5(I_enhanced)))
Conv_5 is the input convolution layer of the calibration network; it extracts the calibration feature map, which is normalized through BatchNorm to ensure its stability and consistency. The ReLU activation function is used for nonlinear processing, making the features more discriminative. F_calib(0) denotes the feature map obtained after the enhanced image is processed by this initial convolution layer; it is the first feature map in the calibration module and the basis for the subsequent convolution blocks.
The calibration feature map F_calib(0) is subjected to illumination and detail adjustment through a plurality of convolution blocks:
F_calib(i) = F_calib(i−1) + ReLU(BN(Conv_i(F_calib(i−1))))
Wherein F_calib(i−1) denotes the output feature map of layer i−1, which is also the input of the i-th convolution block; it provides the input data for the current layer and contains the accumulated feature information of the preceding layers. F_calib(i) is the output feature map of the i-th convolution block, combining the original input feature F_calib(i−1) with the new features obtained after convolution, batch normalization, and ReLU. The residual connection (+) helps prevent the vanishing-gradient problem in deep networks while ensuring that sufficient original information is retained even in deep layers.
The final calibration feature map passes through the output convolution layer to generate a difference image:
ΔI = σ(Conv_out(F_calib(i)))
Conv_out is the output convolution layer; it generates the final difference image through convolution, and the Sigmoid function ensures that the output values remain within a reasonable range.
Finally, subtracting the difference image from the enhanced input image to obtain a calibrated final image:
I_calibrated = I_enhanced − ΔI
Through this subtraction, the calibration module eliminates illumination non-uniformity and artifacts in the enhanced image, so that the final calibrated image is closer to the ideal state.
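A corresponding sketch of the calibration network is shown below, again with an assumed depth and channel width; it only mirrors the structure described in this step (Conv_5 with BN and ReLU, residual convolution blocks, Conv_out, and the subtraction of ΔI).

```python
class CalibrationNet(nn.Module):
    """Sketch of the calibration network of step S3."""

    def __init__(self, channels: int = 32, num_blocks: int = 3):
        super().__init__()
        self.conv5 = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1),
                                   nn.BatchNorm2d(channels), nn.ReLU())   # Conv_5 + BN + ReLU
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels), nn.ReLU())
            for _ in range(num_blocks)])                                  # residual refinement blocks
        self.conv_out = nn.Conv2d(channels, 3, 3, padding=1)              # Conv_out

    def forward(self, i_enhanced: torch.Tensor) -> torch.Tensor:
        f = self.conv5(i_enhanced)                    # F_calib(0)
        for block in self.blocks:
            f = f + block(f)                          # F_calib(i)
        delta = torch.sigmoid(self.conv_out(f))       # ΔI
        return i_enhanced - delta                     # I_calibrated
```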
Preferably, in step S4, feature extraction is performed on the calibrated image using the improved residual network model ResNet; by introducing attention mechanism modules in its second and third stages and a feature purification module in its fourth stage, the performance of the model in various low-light environments is significantly optimized.
S401, an attention mechanism module:
Spatial attention focuses on specific areas of the enhanced image, enabling the model to concentrate on critical regions such as the eyes. The attention mechanism model diagram is shown in fig. 5. The present invention captures a wider range of contextual information by using a larger convolution kernel to emphasize the importance of the eye region:
S(x) = σ(Conv_spatial(x))
Where S(x) denotes the spatial weight map obtained by passing x through the convolution layer Conv_spatial (using a 7 × 7 convolution kernel and appropriate padding to keep the feature map size unchanged) and the Sigmoid function σ.
Channel attention evaluates the extent to which each channel contributes to line of sight estimation, and suppresses unimportant channels by enhancing the characteristics of important channels, thereby optimizing the quality of the overall characteristics.
C(x)=σ(Conv↑(ReLU(Conv↓(GAP(x)))))
Where GAP(x) denotes global average pooling of the input x, compressing the spatial information of each channel into a single value. The channel weights are then generated through a two-layer convolution operation (dimension reduction followed by dimension expansion) with a ReLU activation function, and finally passed through the Sigmoid function σ.
Feature correlation analysis is then introduced; it not only fuses the spatial and channel attention but also optimizes the interaction between features, enhancing their integration. The spatially and channel-weighted features are concatenated along the channel dimension to yield the feature F_concat:
F_concat = [x ⊙ S(x), x ⊙ C(x)]
Wherein x ⊙ S(x) and x ⊙ C(x) are the feature maps with the spatial and channel weights applied, respectively.
Feature correlation analysis is then performed on the concatenated features:
R(x) = σ(Conv_↑(ReLU(Conv_↓(F_concat))))
Conv_↓ is a dimension-reducing convolution layer that, by reducing the number of channels, helps the module focus on capturing the most critical feature correlation information. The ReLU activation function adds nonlinearity, enabling the model to capture more complex feature relationships. Conv_↑ is an expanding convolution layer that restores the number of channels to the original value and generates the final feature adjustment weights. These weights are constrained to a reasonable range by the Sigmoid function, ensuring that they can act as effective scaling factors on the original input.
Through feature correlation analysis, the model can dynamically evaluate and adjust the relationships between features so that important features are highlighted while uncorrelated or noisy features are suppressed. The fine feature processing strategy remarkably improves the performance of the model under complex illumination and low light conditions, and ensures the accuracy and the robustness of sight estimation.
Combining the above components, the enhanced feature F_R(x) is obtained by:
F_R(x) = x · R(x)
By combining spatial and channel attention, and feature correlation analysis, the proposed attention mechanism is able to effectively identify and enhance key ocular features in gaze estimation, especially in varying illumination and complex visual scenes.
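The following PyTorch sketch assembles the three branches described in S401 (spatial attention, channel attention, and feature correlation analysis); the reduction ratio and the exact layer arrangement are assumptions made for illustration.

```python
class EyeFocusedSCAttention(nn.Module):
    """Sketch of the eye-focused spatial/channel attention with
    feature correlation analysis (S401)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 7, padding=3)        # Conv_spatial (7x7)
        self.channel = nn.Sequential(                                      # channel attention branch
            nn.AdaptiveAvgPool2d(1),                                       # GAP
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),      # Conv_↓
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())   # Conv_↑ + σ
        self.correlation = nn.Sequential(                                  # feature correlation analysis
            nn.Conv2d(channels * 2, channels // reduction, 1), nn.ReLU(),  # Conv_↓ on F_concat
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())   # Conv_↑ + σ

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = torch.sigmoid(self.spatial(x))             # S(x): spatial weight map
        c = self.channel(x)                            # C(x): per-channel weights
        f_concat = torch.cat([x * s, x * c], dim=1)    # F_concat
        r = self.correlation(f_concat)                 # R(x)
        return x * r                                   # F_R(x) = x · R(x)
```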
S402, a characteristic purifying module:
A depthwise separable convolution is used for basic feature extraction; this effectively reduces the computational complexity while maintaining extraction efficiency:
F_basic = ReLU(BN(Conv(x)))
Wherein Conv(x) denotes a grouped convolution for extracting grouped features; grouping reduces model complexity, keeps parameters independent across channels, and reduces the number of parameters. F_basic denotes the preliminarily processed feature map.
The adaptive weight layer adjusts its contribution to the final output by learning the importance of each feature, thereby optimizing the representation of the key features:
W_adaptive = σ(Conv_adaptive(F_basic))
Where W_adaptive denotes the adaptive weights generated for the feature map, and Conv_adaptive comprises a two-step convolution operation that adjusts the feature weights by reducing and then expanding the number of channels.
The region attention module focuses on the eye region, emphasizing important features by generating an attention mask for a particular region:
M_focus = σ(Conv_focus(GAP(F_basic)))
Wherein GAP aggregates global information and provides context for the local features, M_focus is the focus mask, and Conv_focus comprises a convolution step that converts the global information into the focus mask.
Finally, the adaptive weights W_adaptive and the attention mask M_focus are combined to perform detail enhancement and feature optimization:
F_enhanced = (F_basic × M_focus) × W_adaptive
Combining region focusing with weight adjustment effectively improves the expression of eye details and the quality of the overall features. By precisely adjusting the contribution and focal region of each feature, this module ensures that high-quality, high-accuracy ocular features can be obtained for gaze estimation. The feature purification model is shown in fig. 6.
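A sketch of this module is given below; the grouped (depthwise) convolution, the two-step Conv_adaptive, and the GAP-based Conv_focus follow the description above, while the reduction ratio is an assumption.

```python
class FeaturePurification(nn.Module):
    """Sketch of the feature purification module (S402)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.basic = nn.Sequential(                                        # grouped conv + BN + ReLU
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.adaptive = nn.Sequential(                                     # Conv_adaptive: reduce, expand
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.focus = nn.Sequential(                                        # GAP + Conv_focus + σ
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_basic = self.basic(x)                        # F_basic
        w_adaptive = self.adaptive(f_basic)            # W_adaptive
        m_focus = self.focus(f_basic)                  # M_focus (broadcast over H, W)
        return f_basic * m_focus * w_adaptive          # F_enhanced
```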
S403, residual connection module:
In the ResNet-18 reference model, the input image is first processed through a 7 × 7 convolution layer, a BN layer, and a ReLU activation function, followed by a 3 × 3 max pooling layer. The network then passes through four stages, each consisting of residual blocks; a feature purification module and attention mechanism modules are added on top of the ResNet-18 reference model to increase the accuracy of gaze estimation in low-light scenes. The structure of the feature extraction network after low-light processing is shown in fig. 7, where Res1, Res2, Res3, and Res4 denote the four residual stages, each with a similar structure.
After the Res2 and Res3 stages, the attention mechanism modules are integrated. These modules optimize feature recognition of key visual areas through spatial and channel attention and feature correlation analysis, helping the model process and exploit key eye information more accurately, so that more accurate and more stable gaze tracking can be achieved under various low-light conditions.
F_attention = EyeFocusedSCAttentionModule(F_input)
The feature purification module is embedded after the Res4 stage. Through the optimization of adaptive weights and the region attention mechanism, this module particularly strengthens the model's perception of eye details. This strategy is motivated by the importance of ocular features in gaze estimation, especially under low-light or complex lighting conditions.
F_purify = EnhancedFeaturePurificationModule(F_input)
By alternately applying the feature purification and attention mechanisms at different stages, the advantages of the deep network can be preserved while optimizing for the gaze estimation requirements of various low-light scenes.
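To illustrate where the modules sit, the sketch below wraps a torchvision ResNet-18 and inserts the attention and purification sketches from the previous blocks after the stages named above; the use of torchvision and the stated channel widths are assumptions.

```python
from torchvision.models import resnet18

class LowLightGazeBackbone(nn.Module):
    """Sketch of S403: attention after Res2/Res3, purification after Res4."""

    def __init__(self):
        super().__init__()
        base = resnet18(weights=None)
        self.stem = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool)
        self.res1, self.res2, self.res3, self.res4 = (
            base.layer1, base.layer2, base.layer3, base.layer4)
        self.attn2 = EyeFocusedSCAttention(128)    # after Res2 (128 channels in ResNet-18)
        self.attn3 = EyeFocusedSCAttention(256)    # after Res3
        self.purify = FeaturePurification(512)     # after Res4
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)
        x = self.res1(x)
        x = self.attn2(self.res2(x))
        x = self.attn3(self.res3(x))
        x = self.purify(self.res4(x))
        return torch.flatten(self.pool(x), 1)      # feature vector for the gaze head (S5)
```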
Preferably, in steps S5-S8, in order to evaluate the predictive power of the model, the present invention measures the difference between the predicted result and the true gaze with the mean square error MSELoss as the loss function. The loss function MSELoss of this module is expressed as:
L_mse = (1/N) Σ_(i=1..N) (P_i − Y_i)²
where L_mse denotes the mean square error loss, N is the number of samples, and P_i and Y_i denote the predicted and ground-truth values of the i-th sample, respectively. The model is optimized using L_mse; by back-propagating the error, the parameters of the network are iteratively updated, gradually improving the performance of the model on the gaze estimation task.
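A minimal training-step sketch for S8 is shown below; it assumes a model that returns (gaze, uncertainty) as in the head sketch above, and it applies the MSE loss only to the predicted angles, since the source does not specify whether the uncertainty output enters the loss.

```python
criterion = nn.MSELoss()  # L_mse

def training_step(model, images, gaze_targets, optimizer):
    """One illustrative optimization step for S8."""
    gaze_pred, _ = model(images)                  # predicted horizontal/vertical angles
    loss = criterion(gaze_pred, gaze_targets)     # L_mse = mean((P_i - Y_i)^2)
    optimizer.zero_grad()
    loss.backward()                               # back-propagate the error
    optimizer.step()                              # update the network parameters
    return loss.item()
```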
Example 1:
The Gaze360 dataset contains video data collected from 238 subjects in the real world; it is large-scale and combines 3D gaze annotations, a wide range of gaze and head poses, varied indoor and outdoor capture environments, and subject diversity. On the Gaze360 dataset, the present invention follows the predefined training-test split and trains with only the 84902 frontal face images. The test set contains 16031 images in total to fully evaluate the performance of the model. The four preprocessed Gaze360 datasets (dark_comp, dark_super, dark_gamma, dark_light) are each used to train the entire FWF-Gaze network.
Environment: the experiments were performed on an NVIDIA GeForce RTX 3090 GPU in a Windows environment using the PyTorch framework. The model hyper-parameter settings are the same for all datasets. The model was trained for 100 epochs with a batch size of 40, a learning rate of 0.0001, a decay of 1, and a decay step size of 5000.
Evaluation index: the angle error, the mainstream evaluation index for gaze estimation, is adopted to compare performance with other gaze estimation models; it is the angle between the predicted and the true gaze direction, and a smaller value indicates a better result. Assuming the actual gaze direction is g and the estimated gaze direction is ĝ, the angle error can be calculated as:
angle error = arccos( (g · ĝ) / (‖g‖ ‖ĝ‖) )
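For reference, a short sketch of this metric is given below; it assumes both directions are provided as 3-D vectors (the conversion from predicted yaw/pitch angles to a direction vector follows the dataset convention and is omitted here).

```python
import torch.nn.functional as F

def angular_error_degrees(gaze_true: torch.Tensor, gaze_pred: torch.Tensor) -> torch.Tensor:
    """Angle (in degrees) between true and predicted 3-D gaze vectors."""
    cos_sim = F.cosine_similarity(gaze_true, gaze_pred, dim=-1)
    cos_sim = cos_sim.clamp(-1.0, 1.0)        # numerical safety before arccos
    return torch.rad2deg(torch.acos(cos_sim))
```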
The comparison models adopt the advanced gaze estimation methods Gaze360, FullFace, RT-Gene, and Dilated-Net. The experimental setup of each method follows the corresponding paper, including the model architecture and hyper-parameters, in order to reproduce its network performance.
The experimental results are shown in table 1:
table 1 experimental results of the network and other advanced networks proposed by the present invention
As shown by the experimental data in Table 1, the method of the invention effectively alleviates the drop in gaze estimation accuracy in low-light environments and has strong practical value.
Example 2:
This embodiment introduces an applicable scenario of the present invention:
Gaze estimation has a wide range of application scenarios. One of them is intelligent interaction systems, where the FWF-Gaze network can be applied to detect driver fatigue; such an application contributes to improved safety and convenience.
An important indicator of driver fatigue is the driver's gaze state, and the FWF-Gaze network model of the present invention is used to predict the gaze.
First, while the driver is driving, a face image of the driver can be captured in real time by a camera.
Then, the captured face image is input into the FWF-Gaze network model of the present invention to predict the driver's gaze state.
Finally, when the model predicts that the driver is likely to be in a fatigued state, the system may alert the driver to rest by sound or other means, or automatically switch to an autonomous driving mode (if the vehicle supports it).
The examples of the present invention are merely intended to describe preferred embodiments of the invention and are not intended to limit its spirit and scope; those skilled in the art may make various changes and improvements to the technical solution of the invention without departing from its spirit.