
CN117830783B - Sight estimation method based on local super-resolution fusion attention mechanism - Google Patents


Info

Publication number
CN117830783B
CN117830783B (application CN202410005814.8A)
Authority
CN
China
Prior art keywords
resolution
eye
attention
face
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410005814.8A
Other languages
Chinese (zh)
Other versions
CN117830783A (en)
Inventor
王进
曹硕裕
王可
杨杨
梁瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Boxin Energy Technology Co ltd
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202410005814.8A priority Critical patent/CN117830783B/en
Publication of CN117830783A publication Critical patent/CN117830783A/en
Application granted granted Critical
Publication of CN117830783B publication Critical patent/CN117830783B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The present invention belongs to the field of computer vision and specifically relates to a gaze estimation method based on a local super-resolution fused attention mechanism. The method comprises the following steps: step S1, capturing a frame image with a camera; step S2, detecting and locating the face region and the two eye regions with a face detection model, cropping the face image, and cutting out the eye images; step S3, passing the face image through a face attention-enhanced feature extraction module to strengthen and extract face features; step S4, passing the eye images through an eye feature extraction module based on local super-resolution to extract eye features; step S5, fusing the extracted face and eye features through a fully connected layer to obtain the gaze estimation result. The invention extracts eye features after super-resolution reconstruction for accurate gaze estimation and enhances low-resolution global features along both the spatial and channel dimensions, which improves the ability to extract face features in low-resolution environments and thereby the quality of gaze estimation.

Description

A gaze estimation method based on a local super-resolution fused attention mechanism

Technical Field

The present invention belongs to the field of computer vision and in particular relates to a gaze estimation method based on a local super-resolution fused attention mechanism.

Background Art

Gaze estimation is a key subfield of computer vision that focuses on determining the direction in which a person's eyes are looking. Gaze behavior is an important component of human social interaction, and a great deal of latent information can be obtained by analyzing gaze direction: for example, a shopping mall can infer which products attract the most interest from customers' gaze data, and an invigilator can judge whether a student may be cheating from the direction of the student's gaze. Gaze estimation has also been widely applied in fields such as virtual reality, driver assistance systems, and human-computer interaction, where it has broad application prospects.

With the development of deep learning, gaze estimation methods based on convolutional neural networks have become widespread. However, these methods usually require large datasets and are mostly evaluated under ideal conditions; in particular, they are typically trained on high-resolution facial images. In practice, factors such as camera quality and the distance to the face often make the facial input unclear; in the customer gaze analysis and exam proctoring scenarios mentioned above, limited camera resolution and subject distance lead to blurry input images. Face and eye images at different resolutions are shown in Figure 1: as the input resolution decreases, information is progressively lost, making it increasingly difficult for the network to extract features and reducing gaze estimation accuracy.

In real scenarios, input face images are therefore often of low clarity because of low camera resolution, long face-to-camera distance, and similar factors; information is gradually lost, feature extraction becomes harder, and gaze estimation accuracy drops. Current gaze estimation techniques achieve low accuracy in low-resolution scenarios, and no effective solution has yet been established.

Summary of the Invention

The purpose of the present invention is to overcome the shortcomings of the prior art by proposing a gaze estimation method based on a local super-resolution fused attention mechanism.

The technical solution adopted by the method of the present invention is as follows.

A gaze estimation method based on a local super-resolution fused attention mechanism comprises the following steps:

Step S1: capture a frame image with a camera.

Step S2: detect and locate the face region and the two eye regions with a face detection model, crop the face image, and cut out the eye images.

Step S3: pass the face image through the face attention-enhanced feature extraction module to strengthen and extract face image features.

Step S4: pass the eye images through the eye feature extraction module based on local super-resolution to extract eye image features.

Step S5: fuse the extracted face and eye features through a fully connected layer to obtain the gaze estimation result.

Further, as a preferred technical solution of the present invention, in step S2 the face detection model is a face detection model based on a convolutional neural network. Compared with traditional face detectors, a CNN-based detector is more accurate, handles a wider range of difficult conditions, and offers better robustness, real-time performance, and adaptability. Specifically, the "dlib.cnn_face_detection_model_v1" model is used; it contains the weights and structure of a convolutional neural network trained on a large amount of image data, and once loaded it can detect faces in images.

Further, as a preferred technical solution of the present invention, in step S3 the face attention-enhanced feature extraction module uses the ResNet18 model as the baseline of the face feature extraction module. Each standard residual block is given by:

Fout=l(f(Fin,Wi)+Fin) (1)

where l denotes the ReLU activation function, f is the weighted operation inside the residual block, Wi are the weights of that operation, Fin is the input of the residual block, and Fout is its output.

Further, as a preferred technical solution of the present invention, in step S3 the face attention-enhanced feature extraction module adds the CBAM attention mechanism on top of the ResNet18 baseline model to increase gaze estimation accuracy. CBAM inserts two attention modules into the feature maps: a channel attention module and a spatial attention module. The channel attention module assigns a weight to each channel, generally computed from features obtained by global average pooling and global max pooling. For a given feature map F∈RC×H×W, global average pooling and global max pooling are computed first; the two pooled vectors are then processed by a shared multi-layer perceptron and combined:

MCA(F)=σ(MLP(Fgap)+MLP(Fgmp)) (2)

where σ denotes the Sigmoid activation function, MCA denotes channel attention, Fgap denotes the globally average-pooled feature, and Fgmp the globally max-pooled feature.

The spatial attention module assigns a weight to each spatial position. Global average pooling and global max pooling are again computed, this time along the channel dimension; the two resulting maps are concatenated, passed through a 7×7 convolution layer, and then through a Sigmoid activation:

MSA(F)=σ(Conv7×7([Fgap;Fgmp])) (3)

where Conv7×7 denotes a 7×7 convolution layer.

The face attention-enhanced feature extraction module appends a CBAM attention module at the end of each residual stage to enhance the features. In each residual stage the feature map is downsampled (denoted ↓) and passed through the stage's residual blocks; the stage output is then refined by CBAM, with the channel attention map applied first and the spatial attention map second, each by element-wise multiplication (⊗), yielding the CBAM attention-enhanced facial feature map FFA.

Further, as a preferred technical solution of the present invention, in step S4 the eye feature extraction module based on local super-resolution uses FSRCNN as the super-resolution reconstruction network. FSRCNN is a deep convolutional neural network designed for single-image super-resolution; it is an improvement on SRCNN aimed at increasing the speed and efficiency of super-resolution reconstruction.

FSRCNN consists of three main parts: feature extraction, shrinking and expanding, and deconvolution. The first part, feature extraction, uses a small 5×5 convolution kernel to extract features from the low-resolution eye image; its goal is to extract from the original low-resolution image the useful information that will later be used to reconstruct the high-resolution image. The feature extraction stage applies a 5×5 convolution followed by a PReLU activation to the low-resolution eye image ILR and produces d feature maps.

The second part is shrinking and expanding. The shrinking stage uses 1×1 convolution kernels to reduce the number of feature maps, which reduces the number of model parameters and thus speeds up computation; the mapping stage then performs a non-linear mapping on the shrunken feature space through several consecutive 3×3 convolution layers.

The third part is deconvolution: a deconvolution layer enlarges the spatial size of the feature maps to produce the high-resolution output. Unlike traditional bicubic interpolation, the deconvolution at this stage is learned, and it converts the low-resolution feature maps into the feature map of the high-resolution image:

FSR=DeConv9×9(F) (9)

where DeConv denotes the deconvolution operation.

Further, as a preferred technical solution of the present invention, in step S4 the eye feature extraction module based on local super-resolution uses DeepEyeNet to extract eye features. DeepEyeNet is the deep eye feature extraction CNN proposed in this invention. It consists of ten convolution blocks and is a relatively deep convolutional neural network in which the spatial size of the feature maps gradually decreases with depth; such a deep network can better extract the super-resolved eye features for accurate gaze estimation. The left and right eyes pass through similar structures:

F′E=FCE(FLAT(γ(FSR))) (10)

where γ denotes the feature extraction operation of DeepEyeNet, F′E the output eye feature, and FCE the fully connected operation on the eye features.

After the face and eye features are obtained, the network passes through a fully connected layer that joins the face, left-eye, and right-eye features and outputs the final two-dimensional feature as the gaze estimation result:

ξpred=FCT([FCFA(FFA);F′E]) (11)

where FCFA denotes the fully connected operation on the face features.

Further, as a preferred technical solution of the present invention, MSELoss is used as the loss function for gaze estimation. MSELoss, the mean squared error, is the expected value of the squared error between the model's predicted value and the true value; a smaller MSELoss indicates a smaller error. The gaze estimation loss of this module is therefore the mean squared error between the true gaze ξgt and the predicted gaze ξpred.

Compared with the prior art, the present invention has the following beneficial effects:

(1) The present invention designs an eye feature extraction module based on local super-resolution. The network first performs super-resolution reconstruction of the eye images to efficiently and quickly restore low-resolution gaze-related features, and then applies a novel deep convolutional neural network in which the spatial size of the feature maps gradually decreases with depth; this network can better extract the super-resolved eye features for accurate gaze estimation.

(2) The present invention proposes a face attention-enhanced feature extraction network. It improves the ordinary ResNet18 feature extraction network and strengthens low-resolution global features along both the spatial and channel dimensions, increasing the ability to extract face features in low-resolution environments and thereby improving gaze estimation.

(3) The present invention improves the accuracy of gaze estimation in low-resolution environments.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings provide a further understanding of the present invention and constitute a part of the specification; together with the embodiments they explain the invention and do not limit it.

Figure 1 is a schematic diagram of face and eye images at different resolutions in the prior art;

Figure 2 is a flow chart of the method of an embodiment of the present invention;

Figure 3 is a diagram of the overall network framework of an embodiment of the present invention;

Figure 4 is a structural diagram of the attention-improved residual stage of an embodiment of the present invention;

Figure 5 is a structural diagram of the eye feature extraction module of an embodiment of the present invention.

DETAILED DESCRIPTION

The present invention is further explained below in detail with reference to the accompanying drawings, so that those skilled in the art can understand the invention more deeply and implement it; the following examples are provided only to explain the invention and do not limit it.

Face and eye images at different resolutions are shown in Figure 1. In the 128×128 HR case the images are information-rich: the two eyes and the full face can be fully characterized, and the gaze direction can even be judged with the naked eye. In the 64×64, 32×32, and 16×16 LR cases, information is progressively lost as the resolution decreases and feature extraction becomes increasingly difficult; in the extreme 16×16 case even the shape of the eyes cannot be distinguished, which undoubtedly increases the difficulty of gaze estimation.

As shown in Figure 2, a gaze estimation method based on a local super-resolution fused attention mechanism comprises the following steps: step S1, capture a frame image with a camera; step S2, detect and locate the face region and the two eye regions with a face detection model, crop the face image, and cut out the eye images; step S3, pass the face image through the face attention-enhanced feature extraction module to strengthen and extract face image features; step S4, pass the eye images through the eye feature extraction module based on local super-resolution to extract eye image features; step S5, fuse the extracted face and eye features through a fully connected layer to obtain the gaze estimation result.

In step S2 a face detection model based on a convolutional neural network is used. Compared with traditional face detectors, a CNN-based detector is more accurate, handles a wider range of difficult conditions, and offers better robustness, real-time performance, and adaptability. Specifically, the "dlib.cnn_face_detection_model_v1" model is used; it contains the weights and structure of a convolutional neural network trained on a large amount of image data, and once loaded it can detect faces in images.
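For illustration only, the following is a minimal Python sketch of this cropping step, assuming dlib's CNN face detector and its 68-point landmark predictor are available as model files; the landmark predictor and the eye-landmark indices (36-41 and 42-47) are standard dlib conventions and are not specified in the patent.

```python
import cv2
import dlib

cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
landmark_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_face_and_eyes(frame_bgr):
    """Return the cropped face image and the two eye patches from a BGR frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    detections = cnn_detector(rgb, 1)            # upsample once to catch small faces
    if not detections:
        return None
    rect = detections[0].rect                    # first detected face
    face = frame_bgr[rect.top():rect.bottom(), rect.left():rect.right()]

    shape = landmark_predictor(rgb, rect)
    def eye_patch(start, end, margin=5):
        pts = [(shape.part(i).x, shape.part(i).y) for i in range(start, end)]
        xs, ys = zip(*pts)
        return frame_bgr[min(ys) - margin:max(ys) + margin,
                         min(xs) - margin:max(xs) + margin]

    right_eye = eye_patch(36, 42)                # landmarks 36-41
    left_eye = eye_patch(42, 48)                 # landmarks 42-47
    return face, left_eye, right_eye
```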

In step S3, the ResNet18 model is used as the baseline model of the face feature extraction module.

The residual neural network was first proposed by He et al. The core of this architecture is the residual block, which alleviates the vanishing- and exploding-gradient problems in training deep neural networks. The basic idea of the residual block is that learning the residual between the output and the input may be easier than learning the output directly.

Each standard residual block is given by:

Fout=l(f(Fin,Wi)+Fin) (1)

where l denotes the ReLU activation function, f is the weighted operation inside the residual block, Wi are the weights of that operation, Fin is the input of the residual block, and Fout is its output.
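For reference, a minimal PyTorch sketch of the standard residual block in formula (1) is given below; the use of two 3×3 convolutions with batch normalization follows the usual ResNet18 basic block and is an assumption, since the patent only states the residual formula.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Fout = ReLU(f(Fin, W) + Fin), as in formula (1)."""
    def __init__(self, channels):
        super().__init__()
        # f(Fin, W): two 3x3 convolutions with batch norm (assumed ResNet18-style block).
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # identity shortcut plus weighted branch
```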

In step S3 an attention mechanism is added on top of the ResNet18 baseline to increase gaze estimation accuracy. The structure of the face feature extraction network is shown in the top branch of Figure 3, where Res1, Res2, Res3, and Res4 denote the four residual stages of ResNet18; each stage has a similar structure.

CBAM is a method for introducing an attention mechanism into convolutional neural networks. It is designed to enhance the network's use of spatial and channel information without significantly increasing computational complexity. The main idea of CBAM is to apply two attention modules to the feature maps in sequence: a channel attention module and a spatial attention module.

Channel attention assigns a weight to each channel, generally computed from features obtained by global average pooling and global max pooling. For a given feature map F∈RC×H×W, global average pooling and global max pooling are computed first; the two pooled vectors are then processed by a shared multi-layer perceptron and combined:

MCA(F)=σ(MLP(Fgap)+MLP(Fgmp)) (2)

where σ denotes the Sigmoid activation function, MCA denotes channel attention, Fgap denotes the globally average-pooled feature, and Fgmp the globally max-pooled feature.

Spatial attention assigns a weight to each spatial position. Global average pooling and global max pooling are again computed, this time along the channel dimension; the two resulting maps are concatenated, passed through a 7×7 convolution layer, and then through a Sigmoid activation:

MSA(F)=σ(Conv7×7([Fgap;Fgmp])) (3)

where Conv7×7 denotes a 7×7 convolution layer.

The face attention-enhanced feature extraction module appends a CBAM attention module at the end of each residual stage to enhance the features, so that the module can emphasize the more important feature channels and spatial positions and better extract face features in low-resolution environments. In each residual stage the feature map is downsampled (denoted ↓) and passed through the stage's residual blocks; the stage output is then refined by CBAM, with the channel attention map applied first and the spatial attention map second, each by element-wise multiplication (⊗), yielding the CBAM attention-enhanced facial feature map FFA.
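A compact PyTorch sketch of the channel and spatial attention described above is shown below; the reduction ratio of 16 in the shared MLP is a common CBAM default and an assumption, not a value given in the patent.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention: shared MLP over globally average- and max-pooled features.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: 7x7 convolution over the concatenated avg/max maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        # MCA(F) = sigmoid(MLP(Fgap) + MLP(Fgmp)), formula (2)
        gap = x.mean(dim=(2, 3), keepdim=True)
        gmp = x.amax(dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.mlp(gap) + self.mlp(gmp))
        # MSA(F) = sigmoid(Conv7x7([avg; max] along the channel dimension)), formula (3)
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        return x * torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
```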

The structure of the first attention-improved residual stage is shown in Figure 4; the remaining three stages use a similar structure.

In step S4 a local super-resolution method based on the eye images, referred to as the LSRE module, is used to extract features: super-resolution reconstruction is applied only to the eye images, not to the whole face image. The reason is that in gaze estimation the eye images carry the main gaze-related features and contribute most to the estimate, so super-resolving only the eyes improves the efficiency of the model.

In step S4, FSRCNN is used as the super-resolution reconstruction network. FSRCNN is a deep convolutional neural network designed for single-image super-resolution; it is an improvement on SRCNN aimed at increasing the speed and efficiency of super-resolution reconstruction.

FSRCNN consists of three main parts: feature extraction, shrinking and expanding, and deconvolution. The first part, feature extraction, uses a small 5×5 convolution kernel to extract features from the low-resolution eye image; its goal is to extract from the original low-resolution image the useful information that will later be used to reconstruct the high-resolution image. The feature extraction stage applies a 5×5 convolution followed by a PReLU activation to the low-resolution eye image ILR and produces d feature maps.

The second part is shrinking and expanding. The shrinking stage uses 1×1 convolution kernels to reduce the number of feature maps, which reduces the number of model parameters and thus speeds up computation; the mapping stage then performs a non-linear mapping on the shrunken feature space through several consecutive 3×3 convolution layers, which capture higher-level features of the image and further strengthen the model's ability to reproduce detail.

The third part is deconvolution: a deconvolution layer enlarges the spatial size of the feature maps to produce the high-resolution output. Unlike traditional bicubic interpolation, the deconvolution at this stage is learned, and it converts the low-resolution feature maps into the feature map of the high-resolution image:

FSR=DeConv9×9(F) (9)

where DeConv denotes the deconvolution operation.
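The following is a minimal PyTorch sketch of the FSRCNN structure described above (5×5 feature extraction, 1×1 shrinking, several 3×3 mapping layers, 1×1 expanding, 9×9 deconvolution); the channel counts d=56 and s=12 and the use of m=4 mapping layers are the defaults of the original FSRCNN paper and are assumptions here.

```python
import torch
import torch.nn as nn

class FSRCNN(nn.Module):
    def __init__(self, scale=4, d=56, s=12, m=4, in_channels=3):
        super().__init__()
        # Feature extraction: 5x5 convolution on the low-resolution input, d feature maps.
        self.extract = nn.Sequential(nn.Conv2d(in_channels, d, 5, padding=2), nn.PReLU(d))
        # Shrinking (1x1), m mapping layers (3x3), expanding (1x1).
        layers = [nn.Conv2d(d, s, 1), nn.PReLU(s)]
        for _ in range(m):
            layers += [nn.Conv2d(s, s, 3, padding=1), nn.PReLU(s)]
        layers += [nn.Conv2d(s, d, 1), nn.PReLU(d)]
        self.map = nn.Sequential(*layers)
        # Learned 9x9 deconvolution producing the high-resolution output, formula (9).
        self.deconv = nn.ConvTranspose2d(d, in_channels, 9, stride=scale,
                                         padding=4, output_padding=scale - 1)

    def forward(self, x):
        return self.deconv(self.map(self.extract(x)))
```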

In step S4 the eye feature extraction module DeepEyeNet is used to extract features; its structure is shown in Figure 5. DeepEyeNet is the deep eye feature extraction CNN proposed in this invention. It consists of ten convolution blocks, and its shallow feature extraction part is modeled on the VGG16 network. Each of the ten convolution blocks contains a Conv2d convolution layer, batch normalization, and ReLU; max pooling is applied at the end of the 2nd, 5th, and 8th blocks to shrink the feature maps, and the output channels of the convolution layers are 64×2, 128×3, 256×3, 512, and 1024 in turn. The last convolution layer is followed by an adaptive average pooling layer that reduces the output feature map to 1×1×1024. This is a relatively deep convolutional neural network in which the spatial size of the feature maps gradually decreases with depth; such a deep network can better extract the super-resolved eye features for accurate gaze estimation. The left and right eyes pass through similar structures, and because the two eyes are similar, the two branches share their weights. The formula is as follows:

F′E=FCE(FLAT(γ(FSR))) (10)

where γ denotes the feature extraction operation of DeepEyeNet, F′E the output eye feature, and FCE the fully connected operation on the eye features.
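Based on the block description above (two 64-channel blocks, three at 128, three at 256, one at 512 and one at 1024, max pooling after blocks 2, 5 and 8, and a final adaptive average pool), one possible PyTorch sketch of DeepEyeNet is given below; the 3×3 kernel size is an assumption, as the patent does not state it.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class DeepEyeNet(nn.Module):
    """Ten convolution blocks; spatial size shrinks with depth, output is 1x1x1024."""
    def __init__(self, in_channels=3):
        super().__init__()
        channels = [64, 64, 128, 128, 128, 256, 256, 256, 512, 1024]
        blocks, prev = [], in_channels
        for i, ch in enumerate(channels, start=1):
            blocks.append(conv_block(prev, ch))
            if i in (2, 5, 8):                     # max pooling after blocks 2, 5, 8
                blocks.append(nn.MaxPool2d(2))
            prev = ch
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)        # -> 1 x 1 x 1024

    def forward(self, x):
        return torch.flatten(self.pool(self.features(x)), 1)   # FLAT(gamma(FSR))
```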

After the face and eye features are obtained, the network passes through a fully connected layer that joins the face, left-eye, and right-eye features and outputs the final two-dimensional feature as the gaze estimation result:

ξpred=FCT([FCFA(FFA);F′E]) (11)

where FCFA denotes the fully connected operation on the face features.
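A sketch of how the fusion in formula (11) could be wired up is shown below; the input and hidden feature dimensions are illustrative placeholders (assumptions), since the patent only fixes the two-dimensional output, and one fully connected layer is shared by both eye branches in line with the weight sharing described above.

```python
import torch
import torch.nn as nn

class GazeHead(nn.Module):
    """xi_pred = FCT([FCFA(FFA); F'E_left; F'E_right]) -> 2-D gaze output."""
    def __init__(self, face_dim=512, eye_dim=1024, hidden=256):
        super().__init__()
        self.fc_face = nn.Linear(face_dim, hidden)     # FCFA
        self.fc_eye = nn.Linear(eye_dim, hidden)       # FCE, shared by both eyes
        self.fc_out = nn.Linear(hidden * 3, 2)         # FCT -> two gaze angles

    def forward(self, face_feat, left_eye_feat, right_eye_feat):
        fused = torch.cat([self.fc_face(face_feat),
                           self.fc_eye(left_eye_feat),
                           self.fc_eye(right_eye_feat)], dim=1)
        return self.fc_out(fused)
```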

In step S4, MSELoss is used as the loss function for gaze estimation. MSELoss, the mean squared error, is the expected value of the squared error between the model's predicted value and the true value; a smaller MSELoss indicates a smaller error. The gaze estimation loss of this module is therefore the mean squared error between the true gaze ξgt and the predicted gaze ξpred.
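With the predicted and ground-truth gaze angles represented as 2-D tensors, this loss is a one-liner in PyTorch, for example:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()                              # mean squared error
xi_pred = torch.randn(8, 2, requires_grad=True)       # predicted gaze for a batch of 8
xi_gt = torch.randn(8, 2)                             # ground-truth gaze angles
loss = criterion(xi_pred, xi_gt)                      # scalar loss
loss.backward()                                       # gradients for training
```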

To evaluate the performance of the proposed network model, the model was trained and evaluated on MPIIFaceGaze, a well-known public dataset for gaze estimation. The MPIIFaceGaze dataset, proposed by Zhang et al., contains images of 15 different subjects collected under different head orientations, head poses, and lighting conditions; the provided evaluation subset consists of 3000 randomly selected samples per subject, 45000 samples in total. As in other works, the evaluation protocol provided with the dataset is followed, namely leave-one-person-out cross-validation.

Data preprocessing follows the procedure proposed by Zhang et al., the authors of the dataset. To simulate LR environments, the dataset is downsampled to different resolutions: the high-resolution (HR) images are 128×128, and bicubic downsampling with scale factors of 2×, 4×, and 8× yields three groups of LR images with resolutions of 64×64, 32×32, and 16×16, respectively. Experiments are run on these four groups of data to improve robustness and to evaluate the results comprehensively across resolutions, again using leave-one-person-out cross-validation.
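A small sketch of this LR simulation, assuming Pillow's bicubic resampling; the 128×128 HR size and the 2×, 4×, and 8× factors come from the text, while the file handling is illustrative.

```python
from PIL import Image

def make_lr_versions(hr_path):
    """Bicubic-downsample a 128x128 HR face image to 64x64, 32x32 and 16x16."""
    hr = Image.open(hr_path).convert("RGB").resize((128, 128), Image.BICUBIC)
    return {scale: hr.resize((128 // scale, 128 // scale), Image.BICUBIC)
            for scale in (2, 4, 8)}
```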

To verify the effectiveness of the proposed LOSRG-Net, it is compared experimentally with other state-of-the-art methods on the MPIIFaceGaze dataset. For a fair comparison, each method uses the experimental settings of its original paper, including the model architecture and hyperparameters, so as to reproduce its reported performance.

The performance of the proposed LOSRG-Net and these state-of-the-art methods on MPIIFaceGaze is compared in Table 1. The top row gives the dataset resolutions, 128×128, 64×64, 32×32, and 16×16, from high to low, and the table entries give the angular error of each model at each resolution.

Table 1. Comparison of the present invention with other state-of-the-art methods on the MPIIFaceGaze dataset

As shown in Table 1, the angular error of LOSRG-Net is lower than that of all the compared state-of-the-art methods at every resolution, which means the proposed method achieves the best performance. At the same time, the performance of almost every model degrades as the resolution decreases, and some models degrade sharply. The proposed method degrades most slowly as the input resolution drops and performs particularly well at extremely low resolutions, showing that it can effectively estimate gaze in low-resolution settings.

This embodiment describes one application scenario of the present invention.

Gaze estimation has many application scenarios; one of them is monitoring concentration in online learning. Online teaching has become widespread and appears in many educational settings, but compared with offline teaching it lacks supervision: students easily become fatigued or distracted and lose focus in class, which harms teaching quality and learning outcomes. The method proposed in this invention can effectively address this problem. The specific steps are as follows (a small sketch of the thresholding logic is given after the steps):

S1. Extract frame images from the online learner's video stream.

S2. Extract facial features from the frame images and locate the face region and the two eye regions.

S3. Pass the face and eye images through the gaze estimation model of the present invention to obtain the estimated gaze direction.

S4. Compute a concentration measure from the estimated gaze direction and set a threshold for the attention detection module: a value greater than or equal to the threshold is judged as distracted, and a value below the threshold as focused.

S5. Continuously collect the online learner's video stream in real time, execute steps S1 and S2, then execute step S3 to obtain the gaze direction from the gaze estimation model, and apply the attention detection module of step S4 to judge the learner's attention.

S6. The system issues a warning to the learner, reminding the learner to concentrate.
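A possible sketch of the thresholding in steps S4 to S6, assuming the model returns the gaze direction as (yaw, pitch) in radians and that concentration is measured over a sliding window of recent frames; the deviation threshold, window length, and distraction ratio are illustrative assumptions, not values given in the patent.

```python
import math
from collections import deque

DEVIATION_THRESHOLD = math.radians(25)   # gaze deviation counted as "off screen"
DISTRACTION_RATIO = 0.5                  # fraction of off-screen frames that triggers a warning
window = deque(maxlen=150)               # roughly 5 seconds at 30 fps

def update_attention(yaw, pitch):
    """Return True (issue a warning) when the learner appears distracted."""
    off_screen = math.hypot(yaw, pitch) > DEVIATION_THRESHOLD
    window.append(off_screen)
    return (len(window) == window.maxlen
            and sum(window) / len(window) >= DISTRACTION_RATIO)
```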

The present invention proposes a gaze estimation method based on a local super-resolution fused attention mechanism and designs an eye feature extraction module based on local super-resolution: the network first performs super-resolution reconstruction of the eye images to efficiently and quickly restore low-resolution gaze-related features, and then applies a novel deep convolutional neural network in which the spatial size of the feature maps gradually decreases with depth, so that the super-resolved eye features can be extracted more effectively for accurate gaze estimation. The invention also proposes a face attention-enhanced feature extraction network that fuses an attention mechanism to strengthen low-resolution global features along both the spatial and channel dimensions, increasing the ability to extract face features in low-resolution environments and thereby improving gaze estimation. Experiments verify that the method effectively improves the accuracy of gaze estimation in low-resolution scenes.

The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present invention in detail. It should be understood that they are only specific embodiments of the present invention and are not intended to limit its scope; any equivalent changes and modifications made by those skilled in the art without departing from the concept and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (5)

1. A sight line estimation method based on a local super-resolution fusion attention mechanism is characterized by comprising the following steps:
s1, acquiring a frame image by using a camera;
S2, detecting and positioning a face area and a binocular area by adopting a face detection model, cutting a face image, and intercepting an eye image;
S3, strengthening and extracting the facial image features through a facial attention strengthening feature extraction module;
s4, extracting features of the binocular image through an eye feature extraction module based on local super resolution;
S5, obtaining a sight estimation result through fusion of the extracted face image and the binocular image features by the full-connection layer;
In step S3, the facial attention enhancing feature extraction module uses the ResNet18 model as the reference model of the facial feature extraction module, and for each standard residual block, the formula is as follows:
Fout=l(f(Fin,Wi)+Fin) (1)
Where l represents the ReLU activation function, f is the weight operation in the residual block, W is the weight of that operation, Fin is the input of the residual block, and Fout represents the output of the residual block;
In step S4, the local super-resolution-based eye feature extraction module uses FSRCNN as a super-resolution reconstruction network;
the FSRCNN algorithm mainly comprises three parts: feature extraction, shrinking and expanding, and deconvolution; the first part is feature extraction, which uses a small convolution kernel with a size of 5×5 to extract features from low-resolution eye images, and the formula of the feature extraction stage is as follows:
Wherein ILR represents the low-resolution eye image, PReLU represents the PReLU activation function, and d represents the number of feature maps;
The second part is shrinking and expanding, wherein a 1×1 convolution kernel is used in the shrinking stage to reduce the number of feature maps, and the mapping stage performs nonlinear mapping on the shrunken feature space, implemented by a plurality of consecutive 3×3 convolution layers, wherein the feature mapping stage is expressed by the following formula:
The third part is deconvolution, which uses deconvolution layers to expand the spatial size of the feature map, converting the low resolution feature map into a feature map of the high resolution image, as shown below:
FSR=DeConv9×9(F) (9)
wherein DeConv denotes a deconvolution operation.
2. The line-of-sight estimation method based on local super-resolution fusion attention mechanism according to claim 1, wherein in step S2, the face detection model adopts a face detection model based on a convolutional neural network.
3. The line-of-sight estimation method based on local super-resolution fusion attention mechanism according to claim 2, wherein in step S3, the face attention enhancement feature extraction module adds the CBAM attention mechanism on the basis of the ResNet18 reference model; the CBAM attention mechanism adds two attention modules to the feature map: a channel attention module and a spatial attention module;
The channel attention module aims to assign a weight to each channel and is realized through the features obtained by global average pooling and global maximum pooling; for a given feature map F ∈ RC×H×W, global average pooling and global maximum pooling are computed first, and the channel attention is obtained by processing the two pooled values with a shared multi-layer perceptron and combining them as follows:
where σ represents the Sigmoid activation function, MCA represents the channel attention, Fgap represents the globally average-pooled feature, and Fgmp represents the globally max-pooled feature;
The spatial attention module aims to assign a weight to each spatial position: global average pooling and global maximum pooling are computed again, this time along the channel dimension, and the two feature maps are then concatenated, passed through a 7×7 convolution layer, and finally through a Sigmoid activation function, formulated as follows:
MSA(F)=σ(Conv7×7([Fgap;Fgmp])) (3)
Wherein Conv7×7 denotes a 7×7 convolutional layer;
The face attention enhancement feature extraction module adds a CBAM attention module at the end of each stage to enhance the features, and for each residual stage the expression is as follows:
wherein ↓ represents downsampling; the features after enhancement by the CBAM attention mechanism are expressed as follows:
wherein ⊗ represents element-wise multiplication of corresponding elements, and FFA represents the facial feature map after CBAM attention enhancement.
4. The sight line estimation method based on the local super-resolution fusion attention mechanism according to claim 3, wherein in step S4, the local super-resolution based eye feature extraction module extracts features using the eye feature extraction network DeepEyeNet; DeepEyeNet is a CNN network for deep eye feature extraction composed of ten convolution blocks, a relatively deep convolutional neural network in which the spatial dimension of the feature maps gradually decreases with the depth of the network, and the left eye and the right eye pass through similar structures, with the formula as follows:
F′E=FCE(FLAT(γ(FSR))) (10)
Wherein γ represents the feature extraction operation of DeepEyeNet, F′E represents the output eye feature, and FCE represents the eye fully connected operation;
after the features of the face and the eyes are obtained, the network connects the face, left-eye, and right-eye features through a fully connected layer and outputs the final two-dimensional features as the sight line estimation result:
ξpred=FCT([FCFA(FFA);F′E]) (11)
Where FCFA represents the fully connected operation on the face features.
5. The line-of-sight estimation method based on local super-resolution fusion attention mechanism according to claim 4, wherein MSELoss is adopted as a loss function of line-of-sight estimation, and the loss function of line-of-sight estimation is:
Wherein the true value of the line of sight is ξgt and the predicted value of the line of sight estimate is ξpred.
CN202410005814.8A 2024-01-03 2024-01-03 Sight estimation method based on local super-resolution fusion attention mechanism Active CN117830783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410005814.8A CN117830783B (en) 2024-01-03 2024-01-03 Sight estimation method based on local super-resolution fusion attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410005814.8A CN117830783B (en) 2024-01-03 2024-01-03 Sight estimation method based on local super-resolution fusion attention mechanism

Publications (2)

Publication Number Publication Date
CN117830783A CN117830783A (en) 2024-04-05
CN117830783B (en) 2024-09-03

Family

ID=90509546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410005814.8A Active CN117830783B (en) 2024-01-03 2024-01-03 Sight estimation method based on local super-resolution fusion attention mechanism

Country Status (1)

Country Link
CN (1) CN117830783B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118799949B * 2024-06-26 2025-04-11 Nantong University High-precision line of sight estimation method in low-light environment
CN120047991B * 2025-04-24 2025-07-15 Quanzhou Normal University Method for establishing eye state estimation network and method for estimating eye state

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361441A (en) * 2021-06-18 2021-09-07 山东大学 Sight line area estimation method and system based on head posture and space attention
CN114582009A (en) * 2022-03-11 2022-06-03 卫来 Monocular fixation point estimation method and system based on mixed attention mechanism

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11195056B2 (en) * 2019-09-25 2021-12-07 Fotonation Limited System improvement for deep neural networks
CN116563916A (en) * 2023-04-25 2023-08-08 山东大学 Attention fusion-based cyclic face super-resolution method and system
CN116664677B (en) * 2023-05-24 2024-06-14 南通大学 Sight estimation method based on super-resolution reconstruction

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361441A (en) * 2021-06-18 2021-09-07 山东大学 Sight line area estimation method and system based on head posture and space attention
CN114582009A (en) * 2022-03-11 2022-06-03 卫来 Monocular fixation point estimation method and system based on mixed attention mechanism

Also Published As

Publication number Publication date
CN117830783A (en) 2024-04-05

Similar Documents

Publication Publication Date Title
Tu et al. RGBT salient object detection: A large-scale dataset and benchmark
Zhang et al. Multi-scale spatiotemporal feature fusion network for video saliency prediction
CN117830783B (en) Sight estimation method based on local super-resolution fusion attention mechanism
CN110246181B (en) Anchor point-based attitude estimation model training method, attitude estimation method and system
CN108038420B (en) A Human Action Recognition Method Based on Depth Video
CN110414432A (en) Training method, object identifying method and the corresponding device of Object identifying model
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN114093013B (en) A method and system for reverse traceability of deep forgery faces
CN110795982A (en) An Appearance Sight Estimation Method Based on Human Pose Analysis
CN101493890B (en) Dynamic vision caution region extracting method based on characteristic
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN113343950A (en) Video behavior identification method based on multi-feature fusion
Song et al. PSTNet: Progressive sampling transformer network for remote sensing image change detection
CN116958324A (en) Training method, device, equipment and storage medium of image generation model
CN113705358A (en) Multi-angle side face obverse method based on feature mapping
CN116664677B (en) Sight estimation method based on super-resolution reconstruction
CN116091793A (en) Light field significance detection method based on optical flow fusion
CN114648604A (en) Image rendering method, electronic device, storage medium and program product
CN116051946A (en) A Change Detection Method of Remote Sensing Image Based on Spatial-Temporal Feature Fusion
CN119445672A (en) Human pose estimation method based on dynamic graph convolutional network
CN113239866A (en) Face recognition method and system based on space-time feature fusion and sample attention enhancement
CN117409359A (en) Fire detection method of dynamic multi-scale attention network
Zhou et al. DeepU-Net: A Parallel Dual-Branch Model for Deeply Fusing Multi-Scale Features for Road Extraction From High-Resolution Remote Sensing Images
CN119964244A (en) PSC-TNet video action recognition method based on fusion of spatial features and frame difference information
CN114120076A (en) Cross-view video gait recognition method based on gait motion estimation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20250827

Address after: 300200 Tianjin City, Hexi District, the southeast corner of the intersection of Neijiang Road and Chongjiang Dao, Chen Tang Science and Technology Park (Guan Tang Building), Room 1509, Building 1

Patentee after: Tianjin Boxin Energy Technology Co.,Ltd.

Country or region after: China

Address before: Nantong University Technology Transfer Research Institute, Building 1, No. 79 Yongfu Road, Chongchuan District, Nantong City, Jiangsu Province, 226000

Patentee before: NANTONG University

Country or region before: China

TR01 Transfer of patent right
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载