CN117612142B - Head posture and fatigue state detection method based on multi-task joint model
- Publication number: CN117612142B (application CN202311520633.0A)
- Authority: CN (China)
- Prior art keywords: fatigue, model, head, detection, information
- Legal status: Active
Classifications
- G06V20/597: Recognising the driver's state or behaviour, e.g. attention or drowsiness
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/08: Learning methods
- G06V10/764: Recognition using pattern recognition or machine learning with classification, e.g. of video objects
- G06V10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V40/16: Human faces, e.g. facial parts, sketches or expressions
- Y02T10/40: Engine management systems
Description
Technical Field

The present invention relates to the field of computer vision, and in particular to a head posture and fatigue state detection method based on a multi-task joint model.
Background Art

With the rapid development of the automobile industry, driving safety has received increasing attention, and fatigue driving has attracted wide concern as a key safety hazard. A fatigued driver reacts slowly and misjudges situations, increasing the risk of traffic accidents. Real-time monitoring and early warning of driver fatigue are therefore essential to road traffic safety.

Traditional fatigue-driving detection methods mainly analyze the driver's physiological signals, such as EEG and heart rate, but they require contact with the driver's body and are relatively complicated to install and use. In recent years, fatigue detection based on computer vision has developed rapidly; such methods judge whether the driver is fatigued mainly by analyzing facial features such as the state of the eyes and mouth. However, they often rely on high-compute hardware for real-time analysis, which consumes substantial computing resources and can suffer from a high false-alarm rate.

As more and more scenarios require handling multiple tasks or objectives simultaneously, multi-task joint models have emerged. A multi-task joint model is a machine learning model that processes several tasks at once, and the tasks may belong to different types or domains; its basic idea is to couple the tasks by sharing low-level features such as convolutional layers or word-embedding layers.

Multi-task joint models also have potential drawbacks: the correlations between tasks may differ, so some tasks may not be learned sufficiently during training; the information shared across tasks may contain redundancy and noise that disturb training; and the larger parameter count of a joint model may make it prone to overfitting.

Therefore, for fatigue-driving detection based on a multi-task joint model, the central difficulty is to exploit the correlations between tasks so as to improve the generalization, robustness, and reliability of the fatigue detection model and to reduce training time and computing resources, while properly handling the relationships and conflicts between tasks, the consumption of data and computing resources, the scalability and maintainability of the model, and its training and optimization. These issues are the focus of current research on fatigue-driving detection.
Summary of the Invention

The object of the present invention is to provide a head posture and fatigue state detection method based on a multi-task joint model that improves the generalization, robustness, and reliability of the model and reduces training time and computing resources while properly handling the relationships and conflicts between tasks. The method is applicable to various driving scenarios; it improves driver safety, reduces fatigued and distracted driving behavior, and lowers the incidence of traffic accidents.
To achieve the above object, the present invention provides a head posture and fatigue state detection method based on a multi-task joint model, comprising the following steps:

S1: Design the fatigue and distraction detection model by improving on YOLOv6 as the baseline; on top of YOLOv6, design an enhanced feature extraction network based on an aggregation-and-distribution mechanism;

S2: Add a head posture estimation branch incorporating a large-kernel attention mechanism to the fatigue and distraction detection model;

S3: Prepare a face dataset and annotate it to form a fatigue-driving dataset; in addition to the category and bounding box of each target, add an extra label to each face indicating whether the head rotation angle is greater than 45°; the annotated categories are eyes open, eyes closed, mouth open, and mouth closed;

S4: Train the fatigue and distraction detection model with an object detection loss function and a head posture estimation loss function;

S5: Deploy the fatigue and distraction detection model on a vehicle-mounted terminal device, feed the video stream captured by the terminal camera into the model, and use the trained model to detect head posture and fatigue state and output information comprising the target category, the bounding box, and whether the head rotation angle is greater than 45°;

S6: Determine whether the driver is fatigued or distracted by comparing the duration of a given category with a set threshold.
Further, in step S1, the enhanced feature extraction network based on the aggregation-and-distribution mechanism uses a low-level aggregation-and-distribution mechanism in place of the upsampling fusion stage of the YOLOv6 enhanced feature extraction network, and a high-level aggregation-and-distribution mechanism in place of its downsampling fusion stage.

Further, the aggregation-and-distribution mechanism comprises an information alignment module, an information fusion module, and an information distribution module. The information alignment module collects multi-level feature maps from the backbone network and aligns them by upsampling or downsampling; the information fusion module fuses the aligned features into global features; and the information distribution module uses a self-attention mechanism to distribute the global features to each feature level.

Further, in step S2, the head posture estimation branch incorporating the large-kernel attention mechanism consists of several convolutional layers, a large-kernel attention module, and a fully connected layer.

Further, the large-kernel attention module captures long-range relationships; it uses large-kernel convolutional layers to establish global correlations and produce the attention result, while using depthwise separable convolutions to reduce the parameter count.

Further, in step S4, the object detection loss function consists of two parts: a regression loss based on SIoU and a classification loss based on a classification-regression alignment method. The head posture estimation loss function is the cross-entropy between the model prediction and the ground-truth label, and the two losses are balanced by a weight parameter during training.

Further, the fatigue-driving dataset obtained in step S3 is divided into training, validation, and test sets at a ratio of 8:1:1. When loading the dataset during training, the mosaic and mixup data augmentation methods are used to improve robustness, and horizontal and vertical flipping, random rotation, random cropping, deformation, and scaling are used to increase the sample count of under-represented categories.

Further, the fatigue and distraction detection model is obtained by training a convolutional neural network comprising a backbone network, an aggregation-and-distribution enhanced feature extraction network, an object detection head, and a large-kernel-attention head posture estimation branch. The backbone network extracts image features; the object detection head, which comprises a classification branch, a bounding-box regression branch, and a depth-information regression branch, outputs the bounding box and category; and the head posture estimation branch outputs the head-turn result.
Beneficial effects of the present invention:

The head posture and fatigue state detection method of the present invention designs an enhanced feature extraction network based on an aggregation-and-distribution mechanism: a unified module collects and fuses feature information at different scales and then distributes the fused features to the different levels. This avoids the information loss inherent in the YOLOv6 enhanced feature extraction network structure and strengthens its feature-fusion capability without significantly increasing inference time; the aggregation-and-distribution mechanism also strengthens the model's global feature extraction and its learning of global image information, improving detection performance.

The present invention adds a head posture estimation branch incorporating a large-kernel attention mechanism to further strengthen the learning of global image information and give the model the ability to learn head posture, improving the accuracy of head posture estimation. The branch reduces the regression problem to a classification problem and therefore offers good real-time performance and high accuracy. It does not require tedious keypoint annotation during training, is easy to fine-tune for different scenes and tasks, and reduces the influence of camera position on the detection results to some extent, improving the robustness of the model.

The head posture estimation branch shares weights with the fatigue state detection branch; the former locates the positions of the eyes and mouth and provides additional semantic information to the latter, so the model performs better on this task. The large-kernel attention module of the head posture estimation branch captures long-range relationships and thus effectively extracts global facial features for head posture estimation; it uses large-kernel convolutional layers to establish global correlations and produce the attention result, while depthwise separable convolutions reduce the parameter count and the model's inference time.
Brief Description of the Drawings

Fig. 1 is a diagram of the working principle of the present invention.

Fig. 2 is a schematic diagram of CUDA heterogeneous parallel computing.

Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1 and Fig. 2, a head posture and fatigue state detection method based on a multi-task joint model comprises the following steps:

S1: Design the fatigue and distraction detection model by improving on YOLOv6 as the baseline; on top of YOLOv6, design an enhanced feature extraction network based on the aggregation-and-distribution mechanism.

The enhanced feature extraction network based on the aggregation-and-distribution mechanism uses a low-level aggregation-and-distribution mechanism in place of the upsampling fusion stage of the YOLOv6 enhanced feature extraction network, and a high-level aggregation-and-distribution mechanism in place of its downsampling fusion stage. A unified module collects and fuses feature information at different scales and then distributes the fused features to the different levels, which avoids the information loss inherent in the YOLOv6 enhanced feature extraction network structure and strengthens the feature-fusion capability of the network without significantly increasing inference time.

The aggregation-and-distribution mechanism comprises an information alignment module, an information fusion module, and an information distribution module. The information alignment module collects multi-level feature maps from the backbone network and aligns them by upsampling or downsampling. The information fusion module fuses the aligned features into global features. The information distribution module uses a self-attention mechanism to distribute the global features to each feature level. The mechanism effectively routes the fused global information back to every feature level, strengthening the model's global feature extraction; the enhanced feature extraction network based on this mechanism therefore improves the model's learning of global image information and its detection performance, as the sketch below illustrates.
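As a reference point, the following is a minimal PyTorch sketch of the align-fuse-distribute pattern. The channel widths, the choice of the middle level as the fusion resolution, and the multi-head self-attention layout are illustrative assumptions; the patent specifies only the three modules and their roles:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatherDistribute(nn.Module):
    """Align multi-level features, fuse them globally, distribute them back.
    A sketch only: channel counts and the attention layout are assumptions."""
    def __init__(self, channels=(128, 256, 512), fused=256):
        super().__init__()
        # 1x1 convs project every level to a common width before fusion
        self.align = nn.ModuleList([nn.Conv2d(c, fused, 1) for c in channels])
        self.fuse = nn.Conv2d(fused * len(channels), fused, 1)
        self.attn = nn.MultiheadAttention(fused, num_heads=4, batch_first=True)
        self.out = nn.ModuleList([nn.Conv2d(fused, c, 1) for c in channels])

    def forward(self, feats):
        # Alignment: resize every level to the middle level's resolution
        target = feats[1].shape[-2:]
        aligned = [F.interpolate(proj(f), size=target, mode="bilinear",
                                 align_corners=False)
                   for proj, f in zip(self.align, feats)]
        # Fusion: concatenate and mix into one global feature map
        g = self.fuse(torch.cat(aligned, dim=1))
        b, c, h, w = g.shape
        tokens = g.flatten(2).transpose(1, 2)            # (B, HW, C)
        g_attn, _ = self.attn(tokens, tokens, tokens)    # self-attention
        g = g_attn.transpose(1, 2).reshape(b, c, h, w)
        # Distribution: resize the global feature to each level and inject it
        outs = []
        for proj, f in zip(self.out, feats):
            d = F.interpolate(g, size=f.shape[-2:], mode="bilinear",
                              align_corners=False)
            outs.append(f + proj(d))                     # residual injection
        return outs
```

The residual injection at the end is one plausible reading of "distributing the global features to each feature level"; per-level attention-weighted injection would fit the description equally well.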
S2: Add a head posture estimation branch incorporating the large-kernel attention mechanism to further strengthen the learning of global image information and give the model the ability to learn head posture, improving the accuracy of head posture estimation. The branch consists of several convolutional layers, a large-kernel attention module, and a fully connected layer. It reduces the regression problem to a classification problem, directly judging whether the head rotation angle is greater than 45°, and therefore offers good real-time performance and high accuracy. To address its sensitivity to occlusion and noise and its lower accuracy in natural scenes, datasets covering different scenes and faces are added when training and updating the fatigue and distraction detection model.

The branch does not require tedious keypoint annotation during training, is easy to fine-tune for different scenes and tasks, and reduces the influence of camera position on the detection results to some extent, improving the robustness of the model. It also shares weights with the fatigue state detection branch; the former locates the positions of the eyes and mouth and provides additional semantic information to the latter, so the model performs better on this task. The large-kernel attention module captures long-range relationships and thus effectively extracts global facial features for head posture estimation; it uses large-kernel convolutional layers to establish global correlations and produce the attention result, while depthwise separable convolutions reduce the parameter count and the inference time. One common form of such a module is sketched below.
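A widely used realization of large-kernel attention decomposes one large kernel into a depthwise convolution, a depthwise dilated convolution, and a pointwise convolution whose output gates the input. The sketch below follows that pattern; the specific kernel sizes (5, then 7 with dilation 3) are assumptions, since the patent does not state them:

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Approximate a large (roughly 21x21) kernel with a 5x5 depthwise conv,
    a 7x7 depthwise conv with dilation 3, and a 1x1 conv; the result serves
    as an attention map that gates the input. Kernel sizes are assumptions."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3,
                                    groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn  # elementwise gating by the attention map
```

Because every convolution here is depthwise or pointwise, the module keeps the parameter count low while the stacked receptive field stays large, matching the stated design goal.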
S3: Prepare a face dataset and annotate it to form the fatigue-driving dataset. In this embodiment, video data are collected by a camera and manually annotated to form the fatigue-driving dataset, and the convolutional neural network is trained on this dataset to obtain the fatigue and distraction detection model. In addition to the category and bounding box of each target, an extra label is added to each face indicating whether the head rotation angle is greater than 45°; the annotated categories are eyes open, eyes closed, mouth open, and mouth closed.

The fatigue-driving dataset is divided into training, validation, and test sets at a ratio of 8:1:1. When loading the dataset during training, the mosaic and mixup data augmentation methods are used to improve robustness, and horizontal and vertical flipping, random rotation, random cropping, deformation, and scaling are used to increase the sample count of under-represented categories and thereby improve the model's generalization; a sketch of the split and augmentation setup follows.
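A minimal sketch of the 8:1:1 split and the per-image augmentations using torchvision; mosaic and mixup operate batch-wise and are normally supplied by the detection framework's data loader (e.g. the YOLOv6 trainer), so they appear only as a comment. For detection data the bounding boxes must be transformed together with the images, which such frameworks handle internally; all parameters here are illustrative:

```python
import torch
from torch.utils.data import random_split
from torchvision import transforms

# Illustrative augmentation pipeline for under-represented categories;
# mosaic and mixup are applied batch-wise by the detection framework's
# loader and are not shown here.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=640, scale=(0.8, 1.0)),  # crop + scale
    transforms.RandomAffine(degrees=0, shear=5),               # mild deformation
])

def split_811(dataset, seed=0):
    """Split a dataset into train/val/test at an 8:1:1 ratio."""
    n = len(dataset)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    n_test = n - n_train - n_val
    gen = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val, n_test], generator=gen)
```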
The convolutional neural network comprises a backbone network, an aggregation-and-distribution enhanced feature extraction network, an object detection head, and a large-kernel-attention head posture estimation branch. The backbone network extracts image features, and the object detection head outputs bounding boxes and categories. The detection head comprises a classification branch, a bounding-box regression branch, and a depth-information regression branch; the classification and bounding-box regression branches output the category and bounding box respectively, and the large-kernel-attention head posture estimation branch outputs the head-turn result. The depth-information regression branch is included to distinguish front-seat from rear-seat occupants and avoid misidentification.

The depth-information regression branch consists of several convolutional layers, a pooling layer, and a fully connected layer. Depth is decoded as follows: the in-vehicle depth span [0, V] is divided evenly into s stages of width V/s, and each stage is assigned a representative depth; for the resulting s-class classification model, the final prediction is the sum over classes of each class's probability multiplied by that class's representative depth. Since the YOLOv6 model predicts per anchor point, i.e. bounding-box and category information for each anchor, the coarse-grained depth estimate is likewise made per anchor point: each anchor predicts bounding-box, category, and depth information.

The interval containing the decoded coarse-grained depth value distinguishes front-row from rear-row occupants: a value in [0, 1] indicates a front-row occupant, and a value in [1, 2] indicates a rear-row occupant. The coarse-grained depth value is a feature-scaled version of the distance between the detected target and the camera, with the actual distance scaled into [0, 2]; a value in [0, 1] means the target is closer to the camera and is identified as sitting in the front row, while a value in [1, 2] means the target is farther from the camera and is identified as sitting in the rear row, which avoids misidentification. This decoding is sketched below.
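A minimal sketch of the expected-value decoding and the front/rear decision, assuming the normalized span V = 2 from the paragraph above and midpoint representatives (i + 0.5)·V/s, which is one common convention the text does not fix:

```python
import torch

def decode_depth(logits, V=2.0):
    """Expected-value decoding: softmax over s depth stages, then the
    probability-weighted sum of each stage's representative depth.
    Midpoint representatives (i + 0.5) * V / s are an assumption."""
    s = logits.shape[-1]
    probs = torch.softmax(logits, dim=-1)                 # (..., s)
    reps = (torch.arange(s, dtype=probs.dtype) + 0.5) * V / s
    return (probs * reps).sum(dim=-1)                     # coarse depth

def seat_row(depth):
    """Front row if the coarse depth falls in [0, 1], rear row otherwise."""
    return "front" if depth <= 1.0 else "rear"

# Example: per-anchor depth logits over s = 8 stages
logits = torch.randn(8)
d = decode_depth(logits).item()
print(f"coarse depth {d:.2f} -> {seat_row(d)}")
```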
S4: Train the fatigue and distraction detection model with the object detection loss function and the head posture estimation loss function. The object detection loss consists of a regression loss based on SIoU and a classification loss based on a classification-regression alignment method. The head posture estimation loss is the cross-entropy between the model prediction and the ground-truth label. The two losses are balanced by a weight parameter during training; the combined objective is sketched below.
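The following sketch shows the shape of the combined objective. The SIoU regression and aligned classification terms are abstracted as precomputed values, since their full definitions are lengthy, and the weight lam = 0.5 is an assumption:

```python
import torch
import torch.nn.functional as F

def total_loss(det_cls_loss, det_reg_loss, pose_logits, pose_labels,
               lam=0.5):
    """Combined objective: detection loss (SIoU regression + aligned
    classification, computed upstream) plus a lambda-weighted cross-entropy
    for the binary head-turn classification. lam = 0.5 is an assumption."""
    pose_loss = F.cross_entropy(pose_logits, pose_labels)
    return det_cls_loss + det_reg_loss + lam * pose_loss

# Example: the head-turn branch outputs 2 logits (<=45 deg, >45 deg) per face
pose_logits = torch.randn(4, 2)            # batch of 4 faces
pose_labels = torch.tensor([0, 1, 0, 0])   # 1 means rotation > 45 degrees
loss = total_loss(det_cls_loss=torch.tensor(1.2),
                  det_reg_loss=torch.tensor(0.8),
                  pose_logits=pose_logits, pose_labels=pose_labels)
```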
S5: Deploy the fatigue and distraction detection model on a vehicle-mounted terminal device, feed the video stream captured by the terminal camera into the model, and output the detection information: the target category, the bounding box, and whether the head rotation angle is greater than 45°. The model is deployed as follows:

First convert the fatigue and distraction detection model to an ONNX model, then convert the ONNX model to a TensorRT model. Specifically, the trained network is exported to ONNX through PyTorch's built-in interface; in TensorRT, a parser reads the ONNX model and builds the engine; the TensorRT C++ API and the Libtorch library then implement the model post-processing. During inference, attention must be paid to GPU memory allocation: the CUDA library moves data from the CPU to the GPU for computation and moves it back from the GPU to the CPU after inference. The export step is sketched below.
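A minimal sketch of the PyTorch-side ONNX export; the TensorRT engine would then be built on the target device with the ONNX parser or the trtexec tool. The 640x640 input size, tensor names, stub model, and file names are assumptions:

```python
import torch
import torch.nn as nn

class StubModel(nn.Module):
    """Stand-in for the trained fatigue/distraction network (two heads)."""
    def forward(self, x):
        det = x.mean(dim=(2, 3))       # placeholder detection output
        pose = x.amax(dim=(1, 2, 3))   # placeholder head-turn logit
        return det, pose

model = StubModel().eval()
dummy = torch.randn(1, 3, 640, 640)    # assumed 640x640 RGB input
torch.onnx.export(
    model, dummy, "fatigue_detect.onnx",
    input_names=["images"],
    output_names=["det_out", "pose_out"],   # detection + head-turn heads
    opset_version=13,
    dynamic_axes={"images": {0: "batch"}},  # allow variable batch size
)
# Engine build on the target device, for example:
#   trtexec --onnx=fatigue_detect.onnx --saveEngine=fatigue_detect.engine
```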
S6: Determine whether the driver is fatigued or distracted by whether the duration of a given category exceeds a set threshold. Changes in head posture and in the positions of the eyes and mouth indicate whether the driver is looking around and driving distracted; head posture information together with the opening and closing of the eyes and mouth indicates whether the driver is driving fatigued. The detection categories used to judge fatigue from the eyes and mouth are eyes open, mouth closed, eyes closed, and mouth open. The features of a fatigued person are intuitive and obvious, such as blink frequency, eye movement, yawning, and nodding; these states are recorded by the camera, then recognized and judged.

The specific decision procedure is as follows. First, one frame is taken for face detection. If a face is found, the mouth and eyes are located, fatigue information and head-turn information are extracted, and the information is fused; if the frame is in one of the defined abnormal states, the duration is accumulated, and once it exceeds the threshold (set to 3 s in this embodiment) a fatigue state is declared and a warning or prompt is issued. If the image is recognized as containing no face, another frame is taken at random and the procedure repeats. Eye fatigue is a state over a period of time, so the PERCLOS method is used: the eye is judged open when its openness exceeds 20% and closed when it is 20% or less. For mouth-state detection, the mouth has many possible states, of which yawning reflects fatigue, so distinguishing the yawning mouth state from the others suffices to judge whether the driver is fatigued. The mouth opening degree is computed from the mouth's geometry: the mouth is located with a rectangular box, and the opening degree is the ratio of the box's height to its width. An opening degree greater than 0.8 is judged a yawning open-mouth state; an opening degree of 0.8 or less is judged a closed-mouth state. This decision loop is sketched below.
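A minimal sketch of the decision loop, applying the 20% eye-openness, 0.8 mouth-ratio, and 3 s duration thresholds from the paragraph above; the frame rate, observation interface, and state names are assumptions:

```python
from dataclasses import dataclass

FPS = 30                 # assumed camera frame rate
ALARM_SECONDS = 3.0      # duration threshold from this embodiment
PERCLOS_OPEN = 0.20      # eye openness > 20% counts as open
YAWN_RATIO = 0.8         # mouth height/width > 0.8 counts as yawning

@dataclass
class FrameObs:
    face_found: bool
    eye_openness: float    # 0..1
    mouth_ratio: float     # height / width of the mouth box
    head_turned: bool      # head rotation > 45 degrees

def classify(obs):
    """Map one frame's measurements to an abnormal-state label or None."""
    if obs.head_turned:
        return "distracted"
    if obs.eye_openness <= PERCLOS_OPEN:
        return "eyes_closed"
    if obs.mouth_ratio > YAWN_RATIO:
        return "yawning"
    return None

def monitor(frames):
    """Accumulate consecutive abnormal frames; alarm once 3 s is exceeded."""
    run_state, run_frames = None, 0
    for obs in frames:
        if not obs.face_found:       # no face: skip and keep sampling
            continue
        state = classify(obs)
        if state is not None and state == run_state:
            run_frames += 1
        else:
            run_state, run_frames = state, (1 if state else 0)
        if run_state and run_frames / FPS >= ALARM_SECONDS:
            yield run_state          # warn or prompt the driver
            run_frames = 0
```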
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311520633.0A | 2023-11-14 | 2023-11-14 | Head posture and fatigue state detection method based on multi-task joint model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN117612142A | 2024-02-27 |
| CN117612142B | 2024-07-12 |
Family
ID=89945367
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |