
CN119399795A - A finger key point detection method for factory operation safety system - Google Patents

A finger key point detection method for factory operation safety system

Info

Publication number
CN119399795A
CN119399795A (application CN202411604718.1A)
Authority
CN
China
Prior art keywords
detection
key point
target
finger
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411604718.1A
Other languages
Chinese (zh)
Inventor
徐辰楠
王战
何星慰
贾晓燕
俞荣栋
孟瑜炜
骆洲
李泽易
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Zheneng Digital Technology Co ltd
Original Assignee
Zhejiang Zheneng Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Zheneng Digital Technology Co ltd filed Critical Zhejiang Zheneng Digital Technology Co ltd
Priority to CN202411604718.1A priority Critical patent/CN119399795A/en
Publication of CN119399795A publication Critical patent/CN119399795A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0985Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a finger key point detection method for a factory operation safety system. The method comprises acquiring input data and performing target detection and key point detection on the input data through a target detection network. The target detection network comprises a network backbone structure, a neck structure and a detection structure; the detection structure comprises a target detection head and a key point detection head, and the key point detection head comprises a first convolution layer, a convolution and self-attention fusion module ACmix, and a second convolution layer connected in sequence. The beneficial effect of the method is that optimizing the target detection network with the convolution and self-attention fusion module ACmix retains the advantages of the convolutional neural network in capturing local image features while significantly enhancing the model's perception of global image semantics.

Description

Finger key point detection method for factory operation safety system
Technical Field
The invention relates to the technical field of image processing, in particular to a finger key point detection method for a factory operation safety system.
Background
In the field of computer vision, YOLO (You Only Look Once) is a fast and accurate target detection algorithm that has evolved through multiple versions, each improving and optimizing on the previous one. YOLO excels in both the accuracy and the real-time performance of target detection, and its relatively simple model structure and smaller number of parameters make it easy to deploy on a variety of hardware platforms. These advantages have led to the widespread use of YOLO in various downstream computer vision tasks, mainly including target classification, target detection, image segmentation and keypoint detection. Finger key point detection, however, is an important module in certain specific industrial scenarios, such as intelligent systems designed to ensure property and personnel safety. To guarantee high accuracy, such systems require further improvement of the robustness of the YOLO key point detection model.
The original YOLO key point detection model has a certain level of performance, but it still has limitations and room for improvement. The main reasons for its limited performance are as follows:
1. Model structure limitations: YOLO models were originally designed primarily for target detection, and their key point detection capability is relatively weak, especially when complex spatial relationships exist between key points. YOLO keypoint detection predicts the relative positions of keypoints within a detected target frame, which may not be sufficient to capture the fine spatial layout and dynamic changes between fingers.
2. Dataset limitations: finger keypoint detection requires high-quality, diversified annotation data to train the model. If the training data is insufficient or the labeling is inaccurate, the model struggles to learn effective feature representations, causing detection performance to degrade.
3. Target posture changes: in practical applications, the position and posture of the finger may change significantly. YOLO models may perform poorly when dealing with such complex geometric changes, so the robustness of the model needs to be increased through a stronger feature extraction network or an attention mechanism.
Disclosure of Invention
The invention aims at overcoming the defects of the prior art, and provides a finger key point detection method for a factory operation safety system.
In a first aspect, a finger keypoint detection method for a factory operation safety system is provided, comprising:
step 1, acquiring input data;
step 2, performing target detection and key point detection on the input data through a target detection network;
The target detection network comprises a network backbone structure, a neck structure and a detection structure, wherein the detection structure comprises a target detection head and a key point detection head, and the key point detection head comprises a first convolution layer, a convolution and self-attention fusion module ACmix and a second convolution layer which are sequentially connected.
Preferably, in step 2, the operation of the convolution and self-attention fusion module ACmix includes:
Acquiring an input feature map, and performing linear mapping on the input feature map using convolution kernels to generate weight matrices;
extracting local features of the input feature map by convolution operation according to the weight matrix, and obtaining a first feature map;
Capturing long-distance dependency relations among different positions in the feature map by using a self-attention mechanism, and acquiring a second feature map according to the long-distance dependency relations;
and carrying out weighted summation on the first feature map and the second feature map through the learnable hyperparameters alpha and beta to obtain an output feature map, wherein the dimensions of the output feature map are consistent with those of the input feature map.
Preferably, in step 2, the target detection network performs feature enhancement processing in a training stage, where the feature enhancement processing includes:
Training the RT-DETR model by utilizing a palm target detection data set to obtain deviations dx and dy of a target frame predicted by the RT-DETR and a real target frame in the directions of an x axis and a y axis of the image;
modeling dx and dy by adopting a Gaussian mixture model GMM, and sampling to obtain a supplementary characteristic value;
Adding the supplementary feature values into an original training set to obtain an extended training set, and training a target detection network according to the extended training set.
Preferably, in the step 1, the method further comprises the step of carrying out data enhancement processing on the input data, wherein the data enhancement processing comprises the step of mixing the actual application scene picture with the finger key point data set.
In a second aspect, a finger keypoint detection device for a factory operation safety system is provided, for performing any one of the finger keypoint detection methods of the first aspect, comprising:
the acquisition module is used for acquiring input data;
The detection module is used for carrying out target detection and key point detection on the input data through a target detection network;
The target detection network comprises a network backbone structure, a neck structure and a detection structure, wherein the detection structure comprises a target detection head and a key point detection head, and the key point detection head comprises a first convolution layer, a convolution and self-attention fusion module ACmix and a second convolution layer which are sequentially connected.
In a third aspect, a vision-based factory error prevention system is provided, comprising an image input unit, a detection unit, a calculation unit and a judgment unit;
The image input unit is used for acquiring an operation image;
The detection unit is used for executing the finger key point detection method according to any one of the first aspect, and detecting the positions of the target and the finger key points in the operation image;
The calculating unit calculates the distance between the target and the finger key point according to the positions of the target and the finger key point in the operation image, and determines the position and the state of the target closest to the finger key point;
And the judging unit judges whether the executed operation is accurate or not according to the target position closest to the finger key point and the state thereof, and alarms if the executed operation is not accurate.
In a fourth aspect, there is provided a computer storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the method of any of the first aspects.
In a fifth aspect, there is provided an electronic device comprising:
A memory for storing a computer program;
a processor for executing the computer program to implement the method according to any of the first aspects.
The beneficial effects of the invention are as follows:
1. according to the invention, the target detection network is optimized through the convolution and self-attention fusion module ACmix, so that the advantages of the convolution neural network in the aspect of capturing local features of the image are maintained, and the perception capability of the model on the global image semantics is obviously enhanced. In the present invention, the convolution and self-attention fusion module ACmix can effectively characterize the precise location information of the finger keypoints and understand the interrelationship between them.
2. The invention carries out characteristic enhancement processing on the target detection network in a training stage, uses the RT-DETR target detector to carry out palm targeted training, and obtains the deviation of a prediction frame and a real frame of the RT-DETR. In the key point detection model training stage, according to deviation distribution in the directions of the x axis and the y axis, the GMM is used for generating additional offset of a real target frame through dense sampling, so that the target detection performance is improved, the condition of missing detection or false detection is reduced, and the influence on the subsequent key point detection is avoided.
3. The invention creates the finger key point data set in the industrial scene, combines the open-source MHP data set to carry out data enhancement, and improves the generalization capability of the target detection network.
Drawings
FIG. 1 is a schematic diagram of an overall framework of a key point detection model according to an embodiment of the present invention;
fig. 2 is a schematic diagram of specific implementation details of ACmix modules provided in an embodiment of the present invention;
FIG. 3 is a flow chart of a factory error prevention system provided by an embodiment of the invention.
Detailed Description
The invention is further described below with reference to examples. The following examples are presented only to aid in the understanding of the invention. It should be noted that it will be apparent to those skilled in the art that modifications can be made to the present invention without departing from the principles of the invention, and such modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.
Example 1:
In order to solve the problems in the prior art, as shown in fig. 1, embodiment 1 of the present application provides a finger key point detection method for a factory operation security system, including:
and step 1, acquiring input data.
Step 2, performing target detection and key point detection on the input data through a target detection network;
The target detection network comprises a network backbone structure, a neck structure and a detection structure, wherein the detection structure comprises a target detection head and a key point detection head, and the key point detection head comprises a first convolution layer, a convolution and self-attention fusion module ACmix and a second convolution layer which are sequentially connected.
Specifically, the object detection network of the application uses the open-source YOLOv backbone network as the network backbone structure. The neck structure adopts a PAN (Path Aggregation Network) feature fusion module, an effective multi-scale feature fusion technique that transfers information between features of different levels, enhancing the model's ability to detect objects of different sizes; it is widely used as a general neck module in the YOLO series. The target detection head uses the open-source YOLOv detection head, which contains convolution layers for regressing the target frame position.
The key point detection head in the original YOLO model has a simple structure, comprising only three convolution layers. To improve the accuracy of key point detection, the present application replaces the intermediate convolution layer with the ACmix module, as shown in fig. 1. The ACmix module skillfully combines a convolution mechanism and a multi-head self-attention mechanism; this design gives the model the ability to efficiently encode the spatial position and interrelationship of each finger joint. In this way, the model can accurately capture fine changes of the finger during complex motions, achieving deep capture and fusion of local detail and the global visual representation in the feature map. This fusion mechanism greatly enhances the accuracy and robustness of the model's finger key point detection.
Specifically, as shown in fig. 2, the operation of the convolution and self-attention fusion module ACmix is designed into multiple stages to fully exploit the advantages of the convolution and self-attention mechanism. The following is a detailed description of the stages:
In stage A, an input feature map is acquired. Three convolution kernels first linearly map the input feature map to generate the query, key and value weight matrices respectively. These weight matrices provide the necessary input for the subsequent self-attention mechanism, so that the model can focus on key regions in the feature map.
In the subsequent stages B and C, the module processes the feature map further by applying the convolution and self-attention mechanisms in parallel. Stage B focuses on extracting local features via convolution operations, strengthening the model's ability to capture detailed information. Stage C uses a self-attention mechanism to capture long-distance dependencies between different positions in the feature map, so that the model can better understand the spatial relationships between finger joints. Combining the two stages enables the model to capture global context information while maintaining local detail.
Finally, by introducing the learnable hyperparameters alpha and beta, the feature maps of stages B and C are weighted and summed, effectively integrating the information from both branches. The resulting output feature map has the same dimensions as the input feature map.
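The staged fusion above can be sketched in NumPy. This is a minimal illustration of the weighted-sum idea, not the patent's implementation: the feature map is flattened to (positions, channels), the "convolution" branch is stood in for by a simple local 3-tap smoothing, and the projection matrices and kernel are hypothetical names.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def acmix_fusion(x, w_q, w_k, w_v, alpha=0.5, beta=0.5):
    """Sketch of the ACmix-style fusion described above (hypothetical shapes).

    x: (N, C) flattened feature map (N spatial positions, C channels).
    w_q, w_k, w_v: (C, C) projection matrices playing the role of the
    stage-A convolution kernels that produce query/key/value.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Stage B stand-in: local aggregation of each position with its
    # neighbours (a 3-tap average over the flattened positions).
    kernel = np.array([0.25, 0.5, 0.25])
    padded = np.pad(v, ((1, 1), (0, 0)), mode="edge")
    f_conv = (kernel[0] * padded[:-2] + kernel[1] * padded[1:-1]
              + kernel[2] * padded[2:])
    # Stage C: self-attention captures long-distance dependencies
    # between positions.
    attn = softmax(q @ k.T / np.sqrt(x.shape[1]), axis=-1)
    f_att = attn @ v
    # Learnable scalars alpha/beta weight the two branches; the output
    # keeps the input dimensions, as the description requires.
    out = alpha * f_conv + beta * f_att
    return out
```

In a trained model, alpha and beta would be optimized jointly with the network weights rather than fixed at 0.5.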
Example 2:
on the basis of embodiment 1, embodiment 2 of the present application provides a more specific finger key point detection method for a factory operation safety system, comprising:
and step 1, acquiring input data.
In step 1, the input data is subjected to data enhancement processing. The data enhancement processing comprises training the model on pictures of the actual application scene, mixed in a certain proportion with part of an open-source finger key point dataset for subsequent training and verification. Illustratively, data collected in an industrial scene is mixed with a screened portion of the open-source Multiview Handpose dataset; removing pictures of similar scenes or similar gestures enhances the detection capability of the model from the data side. The dataset for model training and verification contains 2091 pictures in total, involving 2115 hand instances.
Step 2, performing target detection and key point detection on the input data through a target detection network;
The target detection network comprises a network backbone structure, a neck structure and a detection structure, wherein the detection structure comprises a target detection head and a key point detection head, and the key point detection head comprises a first convolution layer, a convolution and self-attention fusion module ACmix and a second convolution layer which are sequentially connected.
The object detection network in this embodiment uses the open-source YOLOv backbone network as the network backbone structure. It should be noted that, for this part of the YOLO network structure, a suitable scaling scale can be flexibly selected according to the specific hardware resources, particularly the video memory of the graphics card. The scale of the network can be changed through scaling: the larger the network parameters, the more video memory is occupied, but the better the effect.
Furthermore, the YOLOv network architecture is not fixed; it is highly flexible and can easily be replaced with other YOLO series versions, including but not limited to YOLOv, YOLOv7 and YOLOv. Other backbone networks can also be substituted according to specific requirements; verification shows that applying the invention to three backbone networks, EfficientNet, ResNet and HRNet, also improves model performance.
It should be noted that the YOLO-based keypoint detection method follows a top-down strategy: it first detects the target and determines the target frame, and then accurately regresses the finger keypoints according to the position of the target frame. The accuracy of target detection is therefore critical, as it directly affects the final quality of finger keypoint detection. To enhance detection performance, the application performs feature enhancement processing on the target detection network in the training stage.
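The top-down strategy just described can be sketched as follows. The two callables are hypothetical stand-ins for the trained detection and keypoint heads; the point is the control flow of detecting boxes first and then mapping box-relative keypoint predictions back to image coordinates.

```python
import numpy as np

def topdown_keypoints(image, detect_boxes, regress_keypoints):
    """Top-down keypoint detection: detect target frames first, then
    regress finger keypoints inside each frame.

    detect_boxes(image) -> list of (x1, y1, x2, y2) boxes.
    regress_keypoints(image, box) -> (K, 2) keypoints relative to the
    box, with coordinates in [0, 1].  Both are hypothetical interfaces.
    """
    results = []
    for (x1, y1, x2, y2) in detect_boxes(image):
        rel = regress_keypoints(image, (x1, y1, x2, y2))
        # Map box-relative predictions back to absolute image coordinates.
        abs_kpts = rel * np.array([x2 - x1, y2 - y1]) + np.array([x1, y1])
        results.append(((x1, y1, x2, y2), abs_kpts))
    return results
```

Because every keypoint depends on its box, a missed or misplaced box propagates directly into the keypoint output, which is why the patent invests in improving box detection.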
The feature enhancement processing comprises: first, using an open-source palm target detection dataset to conduct targeted training of a Real-time Detection Transformer (RT-DETR) model, which is difficult to train but performs excellently. Through this training process, the deviations dx and dy between the target frame predicted by RT-DETR and the real target frame along the x and y axes of the image are obtained; the distribution of these deviations approximately follows a Gaussian distribution. To further exploit these deviation data, this design models dx and dy with a Gaussian Mixture Model (GMM) and samples from it. This process generates a series of supplementary feature values that are blended into the real data during training as an extension of the original training set. This feature enhancement effectively improves the target detection capability of the model, enabling more accurate detection of finger key points.
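The GMM modelling-and-sampling step can be sketched with scikit-learn. The deviation data here are synthetic stand-ins: in the patent's pipeline, dx/dy would come from comparing RT-DETR predictions with ground-truth palm boxes, and the box format and component count are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the measured prediction/ground-truth offsets.
rng = np.random.default_rng(0)
deviations = np.column_stack([
    rng.normal(0.0, 2.0, 500),   # dx: x-axis offsets (pixels, synthetic)
    rng.normal(0.0, 3.0, 500),   # dy: y-axis offsets (pixels, synthetic)
])

# Fit a GMM to the joint (dx, dy) distribution and sample supplementary
# offsets with which real target frames can be jittered during training.
gmm = GaussianMixture(n_components=2, random_state=0).fit(deviations)
samples, _ = gmm.sample(200)

def augment_box(box, offset):
    """Shift an (x, y, w, h) ground-truth box by a sampled (dx, dy)."""
    x, y, w, h = box
    return (x + offset[0], y + offset[1], w, h)

augmented = [augment_box((100.0, 50.0, 40.0, 40.0), s) for s in samples]
```

Sampling from the fitted distribution, rather than adding uniform noise, keeps the synthetic offsets statistically consistent with the detector's real error profile.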
The experimental results obtained by applying the innovative approach in this design to YOLOv, YOLOv8 and YOLOv9 are shown in the following table. The evaluation index in table 1 is mAP (mean Average Precision), a common performance index in target detection and keypoint detection tasks, computed from precision, recall and the intersection-over-union ratio (IoU). The box mAP metrics measure the model's ability to detect hand bounding boxes, and the keypoint mAP metrics measure its ability to detect finger key points; mAP at an IoU threshold of 0.5 refers to the accuracy of the model's detection results at that threshold, while mAP over 0.5 to 0.95 refers to the average accuracy across IoU thresholds from 0.5 to 0.95. After applying the GMM-based feature enhancement method and the ACmix module of this design, the performance of the YOLO model is improved.
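The IoU quantity underlying these metrics can be sketched directly; this is a generic illustration of the standard definition, not code from the patent.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes — the
    quantity thresholded at 0.5 (and averaged over 0.5:0.95) in mAP."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A detection counts as correct at the 0.5 setting when IoU >= 0.5;
# the 0.5:0.95 setting averages over thresholds 0.50, 0.55, ..., 0.95.
thresholds = np.arange(0.5, 1.0, 0.05)
```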
Table 1 performance of the application on YOLOv, YOLOv8, YOLOv9
In this embodiment, the same or similar parts as those in embodiment 1 may be referred to each other, and will not be described in detail in the present disclosure.
Example 3:
on the basis of the embodiments 1 and 2, the embodiment 3 of the application provides a vision-based factory error prevention system, and in the actual production environment of a factory, the correct execution of a workflow is a key link for ensuring the safe and stable operation of the factory, and the personnel safety and the production efficiency are directly related. With the rapid development of Artificial Intelligence (AI) and computer vision technology, vision-based anti-misoperation systems provide new solutions to improve operation accuracy and safety. The system effectively prevents the occurrence of human misoperation accidents by analyzing the behaviors of operators in real time.
The vision-based factory error prevention system comprises an image input unit, a detection unit, a calculation unit and a judgment unit;
Wherein the image input unit is used for acquiring an operation image. For example, as shown in fig. 3, the image input unit may be a wearable device (such as AR glasses), and may acquire an operation image in real time, such as acquiring any video frame from an operation video as the operation image.
The detection unit is used for executing the finger key point detection method, detecting the positions of the target and the finger key points in the operation image. This function is critical to accurately judging the operator's actions. In addition, the system integrates various algorithms, including detection algorithms for multi-target states such as factory switches and personnel, OCR algorithms for recognizing text or digital content on an electrical cabinet, and two-dimensional code recognition algorithms.
The calculating unit calculates the distance between the target and the finger key point according to the positions of the target and the finger key point in the operation image, and determines the position and the state of the target closest to the finger key point.
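The calculating unit's nearest-target logic can be sketched as follows; the (name, state, position) target format is a hypothetical representation of the detection unit's output, used only for illustration.

```python
import numpy as np

def nearest_target(targets, fingertip):
    """Return the target closest to a finger key point and its distance.

    targets: list of (name, state, (x, y)) entries — a hypothetical
    encoding of the detection unit's output.
    fingertip: (x, y) position of the detected finger key point.
    """
    pts = np.array([t[2] for t in targets], dtype=float)
    dists = np.linalg.norm(pts - np.asarray(fingertip, dtype=float), axis=1)
    i = int(np.argmin(dists))
    return targets[i], float(dists[i])
```

The judging unit would then compare the returned target's name and state against the workflow specification and raise an alarm on any mismatch.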
The judging unit judges whether the executed operation is accurate or not according to the target position closest to the finger key point and the state thereof. Once the target pointed by the finger is found to be inconsistent with the target appointed in the workflow or the state of the target is inconsistent with the operation requirement, the system can immediately send out a warning, and misoperation is effectively prevented.
Example 4:
On the basis of embodiments 1 and 2, embodiment 4 of the present application provides a finger key point detection device for a factory operation safety system, including:
the acquisition module is used for acquiring input data;
The detection module is used for carrying out target detection and key point detection on the input data through a target detection network;
The target detection network comprises a network backbone structure, a neck structure and a detection structure, wherein the detection structure comprises a target detection head and a key point detection head, and the key point detection head comprises a first convolution layer, a convolution and self-attention fusion module ACmix and a second convolution layer which are sequentially connected.
Specifically, the system provided in this embodiment is a system corresponding to the method provided in embodiments 1 and 2, so that the portions in this embodiment that are the same as or similar to those in embodiments 1 and 2 may be referred to each other, and will not be described in detail in this disclosure.

Claims (9)

1. A finger key point detection method for a factory operation safety system, comprising:
step 1, acquiring input data;
step 2, performing target detection and key point detection on the input data through a target detection network;
The target detection network comprises a network backbone structure, a neck structure and a detection structure, wherein the detection structure comprises a target detection head and a key point detection head, and the key point detection head comprises a first convolution layer, a convolution and self-attention fusion module ACmix and a second convolution layer which are sequentially connected.
2. The method for finger keypoint detection for a plant operation safety system according to claim 1, wherein in step 2, the operation of the convolution and self-attention fusion module ACmix comprises:
Acquiring an input feature map, and performing linear mapping on the input feature map using convolution kernels to generate weight matrices;
extracting local features of the input feature map by convolution operation according to the weight matrix, and obtaining a first feature map;
Capturing long-distance dependency relations among different positions in the feature map by using a self-attention mechanism, and acquiring a second feature map according to the long-distance dependency relations;
and carrying out weighted summation on the first feature map and the second feature map through the learnable hyperparameters alpha and beta to obtain an output feature map, wherein the dimensions of the output feature map are consistent with those of the input feature map.
3. The finger keypoint detection method for a plant operation safety system according to claim 2, wherein in step 2, the object detection network performs a feature enhancement process in a training phase, the feature enhancement process comprising:
Training the RT-DETR model by utilizing a palm target detection data set to obtain deviations dx and dy of a target frame predicted by the RT-DETR and a real target frame in the directions of an x axis and a y axis of the image;
modeling dx and dy by adopting a Gaussian mixture model GMM, and sampling to obtain a supplementary characteristic value;
And adding the supplementary feature values into an original training set, obtaining an extended training set, and training a target detection network according to the extended training set.
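The GMM modelling and sampling step of claim 3 can be illustrated with a minimal 1D expectation-maximization fit. This is a hedged sketch, not the patented pipeline: the deviation data is synthetic, the component count and iteration budget are arbitrary, and a production system would more likely use a library fitter (e.g. scikit-learn's GaussianMixture) on the real RT-DETR/ground-truth offsets.

```python
import numpy as np

def fit_gmm_1d(data, k=2, iters=50, seed=0):
    """Minimal EM fit of a k-component 1D Gaussian mixture (illustrative).

    Stands in for the GMM of claim 3 that models the per-axis deviations
    dx, dy between RT-DETR's predicted boxes and the ground-truth boxes.
    """
    rng = np.random.default_rng(seed)
    mu = rng.choice(data, size=k, replace=False).astype(float)
    var = np.full(k, data.var() + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each data point
        d2 = (data[:, None] - mu[None, :]) ** 2
        log_p = -0.5 * (np.log(2 * np.pi * var) + d2 / var) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means and variances
        nk = r.sum(axis=0) + 1e-12
        pi = nk / len(data)
        mu = (r * data[:, None]).sum(axis=0) / nk
        var = (r * d2).sum(axis=0) / nk + 1e-6
    return pi, mu, var

def sample_gmm(pi, mu, var, n, seed=1):
    """Draw n supplementary deviation values from the fitted mixture."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(pi), size=n, p=pi)
    return rng.normal(mu[comp], np.sqrt(var[comp]))

# Synthetic dx deviations (pixels): two clusters of prediction offsets
dx = np.concatenate([np.random.default_rng(2).normal(-1, 0.3, 200),
                     np.random.default_rng(3).normal(4, 0.5, 200)])
pi, mu, var = fit_gmm_1d(dx)
extra_dx = sample_gmm(pi, mu, var, 100)   # supplementary feature values
assert extra_dx.shape == (100,)
```

The sampled values would then perturb ground-truth boxes (or otherwise extend the training set) so the detector sees deviation patterns matching those RT-DETR actually produces.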
4. The finger key point detection method for a factory operation safety system according to claim 3, wherein step 1 further comprises performing data enhancement processing on the input data, the data enhancement processing comprising mixing actual application scene pictures with a finger key point data set.
5. The finger key point detection method for a factory operation safety system according to claim 4, wherein in step 2 the neck structure is a feature fusion module PAN.
6. A finger key point detection device for a factory operation safety system, configured to perform the finger key point detection method according to any one of claims 1 to 5, comprising:
an acquisition module for acquiring input data; and
a detection module for performing target detection and key point detection on the input data through a target detection network;
wherein the target detection network comprises a backbone structure, a neck structure and a detection structure; the detection structure comprises a target detection head and a key point detection head; and the key point detection head comprises a first convolution layer, a convolution and self-attention fusion module ACmix, and a second convolution layer connected in sequence.
7. A vision-based factory error-prevention system, comprising an image input unit, a detection unit, a calculation unit and a judgment unit;
wherein the image input unit is used to acquire an operation image;
the detection unit performs the finger key point detection method according to any one of claims 1 to 5 to detect the positions of targets and finger key points in the operation image;
the calculation unit calculates the distances between the targets and the finger key points from their positions in the operation image, and determines the position and state of the target closest to a finger key point;
and the judgment unit judges, from the position and state of the target closest to the finger key point, whether the performed operation is correct, and raises an alarm if it is not.
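The calculation unit's nearest-target search in claim 7 reduces to a distance query between a finger key point and the detected target boxes. The sketch below is an assumption-laden illustration, not the patented unit: it takes targets as (x1, y1, x2, y2) boxes, uses box centres and Euclidean distance, and ignores the target-state lookup that the judgment unit would consume.

```python
import numpy as np

def nearest_target(finger_kp, target_boxes):
    """Find the target box whose centre is closest to a finger key point.

    Sketch of the calculation unit in claim 7: targets are (x1, y1, x2, y2)
    pixel boxes, the finger key point is an (x, y) coordinate, and the
    function returns the index of, and distance to, the nearest centre.
    """
    boxes = np.asarray(target_boxes, dtype=float)
    # centre of each detected target box
    centres = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    # Euclidean distance from the key point to every centre
    d = np.linalg.norm(centres - np.asarray(finger_kp, dtype=float), axis=1)
    i = int(np.argmin(d))
    return i, float(d[i])

# fingertip at (100, 100); two candidate switch targets in the image
idx, dist = nearest_target((100, 100), [(90, 90, 110, 110), (300, 300, 340, 340)])
assert idx == 0 and dist == 0.0  # fingertip sits on the first box's centre
```

The judgment unit would then compare the state of target `idx` against the expected operation and trigger an alarm on a mismatch.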
8. A computer storage medium storing a computer program which, when run on a computer, causes the computer to perform the finger key point detection method according to any one of claims 1 to 5.
9. An electronic device, comprising:
a memory for storing a computer program; and
a processor for executing the computer program to implement the finger key point detection method according to any one of claims 1 to 5.
CN202411604718.1A 2024-11-12 2024-11-12 A finger key point detection method for factory operation safety system Pending CN119399795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411604718.1A CN119399795A (en) 2024-11-12 2024-11-12 A finger key point detection method for factory operation safety system

Publications (1)

Publication Number Publication Date
CN119399795A true CN119399795A (en) 2025-02-07

Family

ID=94420936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411604718.1A Pending CN119399795A (en) 2024-11-12 2024-11-12 A finger key point detection method for factory operation safety system

Country Status (1)

Country Link
CN (1) CN119399795A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120279452A (en) * 2025-06-03 2025-07-08 浙江浙能数字科技有限公司 Factory switch error prevention method and system based on image recognition


Similar Documents

Publication Publication Date Title
CN110084299B (en) Target detection method and device based on multi-head fusion attention
CN109376631B (en) Loop detection method and device based on neural network
CN109522963A (en) A kind of the feature building object detection method and system of single-unit operation
CN113269089A (en) Real-time gesture recognition method and system based on deep learning
Geng et al. An improved helmet detection method for YOLOv3 on an unbalanced dataset
CN112861678B (en) Image recognition method and device
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
KR20240144139A (en) Facial pose estimation method, apparatus, electronic device and storage medium
CN119399795A (en) A finger key point detection method for factory operation safety system
CN114399729B (en) Monitoring object movement identification method, system, terminal and storage medium
CN113706481A (en) Sperm quality detection method, sperm quality detection device, computer equipment and storage medium
CN112651294A (en) Method for recognizing human body shielding posture based on multi-scale fusion
Zhou et al. Object detection in low-light conditions based on DBS-YOLOv8
Wang et al. YOLO-RLC: An Advanced Target-Detection Algorithm for Surface Defects of Printed Circuit Boards Based on YOLOv5.
WO2021169642A1 (en) Video-based eyeball turning determination method and system
CN119516161B (en) Personnel state recognition method and system based on target recognition detection
Sun et al. PIDNet: An efficient network for dynamic pedestrian intrusion detection
CN115984712A (en) Method and system for small target detection in remote sensing images based on multi-scale features
CN118298513B (en) Power operation violation detection method and system based on machine vision
Lee et al. Data and model uncertainty aware salient object detection
CN118115540A (en) Three-dimensional target tracking method, device, equipment and storage medium
Moseva et al. Algorithm for Predicting Pedestrian Behavior on Public Roads
Liu et al. Research on an improved yolov5s algorithm for detecting helmets on construction sites
Tao et al. Detection research of insulating gloves wearing status based on improved YOLOv8s algorithm
CN111353349B (en) Human body key point detection method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination