CN111291593B - Method for detecting human body posture - Google Patents
- Publication number
- CN111291593B (application CN201811492525.6A)
- Authority
- CN
- China
- Prior art keywords
- human body
- neural network
- layer
- image
- network model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for detecting human body posture, which comprises the following steps: inputting a preprocessed human body image to be detected into a pre-trained neural network model to obtain a predetermined number of heatmaps, wherein each heatmap corresponds to one human body joint point; the neural network model comprises, connected in sequence, the first 14 layers of a MobileNetV2 network, a dimension conversion layer, a first upsampling layer, a first convolutional neural network layer, a BN regularization layer, a ReLU activation function layer, a second upsampling layer and a second convolutional neural network layer, and all convolution operations in the model are separable convolutions; acquiring the predetermined number of human body joint point coordinates from the heatmaps; and scaling each joint point coordinate back to the image coordinate system of the human body image to be detected, thereby acquiring the human body posture joint points of that image. The technical scheme provided by the invention enables real-time detection of human body posture on terminals with small memory and limited CPU and GPU computing capacity.
Description
Technical Field
The invention relates to the technical field of deep learning, in particular to a method for detecting human body postures.
Background
Human body posture detection is applicable in many fields: in security, it can be used to recognize human behavior; in gaming and entertainment, it can make games more engaging. Posture detection ultimately reduces to the detection of human posture joint points.
At present there are two main methods for detecting human posture joint points. The first is direct joint-point regression, in which a network model directly outputs the joint coordinates. The second is heatmap regression, in which the network outputs a set of heatmaps, one heatmap per joint, which are then post-processed to obtain the final joint coordinates. Direct regression usually performs poorly because of large variations in pose, clothing and background, and such models are hard to train to convergence, so a good usable model is difficult to obtain. Heatmap regression is more accurate, but existing models have complex structures and huge sizes: they are also hard to train and cannot be deployed on terminals with small memory and limited CPU or GPU computing capacity, which greatly limits the application and popularization of human posture detection.
Disclosure of Invention
The invention aims to provide a method for detecting human body posture that can run in real time on a terminal with small memory and limited CPU or GPU computing capacity.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
A method of detecting human posture, comprising: inputting a preprocessed human body image to be detected into a pre-trained neural network model and acquiring a predetermined number of heatmaps, wherein each heatmap corresponds to one human body joint point; the neural network model comprises, connected in sequence, the first 14 layers of a MobileNetV2 network, a dimension conversion layer, a first upsampling layer, a first convolutional neural network layer, a BN regularization layer, a ReLU activation function layer, a second upsampling layer and a second convolutional neural network layer; all convolution operations in the neural network model adopt separable convolutions; acquiring the predetermined number of human body joint point coordinates from the heatmaps; and scaling each human body joint point coordinate to the image coordinate system of the human body image to be detected, thereby acquiring the human body posture joint points of the image.
Preferably, training the neural network model comprises: marking a human body frame and joint points on a pre-acquired original training image; cropping the original training image according to the human body frame to obtain a cropped image; scaling the cropped image by a preset ratio and padding it to a preset size to obtain a training input image; converting the joint-point coordinates marked in the original training image into coordinates in the training input image and generating a ground truth value with a two-dimensional Gaussian distribution function; and training the neural network model with the training input image and the ground truth value.
Preferably, the size of the training input image is 240 × 192, and the size of the ground truth value is 60 × 48.
Preferably, the loss function of the neural network model adopts a mean square loss function:
loss(x, y) = (x − y)²
wherein x is the predicted value of the neural network model, and y is the ground truth value.
Further, the method also comprises: in the process of training the neural network model, optimizing the neural network model by adopting an Adam optimization function.
Preferably, the first upsampling layer and the second upsampling layer both adopt 2 times upsampling; the first convolutional neural network layer and the second convolutional neural network layer are both 3 x 3 convolutional neural networks.
Preferably, the pre-trained neural network model is run on a mobile terminal; and the human body image to be detected is acquired by the mobile terminal.
The method for detecting human body posture provided by the embodiments of the invention replaces existing complex posture-detection network models with a simple, efficient custom neural network model in which all convolutions are separable convolutions. The simplified network structure and the separable convolutions greatly reduce the computation and the size of the model and make training easier. Compared with the prior art, the technical scheme provided by the invention can run smoothly on mobile terminals with small memory and limited CPU and GPU computing capability, achieving real-time detection of human posture.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a neural network model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the first 14-layer network structure of MobileNetV2 in the embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a bottleneck network of MobileNetV2 according to an embodiment of the present invention;
FIG. 5 is a visual representation of a hotspot graph in accordance with an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
The invention requires a simple and efficient deep neural network model that can run efficiently on a mobile terminal. In the embodiment of the invention, to support efficient forward inference, the input image height and width are defined as 240 × 192 and the output heatmap height and width as 60 × 48.
Much experimental research on deep neural networks shows that deeper networks can extract more specific high-dimensional features and perform better, but they are also harder to train: gradients vanish, training may fail to converge, and extremely deep networks are unsuitable for running on a mobile terminal. The invention therefore customizes a simple and efficient CNN (Convolutional Neural Network) with residual connections for forward inference. A CNN extracts different features at different layers, and the higher the layer, the more heavily downsampled the features; a residual structure fuses low-dimensional and high-dimensional features, so the multi-layer CNN design can repeatedly capture the information in the input image at different scales and obtain better feature-extraction results.
Training the customized neural network model comprises: marking a human body frame and joint points on pre-acquired original training images; cropping each original training image according to the human body frame to obtain a cropped image; scaling the cropped image by a preset ratio and padding it to a preset size to obtain a training input image; converting the joint-point coordinates marked in the original image into coordinates in the training input image and generating ground truth values with a two-dimensional Gaussian distribution function; and training the neural network model with the training input images and ground truth values. The trained model can then detect posture joint points in an input image. During this process, common augmentation operations such as mirroring, rotation, scaling and color jitter (e.g. increasing or decreasing contrast and saturation) can be applied to the training data, together with normalization and regularization.
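As a concrete illustration of the ground-truth step described above, the sketch below generates one 60 × 48 heatmap per joint from a two-dimensional Gaussian centred on the converted joint coordinate. The function name, the sigma value and the example joint coordinates are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def make_gt_heatmap(hm_h, hm_w, cx, cy, sigma=2.0):
    """Ground-truth heatmap: a 2-D Gaussian centred on joint (cx, cy)
    in heatmap coordinates, with peak value 1.0."""
    ys, xs = np.mgrid[0:hm_h, 0:hm_w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# One heatmap per joint: 17 joints give a (17, 60, 48) ground-truth tensor.
joints = [(24, 30)] * 17  # hypothetical joint coordinates in 60x48 space
gt = np.stack([make_gt_heatmap(60, 48, x, y) for x, y in joints])
```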
The loss function of the neural network model in the embodiment of the invention adopts a mean square loss function:
loss(x, y) = (x − y)²
wherein x is the predicted value of the neural network model and y is the ground truth value. That is, the squared difference between x and y is computed pixel by pixel to measure how far the prediction is from the ground truth; the smaller this value, the better.
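In code, the mean-square loss above amounts to a pixel-wise squared difference averaged over all heatmap pixels. This is a minimal sketch; the patent does not state the reduction, so the mean (as in a standard MSELoss) is assumed.

```python
import numpy as np

def mse_loss(pred, gt):
    """Mean of (x - y)^2 over every pixel of every heatmap."""
    return float(np.mean((pred - gt) ** 2))
```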
In the process of training the neural network model, an Adam optimization function is adopted. Adam is a first-order optimization algorithm that can replace traditional stochastic gradient descent and iteratively updates the network weights based on the training data. The convolution operations in the model all use separable convolutions, which reduces both the computation and the model size.
To use the model on a mobile terminal, it is converted into the Open Neural Network Exchange (ONNX) format, and the ONNX model is then converted into the model format of the mobile inference framework, for example Apple's Core ML, Caffe2, or a format supported by another third-party feed-forward inference framework. ONNX is an intermediate format: as long as the inference framework provides a tool supporting ONNX conversion, the model can be converted into the format the framework requires. On the mobile terminal, camera data is obtained through the Application Programming Interface (API) provided by the platform, and each video frame is scaled to the specified size. Assuming by default that only one person appears in the camera data, human-body-frame detection can be omitted, saving considerable time for posture detection. The camera frame is scaled directly to the network input size of 240 × 192; the image content may be slightly stretched or compressed, but this has little effect on a robust neural network. The scaled frame is fed into the trained model, converted for the mobile terminal, to obtain the predicted heatmaps. Each predicted heatmap is then traversed to find its maximum, whose position gives the coordinate of the corresponding posture joint point.
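The heatmap post-processing described above, traversing each heatmap for its maximum and scaling the coordinate back to the original frame, can be sketched as follows. The function and argument names are illustrative, not from the patent.

```python
import numpy as np

def heatmaps_to_joints(heatmaps, frame_w, frame_h):
    """For each heatmap, the argmax position is the joint; scale it
    from heatmap space (e.g. 48x60) back to the original frame size."""
    n, hm_h, hm_w = heatmaps.shape
    joints = []
    for k in range(n):
        flat = int(np.argmax(heatmaps[k]))   # traverse: global maximum
        y, x = divmod(flat, hm_w)            # flat index -> (row, col)
        joints.append((x * frame_w / hm_w, y * frame_h / hm_h))
    return joints
```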
The specific structure of the neural network model defined by the present invention is described below:
As shown in fig. 2, the neural network model comprises, connected in sequence, the first 14 layers of the MobileNetV2 network, a dimension conversion layer, a first upsampling layer, a first convolutional neural network layer, a BN regularization layer, a ReLU activation function layer, a second upsampling layer and a second convolutional neural network layer. Fig. 3 shows the first 14 layers of MobileNetV2, where t is the channel expansion factor, c is the number of output channels, n is the number of times the bottleneck structure is repeated, and s is the stride used by the filter in the CNN. There are 5 groups of bottleneck structures, and the feature map produced after each group becomes smaller, embodying the idea that lower network layers extract abstract features while higher layers extract more specific features.
Fig. 4 shows the bottleneck structure of MobileNetV2. A bottleneck first expands the channel dimension, then performs the CNN convolution, and finally reduces the dimension again, repeatedly extracting feature data. Whether a shortcut connection is used depends on the stride s and on whether the input and output channel counts are equal: when n > 1, the first bottleneck of each group uses the listed s and the repeated layers use s = 1; and when s = 1 and the input dimension equals the output dimension, the bottleneck has a shortcut connection, i.e. the residual-network concept.
After the input data passes through the first 14 layers of the MobileNetV2 network and the dimension conversion layer, it enters the posture joint-point feature-extraction and upsampling network, whose input is the output of the previous layer. The network first upsamples the feature height and width by a factor of 2: if the input is (r²C, H, W), upsampling by a factor r produces (C, rH, rW). With 2× upsampling r = 2, so after the first upsampling layer PixelShuffle the number of channels is divided by r² and H and W are enlarged r times.
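The PixelShuffle rearrangement described here, channels divided by r² while height and width are multiplied by r, can be reproduced in NumPy as a small sketch (the semantics follow torch.nn.PixelShuffle):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) tensor into (C, H*r, W*r)."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)     # split channel axis into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)   # interleave: (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# The first upsampling layer in the text: (512, 15, 12) -> (128, 30, 24).
out = pixel_shuffle(np.zeros((512, 15, 12)), 2)
```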
After the first upsampling layer, features are extracted again by a 3 × 3 Conv (the first convolutional neural network layer), followed by BN (Batch Normalization) regularization and a ReLU activation function that gives the data more expressive power, and then a second upsampling layer. This step reduces the number of channels while making the feature expression of the data more pronounced. The second convolutional neural network layer, another 3 × 3 Conv, produces the final heatmap prediction, with its output channel count set to the number of joint points to predict; this completes the construction of the whole posture joint-point network. All convolutions in the network are separable convolutions: each channel is first convolved with its own filter, and the resulting per-channel feature maps are then combined by a standard 1 × 1 cross-channel convolution. These two steps reduce the parameter count of a traditional convolution to roughly one ninth, greatly shrinking the model, and the reduced parameter count likewise greatly reduces the computation.
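The one-ninth claim can be checked arithmetically. For a k × k kernel, a standard convolution needs k²·C_in·C_out weights, while a depthwise-separable convolution needs k²·C_in (one filter per channel) plus C_in·C_out (the 1 × 1 pointwise step); the ratio is 1/C_out + 1/k², which approaches 1/9 for k = 3 and large C_out. A small sketch, with illustrative channel sizes:

```python
def conv_params(c_in, c_out, k=3):
    """Weight counts of a standard vs. a depthwise-separable convolution."""
    standard = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out  # depthwise + 1x1 pointwise
    return standard, separable

std, sep = conv_params(128, 256)  # channel sizes chosen for illustration
ratio = sep / std                 # about 1/256 + 1/9, close to one ninth
```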
To better illustrate the whole network flow, the data flow of the whole network is illustrated here as an example:
the input of the image data filled by cropping, scaling and data augmentation while normalizing the regularization is (3 × 240 × 192), 3 represents that the image data is 3 channels, 240 represents image height, 192 represents image width, and the characteristic output after the front 14 layer network structure of MobileNetV2 is output (96, 15, 12), and then a dimension transformation layer is used to expand (96, 15, 12) to the output (512, 15, 12) dimension, which is expanded to 512 by a 1 × 1 convolutional layer operation and then to Batch normalization and ReLU6. The dimension is expanded to increase the expression capability of the feature data and correspond to the input height and width of the subsequent attitude joint feature extraction network. At this point 512 represents 512 channel dimensions, 15 represents feature height, and 12 represents feature width.
The (512, 15, 12) features are input into the posture joint-point feature-extraction network. After the first PixelShuffle upsampling layer the output is (128, 30, 24): the channel dimension is reduced while the height and width are enlarged. PixelShuffle is used because, in this process of enlarging an image from low to high resolution, the interpolation parameters are implicitly contained in the preceding convolutional layer and are learned automatically; and since PixelShuffle merely rearranges pixels, it is very efficient.
The (128, 30, 24) features then pass through the first convolutional neural network layer (3 × 3 Conv, output dimension 256), the BN regularization layer, the ReLU activation function layer and a second PixelShuffle upsampling, giving (64, 60, 48). A final 3 × 3 convolution with stride 1 and padding 1 produces the heatmap output (N, 60, 48), where N is the number of joint points, 60 the heatmap output height defined above, and 48 the heatmap output width. If N is defined as 17, 17 joint points are output: the second PixelShuffle layer outputs 64 channels, the final 3 × 3 convolutional layer takes those 64 channels as input, and its output channel dimension is N = 17, i.e. seventeen 60 × 48 heatmaps.
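The shape bookkeeping of the whole head can be replayed in a few lines. This sketch only traces the dimensions stated above (the backbone output (96, 15, 12) for a 3 × 240 × 192 input is taken from the text; it performs no actual computation):

```python
def head_output_shape(n_joints=17):
    """Trace (C, H, W) through the posture head described in the text."""
    c, h, w = 96, 15, 12            # first 14 MobileNetV2 layers on 3x240x192
    c = 512                         # dimension conversion: 1x1 conv + BN + ReLU6
    c, h, w = c // 4, h * 2, w * 2  # PixelShuffle r=2: C/r^2, H*r, W*r
    c = 256                         # first 3x3 conv
    c, h, w = c // 4, h * 2, w * 2  # second PixelShuffle
    c = n_joints                    # final 3x3 conv: one heatmap per joint
    return c, h, w
```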
The training process of the neural network model is as follows. Using the pre-marked human body frames and joint-point data, each single-person frame is cropped out, scaled and padded to the defined input size, with data augmentation, normalization and regularization applied. The marked joint coordinates are converted into the coordinate system of the final 240 × 192 input image, and ground truth values are generated with a two-dimensional Gaussian distribution function, one heatmap per joint, so 17 joints yield 17 heatmaps. The difference between the prediction and the ground truth is evaluated with the MSELoss mean-square loss function, and gradients are computed and applied with the Adam optimization algorithm to update the weights of the whole network. The learning rate is set to 0.001 and training runs for 100 epochs; training can be done in batches, and with a batch size of 100 the input shape is (100, 3, 240, 192). This method achieves more than 80% accuracy on the COCO dataset while the model is only about 6 MB, which is sufficient for real-time posture detection on a mobile terminal. A visualization of a heatmap is shown in fig. 5, where the white point marks the corresponding joint point.
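Adam, used above with learning rate 0.001, keeps exponential moving averages of the gradient and of its square and applies bias-corrected update steps. A minimal NumPy sketch of a single update follows; the patent relies on a framework implementation, so this is purely illustrative, with the standard default hyperparameters assumed.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for weights w at step t (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad       # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2  # second-moment (uncentered var) estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```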
After the trained neural network model is obtained, the source model is converted into a target model that can run on the mobile terminal via the Open Neural Network Exchange (ONNX) intermediate format, i.e. source model -> ONNX -> target model, for example a Core ML model for iOS, a Caffe2 model, or the model of another third-party neural network inference framework. Note that if the feed-forward inference framework has an unsupported operator, a custom implementation layer must be added. In this embodiment, video frame data acquired from the camera is scaled directly to 240 × 192 and fed to the network model for feed-forward inference. After the seventeen 60 × 48 heatmaps are obtained, they are processed to obtain the corresponding human body joint coordinates, and each coordinate is then converted back into the image coordinate system of the unscaled human body image to be detected, yielding the posture joint points of that image.
The method for detecting human body posture provided by the embodiments of the invention replaces existing complex posture-detection network models with a simple, efficient custom neural network model in which all convolutions are separable convolutions. The simplified structure and the separable convolutions greatly reduce the computation and size of the model and make training easier, saving time and cost. Compared with the prior art, the technical scheme can run smoothly on mobile terminals with small memory and limited CPU and GPU computing capability, achieves real-time human posture detection, and can further be applied to motion-sensing games, body shaping, joint-based decoration of the human figure, and other entertaining applications on mobile terminals.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.
Claims (6)
1. A method of detecting a human pose, comprising:
inputting the preprocessed human body image to be detected into a pre-trained neural network model, and acquiring a predetermined number of heatmaps, wherein each heatmap corresponds to one human body joint point; the neural network model comprises, connected in sequence, a first 14 layers of a MobileNetV2 network, a dimension conversion layer, a first upsampling layer, a first convolutional neural network layer, a BN regularization layer, a ReLU activation function layer, a second upsampling layer and a second convolutional neural network layer; the convolution operations in the neural network model all adopt separable convolution operations;
acquiring the predetermined number of human body joint point coordinates from the predetermined number of heatmaps;
scaling each human body joint point coordinate to an image coordinate system of the human body image to be detected, and acquiring a human body posture joint point of the human body image to be detected;
pre-training the neural network model comprises:
marking a human body frame and joint points on an original training image acquired in advance;
cropping the original training image according to the human body frame to obtain a cropped image;
scaling the cropped image by a preset ratio and padding it to a preset size to obtain a training input image;
converting the coordinates of the joint points marked in the original training image into the coordinates in the training input image, and generating a ground truth value by adopting a two-dimensional Gaussian distribution function;
and training the neural network model by adopting the training input image and the ground truth value.
2. The method of detecting human body posture according to claim 1, wherein the training input images are 240 × 192 in size, and the ground truth values are 60 × 48 in size.
3. The method for detecting human body posture as claimed in claim 1, wherein the loss function of the neural network model adopts a mean square loss function:
loss(x, y) = (x − y)²
wherein x is the predicted value of the neural network model, and y is the ground truth value.
4. The method of detecting human body posture according to claim 1, further comprising: in the process of training the neural network model, optimizing the neural network model by adopting an Adam optimization function.
5. The method for detecting the human body posture as claimed in claim 1, wherein the first upsampling layer and the second upsampling layer adopt 2 times upsampling; the first convolutional neural network layer and the second convolutional neural network layer are both 3 x 3 convolutional neural networks.
6. The method for detecting human body posture as claimed in claim 1, wherein the pre-trained neural network model is run on a mobile terminal; and the human body image to be detected is acquired by the mobile terminal.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811492525.6A CN111291593B (en) | 2018-12-06 | 2018-12-06 | Method for detecting human body posture |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111291593A (en) | 2020-06-16 |
| CN111291593B (en) | 2023-04-18 |
Family
ID=71023035
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201811492525.6A Active CN111291593B (en) | 2018-12-06 | 2018-12-06 | Method for detecting human body posture |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111291593B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114529948A (en) * | 2022-02-25 | 2022-05-24 | 盛景智能科技(嘉兴)有限公司 | Human body posture estimation method and system and electronic equipment |
| CN115578754A (en) * | 2022-09-27 | 2023-01-06 | 福建星网智慧科技有限公司 | Human body key point detection method and device of camera |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106650827A (en) * | 2016-12-30 | 2017-05-10 | 南京大学 | Human body posture estimation method and system based on structure guidance deep learning |
| WO2017133009A1 (en) * | 2016-02-04 | 2017-08-10 | 广州新节奏智能科技有限公司 | Method for positioning human joint using depth image of convolutional neural network |
| CN107704817A (en) * | 2017-09-28 | 2018-02-16 | 成都品果科技有限公司 | A kind of detection algorithm of animal face key point |
| WO2018058419A1 (en) * | 2016-09-29 | 2018-04-05 | 中国科学院自动化研究所 | Two-dimensional image based human body joint point positioning model construction method, and positioning method |
| CN108062526A (en) * | 2017-12-15 | 2018-05-22 | 厦门美图之家科技有限公司 | A kind of estimation method of human posture and mobile terminal |
| CN108647639A (en) * | 2018-05-10 | 2018-10-12 | 电子科技大学 | Real-time body's skeletal joint point detecting method |
| CN113947784A (en) * | 2021-10-28 | 2022-01-18 | 四川长虹电器股份有限公司 | A Lightweight Real-Time Human Pose Estimation Method |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10402628B2 (en) * | 2016-10-10 | 2019-09-03 | Gyrfalcon Technology Inc. | Image classification systems based on CNN based IC and light-weight classifier |
| US11157814B2 (en) * | 2016-11-15 | 2021-10-26 | Google Llc | Efficient convolutional neural networks and techniques to reduce associated computational costs |
| CN107239728B (en) * | 2017-01-04 | 2021-02-02 | 赛灵思电子科技(北京)有限公司 | Unmanned aerial vehicle interaction device and method based on deep learning attitude estimation |
2018
- 2018-12-06 CN CN201811492525.6A patent/CN111291593B/en active Active
Non-Patent Citations (4)
| Title |
|---|
| Real-time 2D multi-person pose estimation on CPU: Lightweight OpenPose; Osokin D; arXiv preprint arXiv:1811; 2018-11-29; full text * |
| Human body posture estimation based on improved CNN and weighted SVDD algorithms; Han Guijin; Computer Engineering and Applications; 2018-03-13 (No. 24); full text * |
| Real-time multi-face key point localization algorithm based on deep residual and feature pyramid networks; Xie Jinheng et al.; Journal of Computer Applications; 2019-08-22 (No. 12); full text * |
| Residual depthwise separable convolution algorithm for handwritten Chinese character recognition; Chen Pengfei et al.; Software Guide; 2018-11-15 (No. 11); full text * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111291593A (en) | 2020-06-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN108520247B (en) | Method, device, terminal and readable medium for identifying object nodes in images | |
| CN110929569B (en) | Face recognition method, device, equipment and storage medium | |
| CN116994140A (en) | Farmland extraction methods, devices, equipment and media based on remote sensing images | |
| CN112819947A (en) | Three-dimensional face reconstruction method and device, electronic equipment and storage medium | |
| CN110533721A (en) | Indoor object 6D pose estimation method based on an enhanced autoencoder | |
| CN113807361B (en) | Neural network, target detection method, neural network training method and related products | |
| CN108596833A (en) | Super-resolution image reconstruction method, device, equipment and readable storage medium | |
| CN110533594A (en) | Model training method, image reconstruction method, storage medium and related device | |
| CN111797834B (en) | Text recognition method and device, computer equipment and storage medium | |
| CN114419060B (en) | Dermoscopic image segmentation method and system | |
| CN111414823B (en) | Detection methods, devices, electronic equipment and storage media for human body feature points | |
| CN114170231A (en) | Image semantic segmentation method and device based on convolutional neural network and electronic equipment | |
| CN114862716B (en) | Image enhancement method, device, equipment and storage medium for face image | |
| CN115410030A (en) | Target detection method, device, computer equipment and storage medium | |
| CN116152926A (en) | Sign language recognition method, device and system based on fusion of visual and skeleton information | |
| CN111291593B (en) | Method for detecting human body posture | |
| CN117372604A (en) | 3D face model generation method, device, equipment and readable storage medium | |
| CN117593187A (en) | Remote sensing image super-resolution reconstruction method based on meta-learning and Transformer | |
| CN114612494A (en) | Design method of mobile robot visual odometry in dynamic scenes | |
| CN118609163A (en) | A lightweight real-time human posture recognition method based on MobileViT | |
| CN119027901B (en) | Training method for an occupancy grid prediction network, and vehicle environment sensing method and device | |
| CN114266948B (en) | Image landmark recognition model training method and device | |
| CN119516115A (en) | A three-dimensional human body point cloud completion method, system, medium and device based on Transformer | |
| CN119251497A (en) | A remote sensing image semantic segmentation method, system, device and medium | |
| CN113077477A (en) | Image vectorization method and device and terminal equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||