CN109033978A

CN109033978A - A CNN-SVM hybrid model gesture recognition method based on an error correction strategy

Info

Publication number: CN109033978A
Application number: CN201810684333.9A
Authority: CN
Inventors: 冯志全; 李健
Original assignee: University of Jinan
Current assignee: University of Jinan
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2018-12-18
Anticipated expiration: 2038-06-28
Also published as: CN109033978B

Abstract

The invention provides a CNN-SVM hybrid model gesture recognition method based on an error correction strategy, which belongs to the field of human-computer interaction. The method first preprocesses the collected gesture data, then automatically extracts features and performs predictive classification to obtain classification results, and finally uses the error correction strategy to correct those results. The method reduces the misrecognition rate between easily confused gestures and improves the recognition rate of static gestures.

Description

A CNN-SVM Hybrid Model Gesture Recognition Method Based on Error Correction Strategy

Technical Field

The invention belongs to the field of human-computer interaction, and in particular relates to a CNN-SVM hybrid model gesture recognition method based on an error correction strategy.

Background

As computers become increasingly widespread, a convenient and natural means of human-computer interaction (HCI) is particularly important to users. Among the many HCI modalities, gestures are natural, concise and intuitive, and have attracted growing attention. They play an important role in a variety of real-world scenarios such as motion-sensing games, sign language recognition, smart wearable devices and intelligent teaching. The purpose of gesture recognition is to design an algorithm that enables a computer to recognize gestures in images or performed by a person, understand their meaning, and thereby support interaction between human and computer. Gestures usually occur in complex environments, so for accurate interaction a gesture recognition algorithm should perform well under varying lighting, viewing angles, backgrounds and other complex conditions.

Traditional gesture recognition algorithms are mainly based on hidden Markov models (HMM) and template matching. An HMM can describe a Markov process with hidden unknown parameters, and the process of gesture recognition can be regarded as a Markov chain over a time series, so the model can be applied to gesture recognition. Template-matching methods use the contour, edges and spatial distribution of a gesture as features to build a gesture template, and apply a template matching algorithm to recognize gestures. Both approaches require manual feature extraction, which depends on considerable experience and is subjective and limited, so salient features are easily overlooked. Traditional methods therefore tend to have limited recognition ability and low efficiency.

A convolutional neural network (CNN) is one of the most widely used models in machine vision and image processing. Through training, a CNN learns local and global features of the input image, which addresses the insufficient feature extraction that comes with hand-crafted features. In recent years CNNs have been applied successfully to image retrieval, face recognition, expression recognition and object detection. CNNs have also been applied to gesture recognition: Jawad Nagi et al. combined max-pooling layers with a convolutional neural network (MPCNN) for gesture recognition and achieved good results, and Takayoushi et al. proposed an end-to-end deep convolutional network that performs gesture recognition with improved accuracy. In gesture recognition applications, however, relatively shallow networks are generally used, and in traditional static gesture recognition, methods based on manual feature extraction are time-consuming and have low recognition rates.

Summary of the Invention

The purpose of the present invention is to solve the problems in the prior art described above by providing a CNN-SVM hybrid model gesture recognition method based on an error correction strategy. The method uses a deeper network that can learn deeper features, reduces the model's misrecognition rate on easily confused gestures, and ultimately recognizes static gestures.

The present invention is achieved through the following technical solution:

A CNN-SVM hybrid model gesture recognition method based on an error correction strategy first preprocesses the collected gesture data, then automatically extracts features and performs predictive classification to obtain classification results, and finally uses an error correction strategy to correct the classification results.

The method includes:

Step 1: preprocess the collected data to obtain training samples and test samples;

Step 2: obtain the CNN-SVM hybrid model;

Step 3: input the test samples into the CNN-SVM hybrid model obtained in Step 2 to obtain the classification results, the probability estimates of the classification results, and the confusion matrix;

Step 4: derive the error correction strategy from the probability estimates and confusion matrix obtained in Step 3, and then use it to correct the classification results.

Step 1 includes:

(11) collecting static gestures and acquiring a depth image and a color image of the hand;

(12) processing the depth image to obtain a mask image;

(13) performing an AND operation on the color image and the mask image to obtain a rough gesture region image;

(14) performing skin color segmentation on the rough gesture region image with a Bayesian skin color model to obtain a segmented image, and dividing the segmented images into two parts, one used as training samples and the other as test samples.

In step (11), the static gestures are collected with a Kinect.

Step 2 is implemented by replacing the final output layer of the CNN classifier with an SVM classifier.

Step 2 includes:

(21) inputting the training samples into the input layer of the CNN classifier and training the CNN classifier until the training process converges or the maximum number of iterations is reached, to obtain a trained CNN model;

(22) inputting the training samples into the trained CNN model for automatic feature extraction to obtain the feature vectors of the training samples;

(23) inputting the feature vectors of the training samples into the SVM classifier for a second round of training, after which the CNN-SVM hybrid model is obtained.

The error correction strategy means specifying a threshold, using the threshold to screen out classification results that are likely wrong, and then correcting the final classification results according to statistics obtained from experiments.

Step 4 includes:

In an N-class problem, let M_i be the threshold used for error correction of all test samples whose classification result is i. M_i is described as follows:

Here M_{i,j} denotes the mean computed over the samples whose predicted class is i but whose true class is j, and M_i is a j-dimensional vector; S_{i,j} denotes the number of samples whose predicted class is i but whose true class is j; S_i denotes the number of test samples predicted as class i; P_n(i) denotes the largest probability estimate of the n-th test sample among all test samples predicted as class i, and P_n(j) denotes the second largest; i is the class with the largest probability estimate, and j is the class with the second largest.

When the probability estimates satisfy the following condition, the class corresponding to the largest probability estimate is changed to the class corresponding to the second largest:

Here w_n(i) denotes the gap between the largest and second largest probability estimates for a sample predicted as class i, numerically equal to P_n(i) - P_n(j), and p_ij denotes the probability, in the confusion matrix, that the classification result is i while the true class is j.
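
The two quantities defined in words above, the gap w_n(i) and the confusion probability p_ij, can be computed directly from per-sample probability estimates and labels. The following is a minimal sketch rather than the patent's implementation: the function names are hypothetical, and normalizing p_ij by the number of samples predicted as i (that is, S_{i,j} / S_i) is an assumption about how the confusion matrix is read.

```python
from collections import Counter


def top_two(probs):
    """Return (i, j, w): the most likely class, the runner-up, and the gap P(i) - P(j)."""
    order = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    i, j = order[0], order[1]
    return i, j, probs[i] - probs[j]


def confusion_probabilities(predicted, actual):
    """Estimate p[(i, j)] as S_ij / S_i (assumed normalization):
    the share of samples predicted as class i whose true class is j."""
    pair_counts = Counter(zip(predicted, actual))
    predicted_counts = Counter(predicted)
    return {(i, j): n / predicted_counts[i] for (i, j), n in pair_counts.items()}


# Toy data: one probability vector, and four predictions with their true labels.
i, j, gap = top_two([0.10, 0.50, 0.40])
p = confusion_probabilities(predicted=[0, 0, 0, 1], actual=[0, 0, 1, 1])
```

In the toy data, one of the three samples predicted as class 0 actually belongs to class 1, so `p[(0, 1)]` is one third.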

Compared with the prior art, the beneficial effect of the present invention is that the method reduces the misrecognition rate between easily confused gestures and improves the recognition rate of static gestures.

Brief Description of the Drawings

Figure 1-1: photographs of nine different gestures

Figure 1-2: depth images corresponding to the nine gestures in Figure 1-1

Figure 2: block diagram of the image preprocessing steps in the method of the present invention

Figure 3: images from the stages of preprocessing

Figure 4: structure of the CNN used in the method of the present invention

Figure 5: test accuracy curves on different data sets

Figure 6: block diagram of the steps of the method of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings:

Combining the advantages of convolutional neural networks and support vector machines, the present invention proposes a hybrid model that automatically extracts features and improves the generalization ability of the model, and uses an error correction strategy based on probability estimation to reduce the misrecognition rate of easily confused gestures.

As shown in Figure 6, the method of the present invention proceeds as follows. First, the gesture data collected by the Kinect is segmented in preprocessing to reduce interference from complex backgrounds and other parts of the human body. Then the hybrid model automatically extracts features and performs predictive classification. Finally, the error correction strategy adjusts the classification decisions. In experiments on the database built for this work, the recognition rate without the error correction strategy is 95.81%, and the average accuracy with the error correction strategy is 97.32%.

Data collection in the method of the present invention is as follows:

The system uses a Kinect 2.0 to collect static gestures, acquiring depth and color images of the hand, and then builds a corresponding gesture database. The gesture library contains 17 classes of gestures in total, made up of static images collected from 300 university students under different lighting and backgrounds. Nine gestures commonly used by people were selected for the present invention, with 3300 images per gesture. Figures 1-1 and 1-2 show the photographs and depth images, respectively, of the nine gestures performed by an operator.

Data preprocessing is as follows:

It is easy to see from the collected gesture images that, although the gesture is clearly visible in the color image, accurate recognition remains very difficult, because the captured gesture is affected by viewing angle, appearance, shape, other parts of the body and complex backgrounds. The depth image has two relevant properties. First, depth information is unaffected by the color and texture of the hand or by lighting, so it is robust and precise. Second, depth values reflect the distance from the hand to the capture device, so depth varies little within the gesture region. Because the depth image has already been segmented during acquisition, it can be used to segment the gesture region of interest in the color image, which reduces interference from other body parts and complex backgrounds in the color image. The segmentation preprocessing steps are shown in Figure 2.

During preprocessing, the collected depth image is binarized. The depth image was already converted to a grayscale depth image during acquisition, that is, the depth values were rescaled to the gray-level range [0, 255]. Because the gesture region was already segmented in the depth map during acquisition, a binary image of the gesture region can be obtained from the gray values. The mask image is produced by thresholding the grayscale image at 128: pixels greater than 128 are set to 1 and pixels less than 128 are set to 0. A logical AND of the mask image with the color image yields only a rough gesture region image, because the Kinect's depth and color images differ in resolution and non-gesture pixels remain around the hand. Skin color segmentation is therefore applied to the rough gesture region using a Bayesian skin color model (see M. J. Jones, et al. Statistical color models with application to skin detection [J]. International Journal of Computer Vision (IJCV), 2002, 46(1): 81-96) to obtain an accurate gesture region image.
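
The thresholding and AND steps can be illustrated on a tiny image. This is a minimal sketch under stated assumptions, not the patent's code: the function names are hypothetical, images are plain nested lists, the depth and color images are assumed to be already aligned to the same resolution, and a pixel exactly equal to 128 (a case the text leaves open) is treated as background.

```python
def depth_mask(gray_depth, threshold=128):
    """Binarize a grayscale depth image: 1 where the value exceeds the threshold, else 0.
    Pixels equal to the threshold map to 0 here (an assumption)."""
    return [[1 if value > threshold else 0 for value in row] for row in gray_depth]


def apply_mask(color, mask):
    """Keep the color pixel where the mask is 1 and replace it with black elsewhere."""
    return [
        [pixel if keep else (0, 0, 0) for pixel, keep in zip(color_row, mask_row)]
        for color_row, mask_row in zip(color, mask)
    ]


# 2x2 toy example: grayscale depth values and the matching RGB pixels.
gray_depth = [[200, 40], [128, 255]]
color = [[(10, 20, 30), (40, 50, 60)], [(70, 80, 90), (100, 110, 120)]]

mask = depth_mask(gray_depth)
rough_hand = apply_mask(color, mask)
```

The result corresponds to the rough gesture region only; the Bayesian skin color segmentation that refines it is not part of this sketch.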

An image was selected at random to check the effectiveness of the segmentation preprocessing; the color image, depth image, mask image, rough gesture region and segmented image are shown in Figures 3-1 to 3-5, respectively.

The segmentation preprocessing clearly removes the influence of complex backgrounds and other parts of the body effectively, and the final Bayesian skin color model accurately preserves the useful information in the gesture region, providing good data for later training.

The hybrid CNN-SVM model is as follows:

SVM classifier: by choosing different kernel functions, a support vector machine maps samples that are not linearly separable in the low-dimensional input space into a high-dimensional feature space where they become linearly separable. Based on the principle of structural risk minimization, it constructs an optimal hyperplane in the feature space and obtains a structured description of the data distribution. This lowers the requirements on data size and data distribution and effectively reduces the error on an independent test set, and the SVM is regarded as one of the most commonly used and best-performing classifiers.

LIBSVM (see Chih-Chung Chang, Chih-Jen Lin. LIBSVM: A library for support vector machines [J]. ACM Transactions on Intelligent Systems and Technology (TIST), 2011, 2(3): 1-27) is used in the experiments to build the SVMs. LIBSVM is a fast and effective software package for classification and regression that uses a one-vs-one strategy to solve multi-class problems. It can not only predict the classification result but also provide class probability information for each test sample. For a k-class problem, the goal is to estimate the probability that a sample belongs to each class:

For the one-vs-one strategy, p_i is obtained by solving the following optimization problem:

where r_ij is a pairwise probability defined as:
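
The three formulas referred to in this passage are not present in this text. As a reference point, and assuming the method follows the cited LIBSVM paper unchanged, that paper defines the probability estimates as shown below. The symbols x (input sample), y (class label), \hat{f} (decision value of the pairwise SVM for classes i and j) and A, B (parameters fitted on training data) come from the LIBSVM paper, not from this document.

```latex
% Class probabilities to be estimated
p_i = P(y = i \mid x), \quad i = 1, \dots, k

% Pairwise coupling: optimization problem solved for p = (p_1, ..., p_k)
\min_{p} \; \frac{1}{2} \sum_{i=1}^{k} \sum_{j : j \neq i}
    \left( r_{ji}\, p_i - r_{ij}\, p_j \right)^2
\quad \text{subject to} \quad p_i \ge 0 \;\; \forall i, \qquad \sum_{i=1}^{k} p_i = 1

% Pairwise class probabilities
r_{ij} \approx P(y = i \mid y = i \text{ or } j,\; x)
       \approx \frac{1}{1 + e^{A \hat{f} + B}}
```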

In the experiments, the SVMs are trained to predict classification results with probabilities. These probability values are used in the error correction of easily confused gestures, to decide whether a classification result is used directly or reclassified by the strategy adopted in the present invention.

CNN classifier: a convolutional neural network is a deep feed-forward neural network that takes the image directly as input. It requires no hand-defined features or feature selection, avoiding the feature selection and feature extraction stages of traditional recognition algorithms, and it also has good fault tolerance, parallel processing ability and self-learning ability.

Instead of the MPCNN (see Chih-Chung Chang, Chih-Jen Lin. LIBSVM: A library for support vector machines [J]. ACM Transactions on Intelligent Systems and Technology (TIST), 2011, 2(3): 1-27), the present invention trains the more complex CNN described in A. Krizhevsky, S. Ilya, and G. E. Hinton. ImageNet classification with deep convolutional neural networks [C]//Advances in Neural Information Processing Systems 2 (NIPS), 2012: 1106-1114; the network structure is shown in Figure 4. The network has 8 layers in total, 5 convolutional layers and 3 fully connected layers, and the last fully connected layer outputs a 9-dimensional softmax expressing the prediction over the 9 classes. The first convolutional layer convolves the 224×224×3 input image with 96 kernels of size 11×11×3 with a stride of 4. The second convolutional layer convolves the response-normalized and pooled output of the first layer with 256 kernels of size 5×5×48. The third convolutional layer convolves the normalized and pooled output of the second layer with 384 kernels of size 3×3×256. The fourth convolutional layer has 384 kernels of size 3×3×192, and the fifth has 256 kernels of size 3×3×192. Each fully connected layer has 4096 neurons. Because the network is complex, the data set is enlarged to counter overfitting: 224×224 patches and their horizontal mirror images are extracted at random from the 256×256 images, and the network is trained on these extracted patches. Without this augmentation the network overfits severely, which would force the use of a smaller network and make deep features unavailable for SVM training.
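
The augmentation described above, random 224×224 patches plus horizontal mirroring, can be sketched as follows. This is an illustration under assumptions, not the patent's code: the function name is hypothetical, the image is a nested list of pixels, and mirroring each patch with probability 0.5 is an assumed choice that the text does not specify.

```python
import random


def random_crop_and_mirror(image, crop=224, rng=random):
    """Return a random crop x crop patch of `image`, mirrored horizontally half the time."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:                 # assumed mirroring probability
        patch = [row[::-1] for row in patch]
    return patch


# A 256x256 stand-in image whose pixels record their own (row, column) position.
image = [[(r, c) for c in range(256)] for r in range(256)]
patch = random_crop_and_mirror(image, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the crop reproducible, which helps when debugging a training pipeline.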

CNN-SVM hybrid model: the hybrid CNN-SVM model replaces the final output layer of the CNN with an SVM. First, the preprocessed images are fed to the input layer and the original CNN is trained until training converges or the maximum number of iterations is reached. The training samples are then passed through the trained CNN to obtain their feature vectors, which are input to the SVM classifier for a second round of training. This yields the CNN-SVM model, and the test samples are input to the model to obtain the classification results.

Error correction strategy based on probability estimation (HECS): in its final prediction, LIBSVM gives, for each sample, an estimate of the probability of belonging to each class, and the class finally selected is the one with the largest probability. Table 1 lists the final probability distributions of some misclassified test samples. The first column of the table is the true class number of the test sample, the second column is its predicted class number, and each remaining column gives the probability that the sample belongs to the class of that column. In the probability estimates of misclassified test samples, the largest estimated probability corresponds to the predicted class and the second largest corresponds to the true class.

Table 1

From LIBSVM's decision rule and the experimental results it follows that, for misclassified samples, the gap between the probability estimates of the predicted class and the true class is small, whereas for correctly classified samples the gap between the predicted class and every other class is relatively large. Based on this observation, the present invention proposes an error correction strategy based on probability estimation to reduce the classification errors arising in this situation. In an N-class problem, M_i is used as the threshold for error correction of all test samples whose predicted class is i. M_i is described as follows:

Here S_i denotes the number of test samples predicted as class i, P_n(i) denotes the largest probability estimate of the n-th test sample among all images predicted as class i, and P_n(j) denotes the second largest. i is the class with the largest probability estimate, and j is the class with the second largest.

When the probability estimates satisfy the following condition, the class corresponding to the largest value is changed to the class corresponding to the second largest value.

Here w_n(i) denotes the gap between the largest and second largest probability estimates for a sample predicted as class i, numerically equal to P_n(i) - P_n(j), and p_ij denotes the probability, in the confusion matrix, that the predicted class is i while the true class is j.
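
The exact threshold formula and correction condition are not present in this text, so the following is only one plausible reading, offered as a sketch. It assumes that M_{i,j} is the mean gap between the predicted-class probability and the true-class probability over labeled samples misclassified as i when the truth is j, and that a prediction is switched to the runner-up j when its gap w_n(i) falls below M_{i,j}. The role of p_ij in the original condition is not modeled here. All function names are hypothetical.

```python
def mean_margins(labeled_samples):
    """labeled_samples: iterable of (probs, true_class) pairs.
    Returns M[(i, j)]: the mean of P(i) - P(j) over samples predicted as i
    whose true class is j, with j != i (assumed form of the threshold)."""
    sums, counts = {}, {}
    for probs, true_class in labeled_samples:
        predicted = max(range(len(probs)), key=lambda k: probs[k])
        if predicted != true_class:
            key = (predicted, true_class)
            sums[key] = sums.get(key, 0.0) + probs[predicted] - probs[true_class]
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def corrected_class(probs, thresholds):
    """Switch the prediction to the runner-up j when the top-two gap is below M[(i, j)]
    (assumed rule); otherwise keep the most likely class i."""
    order = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    i, j = order[0], order[1]
    threshold = thresholds.get((i, j))
    if threshold is not None and probs[i] - probs[j] < threshold:
        return j
    return i


# Two labeled samples that were misclassified as class 0 although they are class 1.
M = mean_margins([([0.5, 0.4, 0.1], 1), ([0.6, 0.3, 0.1], 1)])
close_call = corrected_class([0.52, 0.45, 0.03], M)   # narrow gap between classes 0 and 1
confident = corrected_class([0.90, 0.08, 0.02], M)    # wide gap, prediction kept
```

Under this reading, only class pairs that were actually observed as confusions can trigger a correction, so confident predictions and unrelated class pairs are left untouched.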

The advantages of the model of the present invention are as follows:

The CNN-SVM model is built to compensate for the limitations of the CNN and SVM classifiers and to combine their advantages. The learning method of a convolutional neural network is in theory the same as that of a multilayer perceptron (MLP) (see E. A. Zanaty. Support Vector Machines (SVMs) versus Multilayer Perception (MLP) in data classification [J]. Egyptian Informatics Journal, 2012, 13(3): 177-183), so a CNN is essentially an extension of the MLP. MLP training is based on empirical risk minimization, which keeps minimizing the training error. When backpropagation finds a minimum, training converges there whether or not it is the global minimum, and the solution is not improved further. An SVM, with the distribution of the training set fixed, uses the principle of structural risk minimization to find an optimal hyperplane and minimize the generalization error, so the generalization ability of the SVM is better than that of the MLP.

The advantage of a CNN is that it automatically extracts deep features of the input image, and the features remain invariant when the input image is shifted or distorted to some degree. Hand-crafted feature extraction, by contrast, requires careful design. Traditional hand-crafted methods for gesture recognition (such as those in Jiang Y. An HMM based approach for video action recognition using motion trajectories [C]//IEEE International Conference on Intelligent Control and Information Processing, 2010: 359-464, and Liu Jie, Huang Jin, Han Dongqi, Tian Feng, et al. Template Matching Algorithm for 3D Gesture Recognition [J]. Journal of Computer-Aided Design & Computer Graphics, 2016, 28(8): 1365-1372) attend only to the contour and color of the gesture and ignore local visual features of the hand, such as the bending of the fingers and the distance between fingers, which are very important for gesture recognition. Hand-designed feature extraction easily overlooks and loses features. Extracting features with a CNN can therefore gather more representative and relevant information than traditional methods.

The error correction strategy in effect specifies a threshold, screens out predicted classification results that may be wrong, and then, based on statistics obtained from experiments, corrects the final classification decision with a certain probability. Classifying samples with the CNN-SVM model already gives good results, but it cannot judge accurately when two samples are hard to tell apart because of occlusion or poor image quality. The proposed error correction strategy corrects the classification results of such easily confused samples in the final decision, improving the overall accuracy.

Experiments on the method of the invention and their analysis are as follows:

Experimental environment: in this experiment the gesture recognition model runs on the Windows operating system. The hardware is an Intel(R) Core(TM) i5-6500 processor and an NVIDIA GeForce GT 730 graphics card, with 8 GB of memory and 2 GB of video memory. The CNN is built with Caffe; the SVM classifier uses a radial basis function (Gaussian RBF) kernel and is implemented with the LIBSVM package. All algorithms in the experiment run on the Matlab 2014a platform.
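For reference, the Gaussian RBF kernel has the form K(x, y) = exp(-g * ||x - y||^2), where g is the kernel parameter. The short Python sketch below only illustrates the formula; it is not the LIBSVM or Matlab code used in the experiment, and the function name `rbf_kernel` is hypothetical.

```python
import math


def rbf_kernel(x, y, g):
    """Gaussian RBF kernel K(x, y) = exp(-g * ||x - y||^2)."""
    # Squared Euclidean distance between the two feature vectors.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-g * squared_distance)
```

A larger g makes the kernel value fall off faster with distance, so each training sample influences a smaller neighborhood.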

The experimental results and analysis are as follows:

In the experiments of the invention, the color and depth images are first segmented in preprocessing, and the resulting 29,700 segmented gesture images form the data set, of which 27,000 are used for model training and 2,700 for testing. CNN training uses a maximum of 30,000 iterations. As Figure 5 shows, the system has already converged after about 10,000 iterations; the model obtained after 30,000 iterations is used for testing and reaches an accuracy of 88.35% on the test set. The CNN-SVM model is then built: the last fully connected layer is replaced by an SVM classifier, and the 4096-dimensional feature vectors are fed to the SVM for training and testing. The SVM uses the RBF kernel function. To find the optimal penalty coefficient C and the optimal kernel parameter g, 5-fold cross-validation is applied on the training set. The search ranges of the two parameters are g = [2^3, 2^1, ..., 2^-15] and C = [2^15, 2^13, ..., 2^-5]. In total 11 × 10 = 110 combinations were tried, and C = 64, g = 0.00024414 were finally selected. The hybrid model is then trained with these two parameters; the resulting training accuracy is 99.94%, and the accuracy on the 2,700 test images is 95.81%. Table 2 lists the training and test accuracy of the CNN and of the CNN-SVM on the data set prepared for the invention.
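The parameter search described above can be sketched as a plain grid enumeration. The Python sketch below follows the grid exactly as stated in the text (exponents in steps of 2), which gives 11 values of C and 10 values of g. The scoring function `cv_accuracy` is a hypothetical placeholder for a routine that would run 5-fold cross-validation and return the mean accuracy; it is not part of the original experiment code. The reported optimum, C = 64 and g = 0.00024414, equals 2^6 and approximately 2^-12, which does not fall on the step-2 grid as listed, so the actual search may have used a finer step.

```python
from itertools import product

# Exponent grids as stated in the text: powers of two with a step of 2.
C_EXPONENTS = range(15, -6, -2)   # 15, 13, ..., -5  (11 values)
G_EXPONENTS = range(3, -16, -2)   # 3, 1, ..., -15   (10 values)


def grid_search(cv_accuracy):
    """Return (best_C, best_g, best_score) over the full grid.

    cv_accuracy(C, g) is a caller-supplied function (hypothetical here) that
    would train with 5-fold cross-validation and return the mean accuracy.
    """
    best = None
    for c_exp, g_exp in product(C_EXPONENTS, G_EXPONENTS):
        C, g = 2.0 ** c_exp, 2.0 ** g_exp
        score = cv_accuracy(C, g)
        # Keep the first combination that achieves the highest score.
        if best is None or score > best[2]:
            best = (C, g, score)
    return best
```

Searching on a logarithmic grid keeps the number of combinations small while still covering several orders of magnitude for both parameters.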

As Figure 5 shows, with the same maximum of 30,000 iterations, the color images give the lowest accuracy, reaching at most 37.92%; the depth images are clearly better than the color images, reaching 79.07%; and the preprocessed images give the highest accuracy, 88.35%. The reason is that when unprocessed color images are used directly for training, the training samples contain a large amount of noise (complex background information and information from other parts of the human body). The segmented depth images are free of interference from the background and other body parts, but because the captured depth information is projected onto gray levels in [0, 255] for storage, part of the information is lost. The gestures segmented by the preprocessing of the invention both remove the complex background and the interference of other body parts effectively and retain the complete color information of the gesture region, so that richer features can be extracted for classification during CNN training. Feeding the test samples into the hybrid model for classification prediction yields the confusion matrix shown in Table 2:

Table 2

Over 100 trials, the error correction rate falls mainly in [3%, 5%] and the accuracy is most concentrated in [97%, 98%]; the average error correction rate is 4.12% and the average accuracy is 97.32%.
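The confusion matrix that the error correction strategy draws on, and the probability that a classification result is i while the true value is j, can be tabulated from predicted and true labels. The Python sketch below is a minimal illustration under one assumption: rows index the predicted class and columns the true class, and each row is normalized by the number of samples predicted as that class. The function names are hypothetical.

```python
def confusion_matrix(predicted, actual, num_classes):
    """counts[i][j] = number of samples predicted as class i whose true class is j."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(predicted, actual):
        counts[p][t] += 1
    return counts


def confusion_probabilities(counts):
    """Normalize each row: p[i][j] = counts[i][j] / (samples predicted as i)."""
    probs = []
    for row in counts:
        total = sum(row)
        # A class that was never predicted gets an all-zero row.
        probs.append([c / total if total else 0.0 for c in row])
    return probs
```

For example, if three samples are predicted as class 0 and one of them truly belongs to class 1, the entry for (0, 1) is 1/3.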

Table 3 gives the gesture recognition accuracy of the method of the invention and of other methods on the provided data set. Unlike the method of the invention, the document "Yamashita T, Watasue T. Hand posture recognition based on bottom-up structured deep convolutional neural network with curriculum learning[C]//Image Processing (ICIP), 2014 IEEE International Conference on. IEEE, 2014: 853-857" uses a relatively simple convolutional neural network, combining max-pooling layers with a convolutional neural network to form an MPCNN, and obtains a recognition accuracy of 68.89% on the test set. The document "Shao-Zi Li, Bin Yu, Wei Wu, Song-Zhi Su, Rong-Rong Ji. Feature learning based on SAE-PCA network for human gesture recognition in RGBD images[J]. Neurocomputing, 2015, 151(2): 565-573" uses an end-to-end convolutional neural network and obtains a gesture recognition accuracy of 85.43%. The document "Xiao-Xiao Niu, Ching Y. Suen. A novel hybrid CNN-SVM classifier for recognizing handwritten digits[J]. Pattern Recognition, 2012, 45(4): 1318-1325" first segments the gesture using depth information and skin color information, then extracts features with an SAE-PCA model based on feature learning, and finally classifies with an SVM classifier, obtaining a gesture recognition accuracy of 93.32%. The accuracy of the different gesture recognition methods on the data set of the invention is shown in Table 3:

Table 3

As can be seen, the method of the invention clearly improves recognition accuracy compared with the other methods.

The method of the invention first segments the depth data and color data of the gesture in preprocessing, eliminating the influence of the human body and the complex background in the color data; it then extracts gesture features with a convolutional neural network, avoiding the complex process of hand-designing features from the contour and geometric properties of the gesture; next it obtains probability estimates for the gesture through a support vector machine; finally, based on these probability estimates and the confusion matrix obtained from experiments, it applies an error correction strategy to correct the classification results of the model. Extensive experimental results show that the method recognizes static gestures effectively, improves to some extent the ability of the CNN-SVM model to classify easily confused gestures, and raises the overall final recognition accuracy.

The above technical solution is only one embodiment of the invention. On the basis of the application methods and principles disclosed by the invention, those skilled in the art can readily make various improvements or variations, which are not limited to the methods described in the specific embodiments above. The foregoing description is therefore only preferred and is not limiting.

Claims

1. A CNN-SVM hybrid model gesture recognition method based on an error correction strategy, characterized in that: the method first preprocesses the collected gesture data, then automatically extracts features and performs predictive classification to obtain classification results, and finally corrects the classification results using the error correction strategy.

2. The CNN-SVM hybrid model gesture recognition method based on an error correction strategy according to claim 1, characterized in that the method comprises:

Step 1: preprocessing the collected data to obtain training samples and test samples;

Step 2: obtaining a CNN-SVM hybrid model;

Step 3: inputting the test samples into the CNN-SVM hybrid model obtained in step 2 to obtain the classification results together with the probability estimates of the classification results and a confusion matrix;

Step 4: obtaining the error correction strategy based on the probability estimates and the confusion matrix obtained in step 3, and then correcting the classification results using the error correction strategy.

3. The CNN-SVM hybrid model gesture recognition method based on an error correction strategy according to claim 2, characterized in that the operation of step 1 comprises:

(11) acquiring a static gesture to obtain a depth image and a color image of the hand respectively;

(12) processing the depth image to obtain a mask image;

(13) performing an AND operation on the color image and the mask image to obtain a coarse gesture area image;

(14) performing skin color segmentation on the coarse gesture area image using a Bayesian skin color model to obtain segmented images, and dividing the segmented images into two parts, one part serving as training samples and the other as test samples.

4. The CNN-SVM hybrid model gesture recognition method based on an error correction strategy according to claim 3, characterized in that: in step (11), the static gesture is acquired using a Kinect.

5. The CNN-SVM hybrid model gesture recognition method based on an error correction strategy according to claim 3, characterized in that: step 2 is implemented by replacing the last output layer of the CNN classifier with an SVM classifier.

6. The CNN-SVM hybrid model gesture recognition method based on an error correction strategy according to claim 5, characterized in that the operation of step 2 comprises:

(21) inputting the training samples into the input layer of the CNN classifier and training the CNN classifier until the training process converges or reaches the maximum number of iterations, obtaining a trained CNN model;

(22) inputting the training samples into the trained CNN model for automatic feature extraction to obtain the feature vectors of the training samples;

(23) inputting the feature vectors of the training samples into the SVM classifier for a second round of training, the CNN-SVM hybrid model being obtained when training is complete.

7. The CNN-SVM hybrid model gesture recognition method based on an error correction strategy according to claim 2, characterized in that the error correction strategy means: specifying a threshold, screening out erroneous classification results according to the threshold, and then correcting the final classification results based on statistics obtained from experiments.

8. The CNN-SVM hybrid model gesture recognition method based on an error correction strategy according to claim 1, characterized in that the operation of step 4 comprises:

In an N-class problem, let M_i be the threshold used for error correction of all test samples whose classification result is i; M_i is described as follows:

where M_i,j denotes the mean computed over the samples whose prediction is i but whose true value is j, and M_i is a j-dimensional vector; S_i,j denotes the number of all samples whose prediction is i but whose true value is j; S_i denotes the number of all test samples predicted as class i; P_n(i) denotes the maximum of the probability estimates of the n-th test sample among all test samples predicted as class i, and P_n(j) denotes the second-largest value; i denotes the class to which the maximum of the classification estimates belongs, and j denotes the class to which the second-largest value belongs;

When the probability estimates satisfy the following condition, the class corresponding to the maximum of the probability estimates is changed to the class corresponding to the second-largest value:

where w_n(i) denotes the distance between the maximum probability estimate for a prediction of class i and the second-largest probability estimate, i.e. it is numerically equal to P_n(i) - P_n(j), and p_ij denotes the probability in the confusion matrix that the classification result is i but the true value is j.