CN114782992A

CN114782992A - Super-joint and multi-mode network and behavior identification method thereof

Info

Publication number: CN114782992A
Application number: CN202210464698.7A
Authority: CN
Inventors: 侯振杰; 施海勇; 钟卓锟; 尤凯军
Original assignee: Changzhou University
Current assignee: Changzhou University
Priority date: 2022-04-29
Filing date: 2022-04-29
Publication date: 2022-07-22
Anticipated expiration: 2042-04-29
Also published as: CN114782992B

Abstract

The invention relates to the technical field of neural networks, in particular to a super-joint and multi-mode network and a behavior identification method thereof, which comprises the following steps: collecting a human body depth map, extracting features of the depth map by using a DMMS flow, and calculating depth data prediction scores; collecting a human body skeleton sequence, respectively extracting original joint and super joint data, combining super joints and common joints to construct skeleton information, sending the skeleton information into a structured space-time characteristic learning model to obtain static and dynamic joint data streams and static and dynamic super joint data streams, and performing self-adaptive weight fusion on the original joint and super joint data streams to respectively obtain joint data prediction scores and super joint prediction scores; and adding the classification prediction scores of the DMMS flow and the skeleton flow to generate a final prediction score. The invention learns the abundant texture information of human body parts in space from the depth map and learns the abundant space-time characteristics in the motion posture change from the skeleton sequence.

Description

A hyper-joint and multimodal network and its method in behavior recognition

技术领域technical field

本发明涉及神经网络技术领域，尤其涉及一种超关节与多模态网络及其在行为识别方法。The invention relates to the technical field of neural networks, in particular to a hyper-joint and multi-modal network and a method for recognizing it in action.

背景技术Background technique

人体行为识别因其具有广泛的应用前景，已经成为机器学习和计算机视觉领域的热点话题，具有十分重要的理论研究价值。与静态图像的识别不同，行为识别要综合考虑连续变化的图像，从一系列静态人体姿态图像组成的动作序列中构建人体行为的时空特征表示实现对动作的分类识别。目前，人体行为识别已经广泛应用于辅助人机交互、体育运动分析、智能监控和虚拟现实等领域。Human action recognition has become a hot topic in the field of machine learning and computer vision because of its broad application prospects, and has very important theoretical research value. Different from the recognition of static images, behavior recognition should comprehensively consider continuously changing images, and construct the spatiotemporal feature representation of human behavior from a series of motion sequences composed of static human pose images to realize the classification and recognition of actions. At present, human behavior recognition has been widely used in assisting human-computer interaction, sports analysis, intelligent monitoring and virtual reality and other fields.

目前，行为识别算法主要分为两类：At present, behavior recognition algorithms are mainly divided into two categories:

(1)基于传统方法的行为识别。在传统方法中，学者们提取手工特征作为有效的时空特征描述符。Donglu Li等将深度数据的特性和点云数据结合起来并利用深度信息重建三维点云空间提升深度数据的空间信息表达能力。然后他们将点云空间均匀划分成许多网格空间，计算网格空间中的点云数量的比例作为空间分布特征进行动作识别。Liu等在三维的骨架关节轨迹的二维投影平面上建立分布扇区，然后使用直方图统计特定投影的分布扇区作为基于骨架的动作描述符HDS-SP去表征空间和时间信息。(1) Behavior recognition based on traditional methods. In traditional methods, scholars extract handcrafted features as effective spatiotemporal feature descriptors. Donglu Li et al. combined the characteristics of depth data with point cloud data and used depth information to reconstruct 3D point cloud space to improve the spatial information expression ability of depth data. Then they evenly divided the point cloud space into many grid spaces, and calculated the proportion of the number of point clouds in the grid space as a spatial distribution feature for action recognition. Liu et al. established distribution sectors on the 2D projection plane of 3D skeleton joint trajectories, and then used histograms to count the distribution sectors of specific projections as skeleton-based motion descriptor HDS-SP to characterize spatial and temporal information.

(2)基于深度学习的行为识别。基于提取手工特征进行动作识别的方法在实现上有很高的复杂度。传统方法需要进行特征工程，对运动特征的分析和设计极大地影响到分类精度。在基于深度学习的动作识别方法中，首先使用特征学习网络学习动作序列的特征表示，然后将特征送入分类网络进行分类。Shengquan Wang等人提出了一种多级深度特征融合增强网络(MDFFEN),首先从双流网络的子模块中提取时空特征形成多深度特征，然后将这些时空特征融合后进行分类。Haoran Wang等提出骨架边缘运动网络(SEMN)，在骨架边缘运动上挖掘人体的运动信息，将骨架边缘的角度变化和相应的关节运动结合来确定骨架边缘的运动。(2) Behavior recognition based on deep learning. The method of action recognition based on extracting handcrafted features has high complexity in implementation. Traditional methods require feature engineering, and the analysis and design of motion features greatly affect the classification accuracy. In the action recognition method based on deep learning, the feature learning network is first used to learn the feature representation of the action sequence, and then the features are sent to the classification network for classification. Shengquan Wang et al. proposed a multi-level deep feature fusion enhancement network (MDFFEN), which firstly extracted spatiotemporal features from sub-modules of the two-stream network to form multi-depth features, and then fused these spatiotemporal features for classification. Haoran Wang et al. proposed the Skeleton Edge Motion Network (SEMN), which mines the motion information of the human body on the skeleton edge motion, and combines the angle change of the skeleton edge with the corresponding joint motion to determine the motion of the skeleton edge.

虽然深度图像携带密集的纹理信息，但是丰富的纹理信息也带来了较高的信息冗余使得它们对人体关节的局部姿态变化不够敏感。人体骨架数据因其对姿态变化的高度敏感性已经成为了人体动作识别领域的热门研究对象。与深度图像数据相比，骨架数据提供的三维空间结构信息可以大大减少数据采集对象的身高、体型轮廓和着装等因素的影响。虽然骨架关节是专门为捕获人体的空间结构信息而设计的，但是关节点携带的信息只能表示位置却不能表示与邻接点的依赖关系。以往的方法通常使用关节的位置作为关节的空间结构信息表示，而忽略了关节之间的依赖关系。Although depth images carry dense texture information, the rich texture information also brings high information redundancy, making them insensitive to local pose changes of human joints. Human skeleton data has become a hot research object in the field of human action recognition due to its high sensitivity to pose changes. Compared with depth image data, the three-dimensional spatial structure information provided by skeleton data can greatly reduce the influence of factors such as height, body contour, and clothing of the data collection object. Although the skeleton joints are specially designed to capture the spatial structure information of the human body, the information carried by the joint points can only represent the position but not the dependencies with the adjacent points. Previous methods usually use the position of the joint as the spatial structure information representation of the joint, while ignoring the dependencies between the joints.

发明内容SUMMARY OF THE INVENTION

针对现有算法的不足，本发明提出了一种基于骨架依赖关系和多模态时空特征表示的人体行为识别方法，从深度图中学习人体部位在空间中丰富的纹理信息，从骨架序列中学习运动姿态变化中丰富的时空特征。In view of the shortcomings of the existing algorithms, the present invention proposes a human behavior recognition method based on skeleton dependencies and multi-modal spatiotemporal feature representation, which learns the rich texture information of human body parts in space from the depth map, and learns from the skeleton sequence. Rich spatiotemporal features in motion pose changes.

本发明所采用的技术方案是：一种超关节与多模态网络及其在行为识别方法包括以下步骤：The technical scheme adopted by the present invention is: a hyper-joint and multi-modal network and its behavior recognition method include the following steps:

S1、采集人体深度数据，输入DMMS流对深度图进行特征提取，并计算DMMs流预测分数；S1. Collect human body depth data, input the DMMS stream to perform feature extraction on the depth map, and calculate the DMMs stream prediction score;

S2、采集人体骨架序列，并分别提取原始关节和超关节数据，结合超关节和普通关节构建骨架信息，分别计算超关节和普通关节的静态和运动数据并送入结构化时空特征学习模型得到静态和动态关节数据流、静态和动态超关节数据流，对来自普通关节和超关节的静态和运动数据分别进行特征级融合，并计算普通关节和超关节的预测分数。S2. Collect the human skeleton sequence, extract the original joint and hyper-joint data respectively, combine the hyper-joint and ordinary joint to construct skeleton information, calculate the static and motion data of the hyper-joint and ordinary joint respectively, and send it into the structured spatiotemporal feature learning model to obtain the static state and dynamic joint data streams, static and dynamic hyper-joint data streams, feature-level fusion of static and motion data from common joints and hyper-joints, respectively, and calculation of prediction scores for common and hyper-joints.

S3、首先对原始关节和超关节的分类分数进行自适应权重融合的得到骨架流的预测分数；S3. First, perform adaptive weight fusion on the classification scores of the original joint and the super-joint to obtain the prediction score of the skeleton flow;

S21、根据普通关节的依赖关系进行超关节构建，并结合超关节和普通关节构建骨架信息；S21 , constructing super-joints according to the dependencies of common joints, and constructing skeleton information by combining super-joints and common joints;

进一步的，超关节构建包括：Further, the hyperjoint construction includes:

首先，计算RWE和RWH，分别为RW指向RE和RH的方向向量，计算公式如下：First, calculate RWE and RWH, which are the direction vectors of RW pointing to RE and RH, respectively. The calculation formula is as follows:

其中,RW，RE和RH分别表示人体骨架中的右手腕，右手肘和右手；Among them, RW, RE and RH represent the right wrist, right elbow and right hand in the human skeleton, respectively;

然后，计算两个相交向量的笛卡尔积得到向量所在平面的法向量n，计算公式如下：Then, the Cartesian product of the two intersecting vectors is calculated to obtain the normal vector n of the plane where the vectors are located. The calculation formula is as follows:

然后，以RE为起点的两个骨骼的方向向量，计算两个方向向量的夹角，计算公式如下：Then, using the direction vectors of the two bones with RE as the starting point, calculate the angle between the two direction vectors. The calculation formula is as follows:

最后得到超关节数据向量HyperJoint，公式为：Finally, the hyperjoint data vector HyperJoint is obtained, and the formula is:

进一步的，结合超关节和普通关节构建骨架信息包括如下步骤：Further, constructing skeleton information by combining hyper-joints and common joints includes the following steps:

构成人体的骨架图用变量G(V,H)表示，公式为：The skeleton diagram that constitutes the human body is represented by the variable G(V,H), and the formula is:

其中，V是构成人体骨架图的空间节点集合，H是给人体骨架关节上建立的依赖关系；Among them, V is the set of spatial nodes that constitute the human skeleton graph, and H is the dependency established on the joints of the human skeleton;

骨架的动态信息被定义为骨架序列上相邻骨架上骨架元素之间的差，在时间域上给骨架姿态建立联系，公式如下：The dynamic information of the skeleton is defined as the difference between the skeleton elements on the adjacent skeletons on the skeleton sequence, and establishes a connection to the skeleton pose in the time domain. The formula is as follows:

S22、将骨架信息送入结构化时空特征学习模型，包括如下步骤：S22, sending the skeleton information into the structured spatiotemporal feature learning model, including the following steps:

利用卷积神经网络的时空特征学习模块用于在骨架信息上人体动作特征的提取；The spatiotemporal feature learning module of convolutional neural network is used to extract human action features from skeleton information;

进一步的，时空特征学习模块包括节点时序特征学习块和空间全局特征学习块，通过叠加时空块，构建特征学习模块，从骨架数据中学习原始骨架节点和节点间依赖关系的有效时空特征；Further, the spatiotemporal feature learning module includes a node timing feature learning block and a spatial global feature learning block. By stacking the spatiotemporal blocks, a feature learning module is constructed to learn the effective spatiotemporal features of the original skeleton nodes and the dependencies between the nodes from the skeleton data;

进一步的，节点时序特征学习块构建包括如下步骤：Further, the construction of the node timing feature learning block includes the following steps:

首先，使用卷积块注意力模块对输入网络的张量增加注意力，表示提升输入数据的特征表达能力；First, use the convolution block attention module to increase attention to the tensors of the input network, which means to improve the feature expression ability of the input data;

然后，使用卷积核大小为(1×1)的卷积层在网络中自动学习骨架关节的位置特征；公式化定义为：Then, the location features of the skeleton joints are automatically learned in the network using a convolutional layer with a kernel size of (1×1); the formulation is defined as:

f_T(X)＝σ(φ(Attn(X))) (9)f _T (X)=σ(φ(Attn(X))) (9)

其中，X表示一个三阶张量

φ表示由卷积层结构的函数，σ表示ReLU激活函数，而Attn表示一个注意力区块；where X represents a third-order tensor

φ represents the function structured by the convolutional layer, σ represents the ReLU activation function, and Attn represents an attention block;

最后，使用大小为(3×1)的卷积核将骨架关节位置上学习到的特征信息在时间上聚合起来，输出是一个三阶张量

通过如下公式进行定义：Finally, a convolution kernel of size (3×1) is used to aggregate the learned feature information on the skeleton joint positions in time, and the output is a third-order tensor

It is defined by the following formula:

A＝φ(f_T(X)) (10)。A=φ(f _T (X)) (10).

进一步的，空间全局特征学习块构建包括：Further, the construction of the spatial global feature learning block includes:

使用大小为(3×3)的卷积核在网络上自动学习骨架节点在空间域上协同运动的语义特征，卷积操作能够聚合构成人体的元素的空间结构信息，公式化定义为：A convolution kernel of size (3×3) is used to automatically learn the semantic features of the coordinated movement of skeleton nodes in the spatial domain on the network. The convolution operation can aggregate the spatial structure information of the elements constituting the human body, which is formulated as:

f_S(A)＝φ(A') (11)f _S (A)=φ(A') (11)

其中，A'表示矩阵的变换，将一个三阶张量

变换成

Among them, A' represents the transformation of the matrix, and a third-order tensor

transform into

S3、将DMMS流和骨架流的分类预测分数相加生成最终的预测分数；S3. Add the classification prediction scores of the DMMS flow and the skeleton flow to generate the final prediction score;

本发明的有益效果：Beneficial effects of the present invention:

1、提出骨架节点依赖关系来重新建模人体骨架序列的空间结构信息表示，将构成人体骨架的孤立节点与邻接点建立显示的依赖关系，并将人体骨架序列和骨架节点依赖关系序列作为网络的输入进行时空特征学习。1. The skeleton node dependency relationship is proposed to re-model the spatial structure information representation of the human skeleton sequence. The isolated nodes and adjacent points that constitute the human skeleton are established to display the dependency relationship, and the human skeleton sequence and the skeleton node dependency sequence are used as the network. Input for spatiotemporal feature learning.

2、设计了一种新的时空特征学习网络学习骨架序列和骨架节点依赖关系的时空特征；在网络中，构建了一个节点时序特征学习块(JTFLB)，探索人体骨架上的元素在时序上的关系；由于人体运动的整体性和协同性的特点，我们构建了了一个空间全局特征学习块(SGFLB)，将构成人体骨架的元素在时间域学习到的语义特征聚合起来。2. A new spatiotemporal feature learning network is designed to learn spatiotemporal features of skeleton sequences and skeleton node dependencies; in the network, a node timing feature learning block (JTFLB) is constructed to explore the temporal relationship of elements on the human skeleton. Due to the holistic and synergistic characteristics of human motion, we construct a Spatial Global Feature Learning Block (SGFLB) to aggregate the semantic features learned in the temporal domain from the elements constituting the human skeleton.

3、从人体的深度序列和骨架序列中协同提取丰富的时空特征，并通过融合不同模态的分数来提高动作识别性能；为了验证本发明的有效性，在不同规模的公共数据集上进行了评估，和单一模态相比，通过多模态融合后显著地提升了识别效果。3. Synergistically extract rich spatiotemporal features from the depth sequence and skeleton sequence of the human body, and improve the performance of action recognition by fusing the scores of different modalities; in order to verify the effectiveness of the present invention, experiments are carried out on public datasets of different scales. Evaluation, compared with single modality, the recognition effect is significantly improved after multimodal fusion.

附图说明Description of drawings

图1是本发明的超关节与多模态网络及其在行为识别方法结构框图；Fig. 1 is the super-joint and multimodal network of the present invention and its structural block diagram in the behavior recognition method;

图2是本发明的计算关节姿态方向示意图；Fig. 2 is the schematic diagram of calculating joint posture direction of the present invention;

图3是本发明的计算关节夹角示意图；3 is a schematic diagram of calculating a joint angle of the present invention;

图4是本发明的建立超关节示意图；Fig. 4 is the establishment super-joint schematic diagram of the present invention;

图5是本发明的关节时序特征学习块图；Fig. 5 is the joint timing feature learning block diagram of the present invention;

图6是本发明的空间全局特征学习块图；Fig. 6 is the spatial global feature learning block diagram of the present invention;

图7是本发明的UTD-MHAD和NTU RGB+D数据集的帧数统计；Fig. 7 is the frame number statistics of UTD-MHAD of the present invention and NTU RGB+D data set;

图8是本发明的基础模型和融合模型在UTD-MHAD上的差异；Fig. 8 is the difference between the basic model of the present invention and the fusion model on UTD-MHAD;

图9是本发明的基础模型和融合模型在NTU RGB+D CS上的差异；Fig. 9 is the difference between the basic model of the present invention and the fusion model on NTU RGB+D CS;

图10是本发明的基础模型和融合模型在NTU RGB+D CV上的差异。FIG. 10 is the difference between the basic model of the present invention and the fusion model on NTU RGB+D CV.

具体实施方式Detailed ways

下面结合附图和实施例对本发明作进一步说明，此图为简化的示意图，仅以示意方式说明本发明的基本结构，因此其仅显示与本发明有关的构成。The present invention will be further described below with reference to the accompanying drawings and embodiments. This figure is a simplified schematic diagram, and only illustrates the basic structure of the present invention in a schematic manner, so it only shows the structure related to the present invention.

为了评估本发明方法的有效性，基于深度图和骨架信息的公共数据集上进行实验；大型公共数据集能够为模型提供更广泛的训练数据，使模型更强壮；为了验证本发明方法的鲁棒性，在数据集的选择上采用了经典的小型数据集，因此，选取几个具有截然不同规模的数据集上进行了实验：UTD-MHAD和NTU-RGB+D。In order to evaluate the effectiveness of the method of the present invention, experiments are carried out on a public dataset based on the depth map and skeleton information; a large public dataset can provide a wider range of training data for the model, making the model stronger; in order to verify the robustness of the method of the present invention Due to the nature of the dataset, a classic small dataset is used in the selection of the dataset. Therefore, experiments are conducted on several datasets with distinct scales: UTD-MHAD and NTU-RGB+D.

本发明是基于PyTorch框架建立的，其中Python版本为3.7.0，pytorch版本为1.10.1；本实验的硬件平台为台式机，其中，主板是微星B460M MORTAR，CPU是Intel i710700，主频为2.9GHz，内存是16GB，操作系统是Windows 10专业版，GPU资源NVIDIA TeslaV100，显存32GB；实验用到的软件工具为PyCharm，Anaconda3。The invention is established based on the PyTorch framework, wherein the Python version is 3.7.0, and the pytorch version is 1.10.1; the hardware platform of this experiment is a desktop computer, wherein the main board is MSI B460M MORTAR, the CPU is Intel i710700, and the main frequency is 2.9 GHz, the memory is 16GB, the operating system is Windows 10 Professional Edition, the GPU resource is NVIDIA TeslaV100, and the video memory is 32GB; the software tools used in the experiment are PyCharm and Anaconda3.

UTD-MHAD数据集一共包括861个动作样本，并且使用固定位置的采集设备为每个动作样本提供从一个视角捕获的RGB、深度图、骨架序列和惯性传感器序列四种数据；包括27个动作类别，每个动作类被由8名受试人分别重复3-4次；为了便于与现有技术做对比，将受试人编号为1,3,5,7的动作序列被用来做训练集，其余的用来做测试集。The UTD-MHAD dataset includes a total of 861 action samples, and each action sample is provided with four kinds of data, RGB, depth map, skeleton sequence, and inertial sensor sequence captured from a viewpoint using a fixed-position acquisition device; including 27 action categories , each action class was repeated 3-4 times by 8 subjects; in order to facilitate the comparison with the prior art, the action sequences of subjects numbered 1, 3, 5, and 7 were used as the training set , and the rest are used as the test set.

NTU RGB+D属于大型数据集，并且提供更具有挑战性的动作样本和更多的模态信息；NTU RGB+D数据集一共包括56880个动作样本，并且使用固定不同位置的采集设备为每个样本提供从三个视角捕获的RGB、深度图、骨架序列和红外辐射视频四种数据；包括60个动作类别，每个动作类别由40名受试人完成1-2次；与UTD-MHAD采集方式不同，NTU RGB+D提供采集设备在不同水平高度和距离采集的17种多视角、多模态的数据；在数据集上使用了两种标准的实验策略，分为跨受试人和跨视角；在跨受试人策略中，首先将所有的受试人分成两组，其中受试人编号为1，2，4，5，8，9，13，14，15，16，17，18，19，25，27，28，31，34，35和38的动作序列被用来做训练集，其余的用来做测试集。在跨视角的策略中，首先将采集设备中不同视角的相机分成两组，其中相机编号为2和3采集的37920个动作序列作为训练集，相机编号为1采集的18960个动作序列作为测试集。NTU RGB+D belongs to a large dataset, and provides more challenging action samples and more modal information; NTU RGB+D dataset includes a total of 56880 action samples, and uses fixed different location acquisition devices for each The sample provides four kinds of data, RGB, depth map, skeleton sequence and infrared radiation video captured from three viewpoints; includes 60 action categories, each action category is completed 1-2 times by 40 subjects; collected with UTD-MHAD In different ways, NTU RGB+D provides 17 kinds of multi-view and multi-modal data collected by acquisition equipment at different levels and distances; two standard experimental strategies are used in the data set, which are divided into cross-subject and cross- Perspective; in a cross-subject strategy, all subjects are first divided into two groups, where subjects are numbered 1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18 , 19, 25, 27, 28, 31, 34, 35 and 38 action sequences are used as training set and the rest are used as test set. In the cross-perspective strategy, the cameras with different perspectives in the acquisition device are firstly divided into two groups, in which the 37920 action sequences collected by the cameras numbered 2 and 3 are used as the training set, and the 18960 action sequences collected by the camera number 1 are used as the test set. .

如图1所示，一种超关节与多模态网络及其在行为识别方法，包括以下步骤：As shown in Figure 1, a hyper-joint and multimodal network and its behavior recognition method, including the following steps:

图1中两个流，即图1上半图：DMMs流和图1下半图：骨架流，图1(a)分别计算原始关节和超关节的静态和运动数据；图1(b)通过堆叠JTFLB和SGFLB建立结构化时空特征学习模块去学习来自骨架序列的时空特征表示；图1(c)对来自关节和超关节的静态和运动数据分别进行特征级融合；图1(d)首先对原始关节和超关节的预测分数进行自适应权重融合，然后将两个流的预测分数相加生成最终的预测分数；There are two flows in Fig. 1, namely the upper half of Fig. 1: DMMs flow and the lower half of Fig. 1: skeleton flow, Fig. 1(a) calculates the static and motion data of original joints and super-joints respectively; Fig. 1(b) passes Stacking JTFLB and SGFLB builds a structured spatiotemporal feature learning module to learn spatiotemporal feature representations from skeleton sequences; Figure 1(c) performs feature-level fusion of static and motion data from joints and hyper-joints, respectively; The prediction scores of the original joint and super-joint are fused with adaptive weights, and then the prediction scores of the two streams are added to generate the final prediction score;

S1采集人体深度图，利用DMMS流对深度图进行特征提取，并计算深度数据预测分数；S1 collects the depth map of the human body, uses the DMMS stream to perform feature extraction on the depth map, and calculates the depth data prediction score;

S2、采集人体骨架序列，并分别提取原始关节和超关节数据，送入结构化时空特征学习模型，分别得到静态和动态关节数据流、静态和动态超关节数据流，并将原始关节和超关节的数据流进行自适应权重融合，分别得到关节数据预测分数和超关节预测分数；S2. Collect the human skeleton sequence, extract the original joint and hyper-joint data respectively, and send it to the structured spatiotemporal feature learning model to obtain static and dynamic joint data streams, static and dynamic hyper-joint data streams, respectively, and combine the original joint and hyper-joint data. Adaptive weight fusion is performed on the data stream of the data stream, and the prediction score of joint data and the prediction score of hyper-joint are obtained respectively;

进一步的，给定一个表示运动过程的人体骨架序列{S₁,S₂,...,S_T}，计算一个动作描述符，能够更好地描述动作的变化；人体骨架序列看作函数

它表示人体骨架的空间结构在时间维度t＝1,2,...,T上的变化；然而，构成人体骨架的关节仅携带位置信息，因此，在人体中自然连接的关节间建立依赖关系去获得关节位置以外的高级信息。Further, given a human skeleton sequence {S ₁ , S ₂ ,..., S _T } representing the motion process, an action descriptor is calculated, which can better describe the change of the action; the human skeleton sequence is regarded as a function

It represents the change of the spatial structure of the human skeleton in the time dimension t=1,2,...,T; however, the joints that make up the human skeleton only carry position information, so a dependency relationship is established between the joints that are naturally connected in the human body to get advanced information beyond joint position.

在骨架序列表示的人体行为中，关节是主要的运动部分，如图2所示，RW，RE和RH分别表示人体骨架中的右手腕，右手肘和右手，他们在人体上是自然连接，以右手腕关节的运动为例，RW，RE和RH的坐标在空间中确定一个平面，平面的法向量确定方向，计算该平面的法向量作为以右手腕关节为中心的局部姿态的方向描述符；In the human behavior represented by the skeleton sequence, joints are the main moving parts. As shown in Figure 2, RW, RE and RH represent the right wrist, right elbow and right hand in the human skeleton, respectively. Taking the movement of the right wrist joint as an example, the coordinates of RW, RE and RH determine a plane in space, the normal vector of the plane determines the direction, and the normal vector of the plane is calculated as the direction descriptor of the local posture centered on the right wrist joint;

对于不同的动作，骨骼运动的幅度是不同的，如图3所示，当右手腕关节运动时，关节连接的两个骨骼的夹角也会发生变化，因此，计算关节连接的两个骨骼的夹角作为运动幅度的描述符；For different actions, the range of bone movement is different. As shown in Figure 3, when the right wrist joint moves, the angle between the two bones connected by the joint will also change. The included angle serves as a descriptor of the magnitude of motion;

首先我们需要计算以RE为起点的两个骨骼的方向向量，如公式(2)所示；然后，计算两个向量的夹角，计算公式如下：First, we need to calculate the direction vector of the two bones with RE as the starting point, as shown in formula (2); then, calculate the angle between the two vectors, and the calculation formula is as follows:

由于人体局部姿态描述符和人体局部肢体运动描述符是汇聚相邻关节信息得到的，将他们组合到一起构成关节依赖关系；关节依赖关系对应于具体的关节，为了和关节位置区分，将表示关节依赖关系的向量叫做HyperJoint，HyperJoint的计算如下所示：Since the human body local pose descriptor and the human body local limb motion descriptor are obtained by aggregating the information of adjacent joints, they are combined together to form a joint dependency relationship; the joint dependency relationship corresponds to a specific joint, in order to distinguish it from the joint position, it will represent the joint The vector of dependencies is called HyperJoint, and the calculation of HyperJoint is as follows:

构建的HyperJoint与骨架数据中的关节有显著不同，HyperJoint构成的骨架序列看作函数

原始骨架序列仅仅表示了人体骨架关节位置在时间维度t＝1,2,...,T上的变化，而HyperJoint中聚合了来自邻接点的信息，表示局部姿态方向和局部肢体运动幅度。The constructed HyperJoint is significantly different from the joints in the skeleton data, and the skeleton sequence formed by the HyperJoint is regarded as a function

The original skeleton sequence only represents the changes of the joint positions of the human skeleton in the time dimension t=1, 2, ..., T, while the information from the adjacent points is aggregated in HyperJoint, which represents the local pose direction and the local limb motion range.

如图4所示，普通的关节点携带的信息用于描述关节在三维空间中的位置，即关节信息不包括关节点之间存在的多元关联关系；现实中的对象之间的关系往往是复杂的多元关联关系；因此，如果将普通关节点携带的一元关系转换为多元关系，那么将会产生很多有用的信息；HyperJoint和普通关节点的区别不仅仅是关节点的维数不同，相对于普通关节点而言，HyperJoint更加准确地描述存在关联的多元对象之间的关系。As shown in Figure 4, the information carried by common joint points is used to describe the position of joints in three-dimensional space, that is, joint information does not include the multivariate relationship between joint points; the relationship between objects in reality is often complex The multivariate relationship; therefore, if the univariate relationship carried by the common joint points is converted into a multivariate relationship, then a lot of useful information will be generated; the difference between HyperJoint and common joint points is not only the dimension of the joint points. In terms of joints, HyperJoint more accurately describes the relationship between multiple objects that are related.

S21、根据普通关节的依赖关系进行超关节构建，并根据超关节和普通关节构建骨架信息；S21, constructing super-joints according to the dependencies of common joints, and constructing skeleton information according to super-joints and common joints;

构成人体的骨架图用变量G(V,H)表示，其中，V是构成人体骨架图的空间节点集合(普通关节)，H是给人体骨架关节上建立的依赖关系(超关节)；节点信息V和依赖关系H通过下面公式表示：The skeleton graph that constitutes the human body is represented by the variable G(V, H), where V is the set of spatial nodes (common joints) that constitute the skeleton graph of the human body, and H is the dependency (hyper-joint) established on the human skeleton joints; node information V and dependency H are expressed by the following formula:

S22、骨架数据的结构化时空特征学习模型；S22. A structured spatiotemporal feature learning model for skeleton data;

在自然环境中，人的行为是动态连续的；然而，在采集设备中人的完整行为被分成了在时间轴上的一系列人体姿势，而且人体姿势还能被进一步分解成在三维空间中的人体骨骼点，由于一个连续动作在形成的过程中展现了从局部到整体的层次结构，因此，将卷积神经网络的时空特征学习模块用于在骨架数据上人体动作特征的提取；In the natural environment, human behavior is dynamic and continuous; however, in the acquisition device, the complete human behavior is divided into a series of human poses on the time axis, and human poses can be further decomposed into three-dimensional space. For human skeleton points, since a continuous action shows a hierarchical structure from local to whole in the process of forming, the spatiotemporal feature learning module of convolutional neural network is used to extract human action features from skeleton data;

特征学习模块由两个主要的功能块构成，分为是节点时序特征学习块和空间全局特征学习块；其中，节点时序特征学习块处理骨架节点的时序特征，空间全局特征学习块处理骨架节点的空间信息分布特征；通过叠加时空块，构建特征学习模块，从骨架数据中学习原始骨架节点和超关节节点间依赖关系的有效时空特征；The feature learning module consists of two main functional blocks, which are divided into node timing feature learning block and spatial global feature learning block; among them, the node timing feature learning block deals with the timing features of the skeleton nodes, and the spatial global feature learning block deals with the Spatial information distribution features; by stacking spatiotemporal blocks, a feature learning module is constructed to learn effective spatiotemporal features of dependencies between original skeleton nodes and hyperjoint nodes from skeleton data;

进一步的，通过节点时序特征学习块(JTFLB)学习人体骨架上的元素在时序上的关系；节点时序特征学习块如图5所示，节点时序特性学习块是将人体的骨架分解为一系列基本组成元素，每个组成元素在运动时有各自的运动轨迹，这些轨迹具有时间域上的语义特征，对于节点时序特征学习块，输入数据是一个三阶张量

其中，N表示构成人体骨架的元素个数，C表示由元素维度构成的通道数量，T表示输入动作的帧数。Further, the relationship between elements on the human skeleton in terms of time sequence is learned through the node timing feature learning block (JTFLB). The node timing feature learning block is shown in Figure 5. The node timing feature learning block is to decompose the human body skeleton into a series of basic Component elements, each component element has its own trajectory when moving, and these trajectories have semantic features in the time domain. For the node timing feature learning block, the input data is a third-order tensor

Among them, N represents the number of elements constituting the human skeleton, C represents the number of channels composed of element dimensions, and T represents the number of frames of input actions.

首先，使用卷积块注意力模块对输入网络的张量增加注意力表示提升输入数据的特征表达能力；其中，卷积注意力模块由通道注意力模块和空间注意力模块构成；然后使用卷积核大小为(1×1)的卷积层在网络中自动学习骨架关节的位置特征；公式化定义为：First, the convolution block attention module is used to add attention to the tensors of the input network to improve the feature expression ability of the input data; among them, the convolution attention module is composed of a channel attention module and a spatial attention module; then the convolutional attention module is used. A convolutional layer with a kernel size of (1×1) automatically learns the positional features of the skeleton joints in the network; it is formulated as:

f_T(X)＝σ(φ(Attn(X))) (9)f _T (X)=σ(φ(Attn(X))) (9)

其中，X表示一个三阶张量

φ表示由卷积层结构的函数，σ表示ReLU激活函数，而Attn表示一个注意力区块。where X represents a third-order tensor

φ represents the function structured by convolutional layers, σ represents the ReLU activation function, and Attn represents an attention block.

最后，使用大小为(3×1)的卷积核将骨架关节位置上学习到特征信息在时间上聚合起来，输出是一个三阶张量

通过如下公式进行定义：Finally, a convolution kernel of size (3×1) is used to aggregate the feature information learned from the skeleton joint positions in time, and the output is a third-order tensor

It is defined by the following formula:

A＝φ(fT(X)) (10)A=φ(fT(X)) (10)

进一步的，通过空间全局特征学习块(SGFLB)学习人体骨架上的元素在运动上的关系，如图6所示，空间全局特征学习块是将人体运动的不同部位一起参与空间学习，因为人体运动的整体性和协同性具有空间上的语义特征；对于空间全局特征学习块，输入数据是一个三阶张量

其中，N是构成人体骨架的元素个数，C表示通道数量，T表示输入动作的帧数；Further, the spatial global feature learning block (SGFLB) is used to learn the relationship between the elements on the human skeleton in motion, as shown in Figure 6, the spatial global feature learning block is to participate in the spatial learning of different parts of the human body movement, because the human body movement The integrity and cooperativity of , have spatial semantic features; for the spatial global feature learning block, the input data is a third-order tensor

Among them, N is the number of elements constituting the human skeleton, C is the number of channels, and T is the number of frames of input actions;

f_S(A)＝φ(A') (11)f _S (A)=φ(A') (11)

其中，A'表示矩阵的变换，将一个三阶张量

变换成

transform into

研究骨架序列的长度对识别率的影响，在图7中为使用两种规模数据集的帧数统计直方图；图7(a)中为UTD-MHAD数据集的帧数统计直方图，从图7(a)中可以看出UTD-MHAD数据集的帧数分布区间为[40,125]；图7(b)中为NTU RGB+D数据集主要的帧数统计直方图，从图7(b)中看出NTU RGB+D数据集主要的帧数分布区间为[25,200]；为了有效探索帧率对实验结果的影响，将帧数分成了为32，64和128三个级别，由于UTD-MHAD是小规模数据集，所以额外加入帧数为16时的实验；表1中显示了实验结果，在这个表格中我们可以清晰地看到选取不同长度的骨架序列对应的识别率的变化。To study the effect of the length of the skeleton sequence on the recognition rate, in Figure 7 is the statistical histogram of the number of frames using the two scale datasets; Figure 7(a) is the statistical histogram of the number of frames in the UTD-MHAD dataset, from Figure 7 It can be seen in 7(a) that the frame number distribution interval of the UTD-MHAD data set is [40,125]; Figure 7(b) is the main frame number statistical histogram of the NTU RGB+D data set, from Figure 7(b) It can be seen that the main frame number distribution interval of the NTU RGB+D data set is [25, 200]; in order to effectively explore the impact of the frame rate on the experimental results, the number of frames is divided into three levels: 32, 64 and 128. Because UTD-MHAD It is a small-scale data set, so an additional experiment when the number of frames is 16 is added; the experimental results are shown in Table 1. In this table, we can clearly see the change of the recognition rate corresponding to the selection of skeleton sequences of different lengths.

表1帧数对分类性能的影响Table 1 Influence of the number of frames on the classification performance

随着帧数的增加，两个数据集上的识别率发生了明显的变化，在UTD-MHAD数据集中，当帧数为16时CS上的识别率仅为43.35％，随着帧数增加到32，识别率增加到77.67％；当帧数小于32时，在中心帧数为16的这个帧数区间内给模型提供的信息不足以实现对整个动作的表达；当帧数从32增加到64时，识别率下降到76.74％；在这段区间内增加的动作帧没能给模型提供有判别力的信息，相反由于动作帧增加造成冗余导致了精度的下降；随着帧数从64增加到128，识别率上升到了91.16％，因此，在UTD-MHAD数据集中，由于动作序列中间部分的冗余较大，当取得足够多的帧数时会使模型更具有判别力；As the number of frames increases, the recognition rate on the two datasets changes significantly. In the UTD-MHAD dataset, the recognition rate on CS is only 43.35% when the number of frames is 16. As the number of frames increases to 32, the recognition rate is increased to 77.67%; when the number of frames is less than 32, the information provided to the model in the frame number interval with the center frame number of 16 is not enough to express the entire action; when the number of frames increases from 32 to 64 , the recognition rate dropped to 76.74%; the action frames added in this interval failed to provide discriminative information to the model, on the contrary, the redundancy caused by the increase of action frames resulted in a decrease in accuracy; as the number of frames increased from 64 By 128, the recognition rate has risen to 91.16%. Therefore, in the UTD-MHAD dataset, due to the large redundancy in the middle part of the action sequence, when enough frames are obtained, the model will be more discriminative;

在NTU RGB+D数据集中，两个实验策略的识别率随着帧数的增加也在上升，因此更高的帧数确实给模型带来了更多有判别力的信息；为保证模型参数的一致性，将从两个数据集采样到的帧数都取128帧，以验证本发明方法在大规模数据上的鲁棒性。In the NTU RGB+D dataset, the recognition rates of the two experimental strategies also increase with the increase of the number of frames, so the higher number of frames does bring more discriminative information to the model; in order to ensure the accuracy of the model parameters For consistency, 128 frames are sampled from both datasets to verify the robustness of the method of the present invention on large-scale data.

单一模态在分类中的效果有限，因此将深度序列和骨架序列进行融合提高识别精度；表2显示了融合前后的识别精度，本发明方法在两个数据集的识别精度都有明显的提高；为了便于分析融合深度特征的改进效果，图8-10中列出本发明方法在两个数据集的具体动作上精度的变化。The effect of a single modality in classification is limited, so the depth sequence and the skeleton sequence are fused to improve the recognition accuracy; Table 2 shows the recognition accuracy before and after fusion, and the method of the present invention has significantly improved recognition accuracy in both datasets; In order to facilitate the analysis of the improvement effect of the fusion depth feature, Figures 8-10 list the variation of the accuracy of the method of the present invention on the specific actions of the two data sets.

图8中，本发明方法对UTD-MHAD数据集中的各个动作类别的识别精度都有所提高；其中，编号为6的动作在融合深度特征后改善最大，其识别精度提高了近30％。在提高识别精度的动作中，大部分动作识别精度提高了5％左右；图8中发现，本发明方法对编号为7、12、13、17、18、22、24、25和26的动作具有非常好的识别精度。In Figure 8, the method of the present invention has improved the recognition accuracy of each action category in the UTD-MHAD data set; among them, the action numbered 6 has the greatest improvement after fusion of depth features, and its recognition accuracy is increased by nearly 30%. In the actions to improve the recognition accuracy, the recognition accuracy of most actions is improved by about 5%; it is found in FIG. Very good recognition accuracy.

本发明方法在NTU RGB+D数据集上也有很好的效果，图9中为实验策略CS上的结果，本发明模型对编号为1，10，11，12，13，16，17，和20的动作的识别精度有显著的提升；图10中为实验策略CV上的结果，融合后的模型对编号为4，11，12，28，29，和30的动作的识别精度有显著的提升；综合来看，融合后的模型在CS上的精度提升了6.76％，在CV上的精度提升了3.28％。The method of the present invention also has a good effect on the NTU RGB+D data set. Figure 9 shows the results of the experimental strategy CS. The model pairs of the present invention are numbered 1, 10, 11, 12, 13, 16, 17, and 20 The recognition accuracy of the actions has been significantly improved; Figure 10 shows the results of the experimental strategy CV, and the fusion model has significantly improved the recognition accuracy of actions numbered 4, 11, 12, 28, 29, and 30; Taken together, the fused model improves the accuracy by 6.76% on CS and 3.28% on CV.

表2多模态性能Table 2 Multimodal performance

将本发明方法与现有技术的方法在UTD-MHAD数据集上进行对比，结果如表3所示：The method of the present invention is compared with the method of the prior art on the UTD-MHAD data set, and the results are as shown in Table 3:

表3与现有技术方法在UTD-MHAD数据集上对比Table 3 is compared with the prior art method on the UTD-MHAD dataset

本发明方法和其他方法在UTD-MHAD上的结果如表3所示，本发明方法的识别率最高；表3中的其他方法被分为基于手工特征的方法和基于深度学习的方法；和BayesianGC-LSTM方法比较，精度提升了4.64％；基于骨架边缘的网络比较，精度提升了1.15％。The results of the method of the present invention and other methods on UTD-MHAD are shown in Table 3, the method of the present invention has the highest recognition rate; other methods in Table 3 are divided into methods based on handcrafted features and methods based on deep learning; and BayesianGC Compared with the LSTM method, the accuracy is improved by 4.64%; compared with the network based on the skeleton edge, the accuracy is improved by 1.15%.

本发明在独立的关节和邻接点之间建立显示依赖关系，这一特征可以用来描述人体行为过程中局部的姿态和运动幅度；为了更好地学习骨架序列在空间和时间上的语义特征，我们设计了两个功能模块，JTFLB和SGFLB，JTFLB学习骨架节点的时序特征，SGFLB在空间域上将从骨架节点上学习到的时序特征聚合起来。The present invention establishes a display dependency between independent joints and adjacent points, and this feature can be used to describe the local posture and motion range in the process of human behavior; in order to better learn the semantic features of the skeleton sequence in space and time, We design two functional modules, JTFLB and SGFLB. JTFLB learns the temporal features of skeleton nodes, and SGFLB aggregates the temporal features learned from skeleton nodes in the spatial domain.

以上述依据本发明的理想实施例为启示，通过上述的说明内容，相关工作人员完全可以在不偏离本项发明技术思想的范围内，进行多样的变更以及修改。本项发明的技术性范围并不局限于说明书上的内容，必须要根据权利要求范围来确定其技术性范围。Taking the above ideal embodiments according to the present invention as inspiration, and through the above description, relevant personnel can make various changes and modifications without departing from the technical idea of the present invention. The technical scope of the present invention is not limited to the contents in the specification, and the technical scope must be determined according to the scope of the claims.

Claims

1. a super-joint and a multimodal network and a method for recognizing behavior thereof, are characterized in that, comprise the following steps:

S1 collects the depth map of the human body, uses the DMMS stream to perform feature extraction on the depth map, and calculates the depth data prediction score;

S2. Collect the human skeleton sequence, extract the original joint and hyper-joint data respectively, combine the hyper-joint and ordinary joints to construct the skeleton information, send the skeleton information into the structured spatiotemporal feature learning model, and obtain the static and dynamic joint data flow, static and Dynamic super-joint data flow, and fuse the data flow of original joint and super-joint with adaptive weights to obtain joint data prediction score and super-joint prediction score;

S3. Add the classification prediction scores of the DMMS flow and the skeleton flow to generate a final prediction score.

2. super-joint and multimodal network according to claim 1 and its method for recognizing in behavior, it is characterized in that, described step S2 comprises:

S21, constructing super-joints according to the dependencies of common joints, and constructing skeleton information according to super-joints and common joints;

S22, sending the skeleton information into the structured spatiotemporal feature learning model.

3. super-joint and multimodal network according to claim 2 and its method for recognizing in behavior, it is characterized in that, described super-joint construction comprises:

First, calculate RWE and RWH, which are the direction vectors of RW pointing to RE and RH, respectively. The calculation formula is as follows:

Among them, RW, RE and RH represent the right wrist, right elbow and right hand in the human skeleton, respectively;

Then, the Cartesian product of the two intersecting vectors is calculated to obtain the normal vector n of the plane where the vectors are located. The calculation formula is as follows:

Then, using the direction vectors of the two bones with RE as the starting point, calculate the angle between the two direction vectors. The calculation formula is as follows:

Finally, the hyperjoint data vector HyperJoint is obtained, and the calculation formula is:

4. The super-joint and multimodal network according to claim 3 and the method for recognizing the behavior thereof, wherein the construction of the skeleton information in combination with the super-joint and the common joint comprises:

The skeleton diagram that constitutes the human body is represented by the variable G(V,H), and the formula is:

Among them, V is the set of spatial nodes that constitute the human skeleton graph, and H is the dependency established on the joints of the human skeleton;

The dynamic information of the skeleton is defined as the difference between the skeleton elements on the adjacent skeletons on the skeleton sequence, and establishes a connection to the skeleton pose in the time domain. The formula is as follows:

5. The hyper-joint and multimodal network according to claim 2 and the method for recognizing the behavior thereof, wherein the sending the skeleton information into the structured spatiotemporal feature learning model comprises: utilizing the spatiotemporal feature of the convolutional neural network The learning module is used to extract human action features on the skeleton information.

6. The super-joint and multimodal network according to claim 5 and the method for recognizing it in action, wherein the spatiotemporal feature learning module comprises a node timing feature learning block and a spatial global feature learning block, and the spatiotemporal blocks are stacked by stacking the space-time feature learning block. , build a feature learning module to learn effective spatiotemporal features of dependencies between original skeleton nodes from skeleton information.

7. The super-joint and multimodal network according to claim 6 and the method for recognizing the behavior thereof, wherein: the construction of the node timing feature learning block comprises the steps:

First, use the convolution block attention module to increase attention to the tensors of the input network, which means to improve the feature expression ability of the input data;

Then, the location features of the skeleton joints are automatically learned in the network using a convolutional layer with a kernel size of (1×1); the formulation is defined as:

f _T (X)=σ(φ(Attn(X))) (9)

where X represents a third-order tensor

Finally, a convolution kernel of size (3×1) is used to aggregate the learned feature information on the skeleton joint positions in time, and the output is a third-order tensor

It is defined by the following formula:

A=φ(f _T (X)) (10)

8 . The hyper-joint and multimodal network and the method for recognizing the behavior thereof according to claim 6 , wherein the construction of the spatial global feature learning block comprises: using a convolution kernel with a size of (3×3) The semantic features of the coordinated movement of skeleton nodes in the spatial domain are automatically learned on the network, and the convolution operation can aggregate the spatial structure information of the elements that constitute the human body. The formula is defined as:

f _S (A)=φ(A') (11)

transform into