Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition
2020 CVPR | code
Abstract
由于人体骨骼数据易于获取,基于骨骼的人体动作识别引起了极大的兴趣。最近,有一种趋势是使用非常深的前馈神经网络对关节的 3D
坐标进行建模,而不考虑计算效率。在这项工作中,我们提出了一个简单而有效的语义引导神经网络(SGN
)。我们显式地将关节的高级语义(关节类型
和帧索引
)引入网络以增强特征表示能力。此外,我们通过两个模块来层次化地利用关节之间的关系,即关节级别模块和框架级别模块,前者(joint-level module
)用于建模同一帧中关节的相关性,后者(frame-level module
)用于将同一帧中的关节作为一个整体来建模帧之间的依赖关系。提出了一个强有力的基线,以促进这一领域的研究。与大多数以前的工作相比,SGN
的模型尺寸小了一个数量级,在NTU60
、NTU120
和SYSU
数据集上实现了最先进的性能。Skeleton-based human action recognition has attracted great interest thanks to the easy accessibility of the human skeleton data. Recently, there is a trend of using very deep feedforward neural networks to model the 3D coordinates of joints without considering the computational efficiency. In this paper, we propose a simple yet effective semantics-guided neural network (SGN) for skeleton-based action recognition. We explicitly introduce the high level semantics of joints (joint type and frame index) into the network to enhance the feature representation capability. In addition, we exploit the relationship of joints hierarchically through two modules, i.e., a joint-level module for modeling the correlations of joints in the same frame and a frame-level module for modeling the dependencies of frames by taking the joints in the same frame as a whole. A strong baseline is proposed to facilitate the study of this field. With an order of magnitude smaller model size than most previous works, SGN achieves the state-of-the-art performance on the NTU60, NTU120, and SYSU datasets.
1. Introduction
人体行为识别有着广泛的应用场景,如人机交互和视频检索。近年来,基于骨骼的动作识别受到越来越多的关注。骨架是一种结构良好的数据类型,人体的每个关节都由关节类型、帧索引和3D
位置标识。使用骨架进行动作识别有几个优点。首先,骨架是人体的高级表示,抽象了人体姿势和运动。在生物学上,即使没有外观信息,人类也能够通过仅观察关节的运动来识别动作类别。其次,具有成本效益的深度相机和姿态估计技术的进步使骨骼的访问变得更加容易。第三,与 RGB
视频相比,骨架表示对视点和外观的变化具有鲁棒性。第四,由于低维表示,它在计算上也是有效的。此外,基于骨架的动作识别也是基于 RGB
的动作识别的补充。在这项工作中,我们专注于基于骨架的动作识别。Human action recognition has a wide range of application scenarios, such as human-computer interaction and video retrieval. In recent years, skeleton-based action recognition is attracting increasing interests. Skeleton is a type of well structured data with each joint of the human body identified by a joint type, a frame index, and a 3D position. There are several advantages of using the skeleton for action recognition. First, skeleton is a high level representation of the human body with the human pose and motion abstracted. Biologically, human is able to recognize the action category by observing only the motion of joints even without appearance information. Second, the advance of cost effective depth cameras and pose estimation technology make the access of skeleton much easier. Third, compared with RGB video, the skeleton representation is robust to variation of viewpoint and appearance. Fourth, it is also computationally efficient because of low dimensional representation. Besides, skeleton-based action recognition is also complementary to the RGB-based action recognition. In this work, we focus on skeleton-based action recognition.
对于基于骨架的动作识别,深度学习被广泛用于对骨架序列的时空演化进行建模,目前已经有了各种网络结构,例如循环神经网络(RNN
)、卷积神经网络(CNN
)和图卷积网络( GCN
)。在早期,RNN/LSTM
是用于开发短期和长期时间动态的首选网络。最近,有一种趋势是使用前馈(即非循环)卷积神经网络对语音、语言和骨架中的序列进行建模,因为它们具有卓越的性能。大多数基于骨架的方法将关节的坐标组织到 2D
映射中,并将映射的大小调整为适合 CNN
输入(例如 ResNet50
)的大小(例如 224×224
)。它的行/列对应于不同类型的关节/框架索引。在这些方法中,长期依赖和语义信息有望被深度网络的大感受野捕获。但这些方法通常会导致模型复杂度高。For skeleton-based action recognition, deep learning is widely used to model the spatio-temporal evolution of the skeleton sequenceVarious network structures have been exploited, such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Graph Convolutional Networks (GCN). In the early years, RNN/LSTM was the favored network to be used to exploit the short and long term temporal dynamics. Recently, there is a trend of using feedforward (i.e., non-recurrent) convolutional neural networks for modeling sequences in speech, language, and skeleton due to their superior performance. Most skeleton-based approaches organize the coordinates of joints to a 2D map and resize the map to a size (e.g. 224×224) suitable for the input of a CNN (e.g. ResNet50). Its rows/columns correspond to the different types of joints/frames indexes. In these methods, long-term dependencies and semantic information are expected to be captured by the large receptive fields of deep networks. This appears to be brutal and typically results in high model complexity.
直观地说,语义信息,即关节类型和帧索引,对于动作识别非常重要。语义与动力学(如 3D
坐标)一起揭示了人体关节的空间和时间配置/结构。众所周知,具有相同坐标但语义不同的两个关节会传递非常不同的信息。比如头顶的一个关节,如果这个关节是手关节,那么动作很可能是举手;如果是脚关节,动作可能是踢腿。此外,时间信息对于动作识别也很重要。以坐下和起身这两个动作为例,它们的区别仅在于帧的出现顺序。然而,大多数方法都忽略了语义信息的重要性并没有充分探索它。Intuitively, semantic information, i.e., the joint type and the frame index, is very important for action recognition. Semantics together with dynamics (i.e., 3D coordinates) reveal the spatial and temporal configuration/structure of human body joints. As we know, two joints of the same coordinates but different semantics would deliver very different information. For example, for a joint above the head, if this joint is a hand joint, the action is likely to be raising hand; if it is a foot joint, the action may be kicking a leg. Besides, the temporal information is also important for action recognition. Taking the two actions of sitting down and standing up as examples, they are different only in occurrence order of the frames. However, most approaches overlook the importance of the semantic information and underexplore it.
为了解决当前方法的上述局限性,我们提出了一种语义引导的神经网络(SGN
),它明确地利用语义和动力学来进行高效的基于骨架的动作识别。图 2
显示了整体框架。我们通过顺序探索骨架序列的关节级和帧级依赖关系来构建分层网络。为了更好的关节级相关建模,除了动力学之外,我们将关节类型(例如,“头部”和“臀部”)的语义合并到 GCN
层中,从而实现内容自适应图的构建和每个关节之间的有效消息传递框架。为了更好的帧级相关建模,我们将时间帧索引的语义结合到网络中。特别是,我们对同一帧内关节的所有特征执行 Spatial MaxPooling
(SMP
) 操作,以获得帧级特征表示。结合嵌入的帧索引信息,使用两个时间卷积神经网络层来学习特征表示以进行分类。此外,我们开发了一个高性能和高效率的强大基线。由于语义信息的有效探索、分层建模和强大的基线,我们提出的 SGN
以更少的参数实现了最先进的性能。To address the above mentioned limitations of current approaches, we propose a semantics-guided neural network (SGN) which explicitly exploits the semantics and dynamics for high efficient skeleton-based action recognition. Fig. 2 shows the overall framework. We build a hierarchical network by sequentially exploring the joint-level and frame-level dependencies of the skeleton sequence. For better joint-level correlation modeling, besides the dynamics, we incorporate the semantics of joint type (e.g., ‘head’, and ‘hip’) to the GCN layers which enables the content adaptive graph construction and effective message passing among joints within each frame. For better frame-level correlation modeling, we incorporate the semantics of temporal frame index to the network. Particularly, we perform a Spatial MaxPooling (SMP) operation over all the features of the joints within the same frame to obtain framelevel feature representation. Combined with the embedded frame index information, two temporal convolutional neural network layers are used to learn feature representations for classification. In addition, we develop a strong baseline which is of high performance and efficiency. Thanks to the efficient exploration of semantic information, the hierarchical modeling, and the strong baseline, our proposed SGN achieves the state-of-the-art performance with a much smaller number of parameters.
本文主要三个贡献如下:We summarize our three main contributions as follows:
- 显式探索联合语义(帧索引和联合类型),以实现有效的基于骨架的动作识别。以前的工作忽略了语义的重要性,并依赖于高度复杂的深度网络进行动作识别。We propose to explicitly explore the joint semantics (frame index and joint type) for efficient skeleton-based action recognition. Previous works overlook the importance of semantics and rely on deep networks with high complexity for action recognition.
- 提出了一个语义引导的神经网络(
SGN
)来分层利用联合级别和帧级别的空间和时间相关性。We present a semantics-guided neural network (SGN) to exploit the spatial and temporal correlations at joint-level and frame-level hierarchically. - 开发了一个轻量级的强基线方法,它比大多数以前的方法更强大。我们希望强基线方法对基于骨架的动作识别的研究有所帮助。We develop a lightweight strong baseline, which is more powerful than most previous methods. We hope the strong baseline will be helpful for the study of skeletonbased action recognition.
有了上述技术贡献,我们得到了一个高性能、计算效率高的基于骨架的动作识别模型。广泛的消融研究证明了所提出的模型设计的有效性。在基于骨架的动作识别的三个最大的基准数据集上,所提出的模型始终比许多竞争算法实现了卓越的性能,同时模型大小比许多算法小一个数量级(见图 1)。With the above technical contributions, we have obtained a high performance skeleton-based action recognition model with high computational efficiency. Extensive ablation studies demonstrate the effectiveness of the proposed model design. On the three largest benchmark datasets for skeleton-based action recognition, the proposed model consistently achieves superior performances over many competing algorithms while having an order of magnitude smaller model size than many algorithms (see Fig. 1).
2. Related Work
基于骨架的动作识别最近引起了越来越多的关注。最近使用神经网络的作品明显优于使用手工制作特征的传统方法。Skeleton-based action recognition has attracted increasing attentions recently. Recent works using neural networks have significantly outperformed traditional approaches that use hand-crafted features.
Recurrent Neural Network based
RNN
,例如 LSTM
和 GRU
,通常用于对骨架序列的时间动态进行建模。帧中所有关节的 3D
坐标以某种顺序连接起来作为时隙的输入向量。他们没有明确告诉网络哪些维度属于哪个关节。其他一些基于 RNN
的工作倾向于在 RNN
中设计特殊的结构,以使其了解空间结构信息。Shahroudy
等人将 LSTM
的单元分为五个子单元,分别对应于五个身体部位,即躯干、两条手臂和两条腿。Liu
等人提出了一个时空 LSTM
模型,以利用时域和空间域中关节的上下文依赖性,它们在每个步骤中为不同类型的关节提供数据。在某种程度上,它们区分了不同的关节。Recurrent neural networks, such as LSTM and GRU, are often used to model the temporal dynamics of skeleton sequence. The 3D coordinates of all joints in a frame are concatenated in some order to be the input vector of a time slot. They do not explicitly tell the networks which dimensions belong to which joint. Some other RNN-based works tend to design special structures in RNN to make it aware of the spatial structural information. Shahroudy et al. divide the cell of LSTM into five sub cells corresponding to five body parts, i.e., torso, two arms, and two legs, respectively. Liu et al. propose a spatial-temporal LSTM model to exploit the contextual dependency of joints in both the temporal and spatial domain, where they feed different types of joints at each step. To some extent, they distinguish the different joints.
Convolutional Neural Network based
近年来,在语音、语言序列建模领域,卷积神经网络在准确率和并行度上都表现出优越性。基于骨架的动作识别也是如此。这些基于 CNN
的工作将骨架序列转换为具有一定目标大小的骨架图,然后使用流行的网络(例如 ResNet
)来探索空间和时间动态。一些作品通过将关节坐标 (x,y,z)
视为像素的 R
、G
和 B
通道,将骨架序列转换为图像。Ke
等人将骨架序列转换为四个二维数组,由四个选定的参考关节(即左/右肩
、左/右髋
)与其他关节之间的相对位置表示。骨架是结构良好的数据,具有明确的高级语义,即帧索引和关节类型。然而,CNN
的内核/过滤器是平移不变的,因此不能直接从这些输入骨架图中感知语义。CNN
期望通过深度网络的大感受野了解此类语义,但效率不高。In recent years, in the field of speech, language sequence modeling, convolutional neural networks demonstrate their superiority in both accuracy and parallelism. The same is true for skeleton-based action recognition. These CNN-based works transform the skeleton sequence to skeleton map of some target size and then use a popular network, such as ResNet, to explore the spatial and temporal dynamics. Some works transform a skeleton sequence to an image by treating the joint coordinate (x,y,z) as the R, G, and B channels of a pixel. Ke et al. transform the skeleton sequence to four 2D arrays, which are represented by the relative position between four selected reference joints (i.e., the left/right shoulder, the left/right hip) and other joints. Skeleton is well structured data with explicit high level semantics, i.e., frame index and joint type. However, the kernels/filters of CNNs are translation invariant and thus cannot directly perceive the semantics from such input skeleton maps. The CNNs are expected to be aware of such semantics through large receptive fields of deep networks, which is not very efficient.
Graph Convolutional Network based
图卷积网络已被证明可有效处理结构化数据,也已用于对结构化骨架数据进行建模。Yan
等人提出了一种时空图卷积网络。他们将每个关节视为图的一个节点。表示联合关系的边的存在是由人类基于先验知识预先定义的。为了增强预定义图,Tang
等人定义物理断开和连接的关节对的边,以便更好地构建图形。提出了一种 SR-TSL
模型,以使用数据驱动方法而不是利用人类定义来学习每帧内五个人体部位的图形边缘。双流 GCN
模型基于非本地块学习内容自适应图,并使用它在 GCN
层中传递消息。然而,信息语义并未用于学习 GCN
的图边和消息传递,这使得网络效率较低。Graph convolutional networks, which have been proven to be effective for processing structured data, have also been used to model the structured skeleton data. Yan et al. propose a spatial and temporal graph convolutional network. They treat each joint as a node of the graph. The presence of edge denoting the joint relationship is pre-defined by human based on prior knowledge. To enhance the predefined graph, Tang et al. define the edges for both physically disconnected and connected joint pairs for better constructing the graph. A SR-TSL model is proposed to learn the graph edge of five human body parts within each frame using a datadriven method instead of leveraging human definition. A two-stream GCN model learns a content adaptive graph based on the non-local block and uses it to pass messages in GCN layers. However, the informative semantics is not utilized for learning the graph edge and message passing of GCN, which makes the network less efficient.
Explicit Exploration of Semantics Information
语义的显式探索已在其他领域得到利用,例如机器翻译和图像识别。Ashish
等人显式编码序列中标记的位置,以利用机器翻译任务中的序列顺序。Zheng
等人将组索引编码为卷积通道表示以保留组顺序信息。然而,对于基于骨架的动作识别,即使这些信息非常重要,也忽略了关节类型和帧索引语义。在我们的工作中,我们建议对关节类型和帧索引进行显式编码,以保留空间和时间身体结构的重要信息。作为探索这种语义的初步尝试,我们希望它能激发这个领域更多的调查和探索。The explicit exploration of semantics has been exploited in other fields, e.g., machine translation and image recognition. Ashish et al. explicitly encode the position of the tokens in the sequence to make use of the order of the sequence in machine translation tasks. Zheng et al. encode the group index into convolutional channel representation to preserve the information of group order. For skeleton-based action recognition, however, the joint type and frame index semantics are overlooked even though such information is very important. In our work, we propose to explicitly encode the joint type and frame index to preserve the important information of the spatial and temporal body structure. As an initial attempt to explore such semantics, we hope it will inspire more investigation and exploration in the community.
3. Semantics-Guided Neural Networks
对于骨架序列,我们通过其语义(关节类型和帧索引)识别关节,并将其与动力学(位置/3D
坐标和速度)一起表示。没有语义,骨架数据将失去重要的时空结构。然而,以前基于 CNN
的工作通常通过将语义隐藏在 2D
骨架图中(例如,对应于不同类型关节的行和对应于帧索引的列)来忽略语义。For a skeleton sequence, we identify a joint by its semantics (joint type and frame index) and represent it together with its dynamics (position/3D coordinates and velocity). Without semantics, the skeleton data will lose the important spatial and temporal structure. Previous CNN-based works, however, typically overlook the semantics by implicitly hiding them in the 2D skeleton map (e.g. with rows corresponding to the different types of joints and columns corresponding to the frame indexes).
我们提出了一种用于基于骨架的动作识别的语义引导神经网络(SGN
),并在图 2 中展示了整体的端到端框架。它由一个关节级模块和一个帧级模块组成。我们在以下小节中描述了框架的细节。We propose a semantics-guided neural network (SGN) for skeleton-based action recognition and show the overall end-to-end framework in Fig. 2. It consists of a joint-level module and a frame-level module. We describe the details of the framework in the following subsections.
具体来说,对于骨架序列,我们将所有关节表示为一个集合,其中 表示在时间 的类型 的关节。 表示骨架序列的帧数,表示一帧中人体关节的总数。对于在时间 的给定类型 的关节 ,它可以通过其动力学和语义来识别。动力学与关节的 3D
位置有关。语义意味着帧索引 和关节类型 。Specifically, for a skeleton sequence, we denote all the joints as a set , where denotes the joint of type at time . denotes the number of frames of the skeleton sequence and denotes the total number of joints of a human body in a frame. For a given joint of type at time , it can be identified by its dynamics and semantics. Dynamics are related to the 3D
position of a joint. Semantics means the frame index and joint type .
3.1. Dynamics Representation
Dynamics Representation
模块包含两个全连接层,使用ReLU
作为激活函数,获得位置的编码。同样的,可以获得速度的编码。融合位置编码和速度编码可得,为了方便,称其为位置速度融合编码。
For a given joint , we define its dynamics by the position in the 3D
coordinate system, and the velocity . We encode/embed the position and velocity into the same high dimensional space, i.e., and , respectively, and fuse them together by summation as
where is the dimension of the joint representation. Take the embedding of position as an example, we encode the position using two fully connected (FC
) layers as
where and are weight matrices, and are the bias vectors, denotes the ReLU
activation function. Similarly, we obtain the embedding for velocity as .
3.2. Joint-level Module
我们设计了一个关节级模块来利用同一帧中关节的相关性。我们采用图卷积网络(GCN
)来探索结构骨架数据的相关性。以前一些基于 GCN
的方法将关节作为节点,它们根据先验知识预先定义图连接(边)或学习内容自适应图。我们同样也学习了内容自适应图,但不同的是,我们将联合类型的语义合并到 GCN
层中,以实现更有效的学习。We design a joint-level module to exploit the correlations of joints in the same frame. We adopt graph convolutional networks (GCN) to explore the correlations for the structural skeleton data. Some previous GCN-based approaches take the joints as nodes and they pre-define the graph connections (edges) based on prior knowledge or learn a content adaptive graph. We also learn a content adaptive graph, but differently we incorporate the semantics of joint type to the GCN layers for more effective learning.
我们通过从两个方面充分利用语义来增强 GCN
层的功能。首先,我们使用关节类型的语义和动态特性来学习帧内节点(不同关节)之间的图连接。关节类型信息有助于学习合适的相邻矩阵(即关节之间的连接权重关系)。以脚和手两个源关节和一个目标关节头为例,直观地说,即使在脚和手的动态特性相同的情况下,脚到头的连接权重值也应该不同于手到头的值。其次,作为关节信息的一部分,关节类型的语义参与了 GCN
层中的消息传递过程。We enhance the power of GCN layers by making full use of the semantics from two aspects. First, we use the semantics of joint type and the dynamics to learn the graph connections among the nodes (different joints) within a frame. The joint type information is helpful for learning suitable adjacent matrix (i.e., relations between joints in terms of connecting weights). Take two source joints, foot and hand, and a target joint head as an example, intuitively, the connection weight value from foot to head should be different from the value from hand to head even when the dynamics of foot and hand are the same. Second, as part of the information of a joint, the semantics of joint types takes part in the message passing process in GCN layers.
使用独热编码表示关节 ,并且获取的编码。We denote the type of the joint (also referred to as type ) by a one-hot vector , where the dimension is one and the others are all zeros. Similar to the encoding of position as in Equ.(2)
, we obtain the embedding of the joint type as .
位置速度融合编码和节点编码是通过concatenate
实现的,即。Given joints of a skeleton frame, we build a graph of nodes. We denote the joint representation of joint type at frame with both the dynamics and the semantics of joint type as . All the joints of frame are then represented by .
节点与节点之间的相似度是通过网络学习出来的,对的每一行做SoftMax
可以得到图的邻接矩阵。Similar to , the edge weight from the joint to the joint in the same frame is modeled by their similarity/affinity in the embeded space as
where and denote two transformation functions, each implemented by an layer, i.e., and .
By computing the affinities of all the joint pairs in the same frame based on (3), we obtain the adjacency matrix . Normalization using SoftMax
as is performed on each row of so that the sum of all the edge values connected to a target node is . We denote the normalized adjacency matrix by .
图2连接方式实现了残差图卷积。GCN
的原论文实现方式为,使用renormalization trick
: 得到。如renormalization trick
所示,。
SGN
对GCN
公式做出了变形,,若,既有,就可以和GCN
原公式对应上了。输出,,此时融合了各节点的位置信息、速度信息、关节信息。
A residual graph convolution layer is used to realize the massage passing among nodes as
where and are transformation matrices. The weight matrices are shared for different temporal frames. is the output. Note that one can stack multiple residual graph convolution layers to enable further message passing among nodes with the same adjacency matrix
.
3.3. Frame-level Module
我们设计了一个帧级模块来利用跨帧的相关性。为了让网络知道帧的顺序,我们结合了帧索引的语义来增强帧的表示能力。We design a frame-level module to exploit the correlations across frames. To make the network know the order of frames, we incorporate the semantics of frame index to enhance the representation capability of a frame.
帧索引也采用了独热编码,同样的,获得帧的编码。与Joint-level Module
模块融合节点编码采用concatenate
不同,Frame-level Module
融合帧编码采用了add
操作,即。We denote the frame index by a one-hot vector . Similar to the encoding of position as in Equ.(2)
, we obtain the embedding of the frame index as . We denote the joint representation corresponding to joint type at frame with both the semantics of frame index and the learned feature as , where .
使用SMP
下采样合并同一帧中的所有关节信息,得到输出。To merge the information of all joints in a frame, we apply one spatial MaxPooling
layer to aggregate them across the joints. The dimension of feature of the sequence is thus .
接下来跟随着两个CNN层,第一个CNN层表示为时间卷积,用来建模帧之间的依赖;第二个CNN层是使用卷积核进行升维来增强对特征的表示能力。Two layers are applied. The first layer is a temporal convolution layer to model the dependencies of frames. The second CNN layer is used to enhance the representation capability of learned features by mapping it to a high dimension space with kernel size of 1.
之后使用TMP
层合并所有帧的信息,输出。接下来,通过全连接层,Softmax
分类得到网络最终输出。After the two CNN layers, we apply a temporal MaxPooling
layer to aggregate the information of all frames and obtain the sequence level feature representation of dimensions. This is then followed by a fully connected layer with Softmax
to perform the classification.
文章出处登录后可见!