Abstract

Objective: Skeleton-based action recognition has attracted much attention in recent years, as the dynamics of human skeletons carry significant information for action recognition. A skeleton action can be seen as a time series of human poses, or equivalently as a combination of joint trajectories. Among all human joints, the trajectories of the important joints that indicate the action class convey the most significant information. When the same action is performed in different attempts, the trajectories of corresponding joints are subject to distortions: two similar trajectories of corresponding joints share a basic shape, yet appear with diverse distortions caused by individual factors. These distortions arise from both spatial and temporal factors. Spatial factors include changes of viewpoint, different skeleton sizes, and different action amplitudes, while temporal factors refer to time scaling along the sequence, i.e., the order and speed of performing a specific action. All spatial factors can be modeled by an affine transformation in 3D space, while uniform time scaling, the most commonly discussed temporal case, can be seen as an affine transformation in 1D space. We combine these two kinds of distortion into the spatio-temporal dual affine transformation, and propose a novel feature invariant under it to facilitate skeleton-based action recognition: a feature that is invariant to the spatio-temporal affine transformation helps identify similar trajectories and thus benefits action recognition. Method: We propose a general method for constructing spatio-temporal dual affine differential invariants (STDADI). The invariants are obtained as rational polynomials of the derivatives of joint trajectories, which effectively eliminate the transformation parameters.
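The distortion model described above can be made concrete with a minimal sketch. This is an illustration of the spatio-temporal dual affine transformation itself, not of the STDADI construction; the trajectory and all parameter values are synthetic stand-ins.

```python
import numpy as np

def spatial_affine(traj, A, b):
    """Apply a 3D affine transformation x' = A x + b to every point
    of a joint trajectory of shape (T, 3)."""
    return traj @ A.T + b

def temporal_affine(t, a, c):
    """Apply a 1D affine transformation t' = a t + c to the time axis
    (uniform time scaling plus a shift)."""
    return a * t + c

# Toy joint trajectory: T frames of 3D coordinates (a spiral, as in the
# synthetic experiment described in the abstract).
T = 100
t = np.linspace(0.0, 1.0, T)
traj = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t), t], axis=1)

# Random distortion parameters; A is kept well-conditioned so the
# spatial map is a valid (invertible) affine transformation.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)
b = rng.normal(size=3)
a, c = 1.5, 0.2

distorted = spatial_affine(traj, A, b)  # same shape (T, 3)
t_new = temporal_affine(t, a, c)        # rescaled time axis
```

A spatio-temporal invariant such as STDADI should evaluate to the same values on `traj` over `t` and on `distorted` over `t_new`.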
The resulting feature is robust, independent of the coordinate system, and computed directly from the 3D joint coordinates. By bounding the degree of the polynomials and the order of the derivatives, we generate 8 independent STDADIs and combine them into an invariant vector at each moment for each joint. Moreover, we propose an intuitive and effective method called channel augmentation, which extends the input data with STDADI along the channel dimension for both training and evaluation. Specifically, the coordinate vector and the STDADI vector are concatenated at each joint for each frame. Channel augmentation introduces invariant information into the input data without changing the internal structure of the neural network. We adopt the spatio-temporal graph convolutional network (ST-GCN) as the basic network: it models skeleton data as a graph structure that involves spatial and temporal connections between joints simultaneously, and thereby exploits local patterns and correlations in human skeletons. In other words, the importance of joints along the action sequence is expressed as joint weights in the spatio-temporal graph. This is in line with our STDADI, because both focus on describing joint dynamics, and our features further provide an invariant expression that is unaffected by the distortions. Result: We verify the effectiveness of STDADI on synthetic data as well as on large-scale action recognition datasets. First, a 3D spiral line and a selected joint trajectory from NTU-RGB+D, subjected to random transformation parameters, show that STDADI is invariant under spatio-temporal affine transformations.
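The channel augmentation described above amounts to a concatenation along the channel axis. A minimal sketch, assuming the common ST-GCN input layout of (channels, frames, joints, persons); the random arrays stand in for the real coordinates and STDADI values, which are not specified here:

```python
import numpy as np

def channel_augment(coords, invariants):
    """Concatenate the 3-channel coordinates with the 8-channel STDADI
    vector along the channel dimension.

    coords: (3, T, V, M); invariants: (8, T, V, M) -> (11, T, V, M)
    """
    return np.concatenate([coords, invariants], axis=0)

# Toy shapes roughly matching NTU-RGB+D: 300 frames, 25 joints, up to
# 2 persons. Values are random stand-ins, not real skeleton data.
T, V, M = 300, 25, 2
coords = np.random.randn(3, T, V, M)
invariants = np.random.randn(8, T, V, M)  # placeholder for STDADI values

augmented = channel_augment(coords, invariants)
print(augmented.shape)  # (11, 300, 25, 2)
```

Because only the input channel count changes, the network's internal graph structure is untouched; only the first convolution layer needs to accept 11 input channels instead of 3.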
Next, the effectiveness of the proposed feature and method is validated on the large-scale action recognition dataset NTU (Nanyang Technological University) RGB+D (NTU 60) and its extended version NTU-RGB+D 120 (NTU 120), currently the largest dataset with 3D joint annotations captured in a constrained indoor environment, and we perform detailed studies to examine the contributions of STDADI. The original ST-GCN and a data augmentation technique serve as the baseline methods; the data augmentation involves rotation, scaling, and shear transformations of the 3D skeletons. We use the same training strategy and hyper-parameters as the original ST-GCN. ST-GCN with channel augmentation performs well: compared with ST-GCN on raw data, the cross-subject and cross-view recognition accuracies on NTU 60 increase by 1.9% and 3.0%, respectively; on NTU 120, the cross-subject and cross-setup accuracies increase by 5.6% and 4.5%, respectively. Because the data augmentation mainly consists of 3D geometric transformations, it improves cross-view recognition considerably but contributes little in the cross-subject setting. The spatio-temporal dual affine transformation assumption is validated under both evaluation criteria. Conclusion: We propose a general method for constructing spatio-temporal dual affine differential invariants (STDADI), and demonstrate the effectiveness of this invariant feature, via a channel augmentation technique, on the large-scale action recognition datasets NTU-RGB+D and NTU-RGB+D 120. The combination of hand-crafted features and data-driven methods improves both accuracy and generalization.