This book presents a series of gesture and action recognition methods based on multimodal data representation. The data modalities include image data and skeleton data, and the modeling methods include traditional codebooks, topological graphs, and LSTM architectures. The tasks span different levels of complexity, including single-gesture classification, single-action classification, continuous gesture classification, and complex behavior classification of human interaction. This book focuses on the data processing methods for each modality and the modeling methods for different tasks. We hope readers can learn basic gesture and action recognition methods from this book and, on this basis, develop model systems that suit their own needs. This book can be used as a textbook for graduate, postgraduate, and PhD students majoring in computer science, automation, and related fields. It can also serve as a reference for readers who are interested in gesture recognition, human action interaction, sequence data processing, and deep neural network design, and who hope to contribute to these fields.
Chapter 1 Human Action Recognition Using Multi-layer Codebooks of Key Poses and Atomic Motions
1.1 Introduction
1.2 Related Work
1.2.1 Feature Representation
1.2.2 Classification Model
1.3 Construction of Multi-layer Codebook
1.3.1 Feature Representation
1.3.2 Feature Sequence Segmentation
1.3.3 Pose-layer Codebook
1.3.4 Motion-layer Codebook
1.3.5 Multi-layer Codebook Construction
1.4 Classification Methods
1.4.1 Naive Bayes Nearest Neighbor
1.4.2 Support Vector Machine
1.4.3 Random Forest
1.5 Experimental Results
1.5.1 Experiments on the CAD-60 Dataset
1.5.2 Experiments on the MSRC-12 Dataset
1.5.3 Discussion
1.6 Conclusion and Future Work
Acknowledgements
References
Chapter 2 Topology-learnable Graph Convolution for Skeleton-based Action Recognition
2.1 Introduction
2.2 Related Work
2.2.1 Graph Convolutional Network for Action Recognition
2.2.2 Adaptive Graph Convolution
2.3 Topology-learnable Graph Convolution
2.3.1 Graph Convolution
2.3.2 Graph Topology Analysis
2.3.3 Topology-learnable Graph Convolution
2.3.4 Topology-learnable GCNs
2.4 Experiments
2.4.1 Datasets
2.4.2 Ablation Study
2.4.3 Comparison with the State-of-the-art Methods
2.4.4 Discussion
2.5 Conclusion
Acknowledgements
References
Chapter 3 Recurrent Graph Convolutional Networks for Skeleton-based Action Recognition
3.1 Introduction
3.2 Related Work
3.2.1 Graph Convolution for Action Recognition
3.2.2 LSTM on Graphs
3.3 Recurrent Graph Convolutional Network
3.3.1 Graph Convolution
3.3.2 Adaptive Graph Convolution
3.3.3 Recurrent Graph Convolution
3.3.4 Recurrent Graph Convolutional Network
3.4 Experiments
3.4.1 Datasets
3.4.2 Training Details
3.4.3 Ablation Study
3.4.4 Comparison with the State-of-the-art Methods
3.4.5 Visualization of the Evolved Graph Topologies
3.5 Conclusion
Acknowledgements
References
Chapter 4 Graph-temporal LSTM Networks for Skeleton-based Action Recognition
4.1 Introduction
4.2 Related Work
4.3 GT-LSTM Networks
4.3.1 Pipeline Overview
4.3.2 Topology-learnable ST-GCN
4.3.3 GT-LSTM
4.3.4 GT-LSTM Networks
4.4 Experiments
4.4.1 Datasets
4.4.2 Training Details
4.4.3 Ablation Study
4.4.4 Comparison with the State-of-the-art Methods
4.5 Conclusion
References
Chapter 5 Spatio-temporal Interaction Graph Parsing Networks for Human-object Interaction Recognition
5.1 Introduction
5.2 Related Work
5.3 Overview
5.4 Proposed Approach
5.4.1 Video Feature Extraction
5.4.2 Spatio-temporal Interaction Graph Parsing
5.4.3 Inference
5.4.4 Implementation Details
5.5 Experiments
5.5.1 Dataset
5.5.2 Ablation Study
5.5.3 Comparison with the State-of-the-art Methods
5.5.4 Visualization of Parsed Graphs
5.6 Conclusion
Acknowledgements
References
Chapter 6 Learning Spatio-temporal Features Using 3DCNN and Convolutional LSTM for Gesture Recognition
6.1 Introduction
6.2 Related Work
6.3 Method
6.3.1 2D Spatio-temporal Feature Map Learning
6.3.2 Classification Based on the 2D Feature Maps
6.3.3 Network Training
6.4 Experiments
6.4.1 Datasets
6.4.2 Implementation
6.4.3 Architecture Analysis
6.4.4 Comparison with the State-of-the-art Methods
6.5 Conclusion
Acknowledgements
References
……
Chapter 7 Multimodal Gesture Recognition Using 3D Convolution and Convolutional LSTM
Chapter 8 Continuous Gesture Segmentation and Recognition Using 3DCNN and Convolutional LSTM
Chapter 9 Redundancy and Attention in Convolutional LSTM for Gesture Recognition
With the development of artificial intelligence, big data, sensors, and other technologies, industries such as the metaverse and robotics place ever-higher demands on multimodal natural human-computer interaction. In particular, in practical applications of service robots, virtual reality, and augmented reality, an intelligent agent needs an accurate understanding of human behavior and gesture expression. Only by moving from perception to cognition can the agent better understand human beings and respond in a natural, anthropomorphic way, thus enabling efficient human-computer interaction and promoting the adoption of artificial intelligence products in daily life. Accurate recognition of gestures and actions is therefore essential for human-computer interaction and can be applied in many fields. For example, accurate gesture recognition can support sign language recognition, enabling better communication with deaf people; it can also be applied to scenarios such as command and control. When an agent such as a robot can recognize human behaviors and actions, it can accurately infer human intentions and thus serve people better. For example, if a service robot accurately recognizes the behaviors of the elderly, it can provide timely and necessary assistance according to the actions they intend to perform. However, accurately modeling gesture and action information from sensor data presents several difficulties, which can be summarized as follows: (1) The collected data are incomplete due to differing camera viewpoints, which makes it difficult for purely data-driven modeling methods to accurately represent gesture and action features.
(2) Gestures are centered on the hand, but changes in the hand joints are small and fast, and therefore difficult to capture. (3) Actions often involve manipulating different objects and their state changes, and modeling these in a unified way is also a challenge. To address these problems, this book analyzes gestures and actions from multiple perspectives based on video and skeleton data, covering traditional codebook methods, topological graph-based data representation methods, and deep network-based feature extraction methods. At the same time, tailored methods are proposed for different tasks, such as single gesture/action recognition, continuous gesture recognition, and human-object interaction. We hope that the methods presented in this book will be helpful to those working in this field. With the continued development of deep learning, artificial intelligence, and related technologies, enabling intelligent agents to understand our society from a cognitive perspective, to understand people, and to serve us well is the ultimate goal of intelligent systems research. We also believe that, with continuing advances across many fields, humans and intelligent agents will be able to live in harmony, with humans understanding machines and machines understanding people.