Video Representation for Fine-grained Action Recognition

Video Representation for Fine-grained Action Recognition PDF Author: Yang Zhou
Publisher:
ISBN: 9781369057997
Category : High definition video recording
Languages : en
Pages : 108

Get Book Here

Book Description
Recently, fine-grained action analysis has raised a lot of research interests due to its potential applications in smart home, medical surveillance, daily living assist and child/elderly care, where action videos are captured indoor with fixed camera. Although background motion (i.e. one of main challenges for general action recognition) is more controlled compared to general action recognition, it is widely acknowledged that fine-grained action recognition is very challenging due to large intra-class variability, small inter-class variability, large variety of action categories, complex motions and complicated interactions. Fine-Grained actions, especially the manipulation sequences involve a large amount of interactions between hands and objects, therefore how to model the interactions between human hands and objects (i.e., context) plays an important role in action representation and recognition. We propose to discover the manipulated objects by human by modeling which objects are being manipulated and how they are being operated. Firstly, we propose a representation and classification pipeline which seamlessly incorporates localized semantic information into every processing step for fine-grained action recognition. In the feature extraction stage, we explore the geometric information between local motion features and the surrounding objects. In the feature encoding stage, we develop a semantic-grouped locality-constrained linear coding (SG-LLC) method that captures the joint distributions between motion and object-in-use information. Finally, we propose a semantic-aware multiple kernel learning framework (SA-MKL) by utilizing the empirical joint distribution between action and object type for more discriminative action classification. This approach can discover and model the inter- actions between human and objects. However, discovering the detailed knowledge of pre-detected objects (e.g. drawer and refrigerator). Thus, the performance of action recognition is constrained by object recognition, not to mention detection of objects requires tedious human labor for object annotation. Secondly, we propose a mid-level video representation to be suitable for fine-grained action classification. Given an input video sequence, we densely sample a large amount of spatio-temporal motion parts by temporal segmentation with spatial segmentation, and represent them with local motion features. The dense mid-level candidate parts are rich in localized motion information, which is crucial to fine-grained action recognition. From the candidate spatio-temporal parts, we perform an unsupervised approach to discover and learn the representative part detectors for final video representation. By utilizing the dense spatio-temporal motion parts, we highlight the human-object interactions and localized delicate motion in the local spatio-temporal sub-volume of the video. Thirdly, we propose a novel fine-grained action recognition pipeline by interaction part proposal and discriminative mid-level part mining. Firstly, we generate a large number of candidate object regions using off-the-shelf object proposal tool, e.g., BING. Secondly, these object regions are matched and tracked across frames to form a large spatio-temporal graph based on the appearance matching and the dense motion trajectories through them. We then propose an efficient approximate graph segmentation algorithm to partition and filter the graph into consistent local dense sub-graphs. These sub-graphs, which are spatio-temporal sub-volumes, represent our candidate interaction parts. Finally, we mine discriminative mid-level part detectors from the features computed over the candidate interaction parts. Bag-of-detection scores based on a novel Max-N pooling scheme are computed as the action representation for a video sample. Finally, we also focus on the first-view (egocentric) action recognition problem, which contains lots of hand-object interactions. On one hand, we propose a novel end-to-end trainable semantic parsing network for hand segmentation. On the other hand, we propose a second end-to-end deep convolutional network to maximally utilize the contextual information among hand, foreground object, and motion for interactional foreground object detection.

Video Representation for Fine-grained Action Recognition

Video Representation for Fine-grained Action Recognition PDF Author: Yang Zhou
Publisher:
ISBN: 9781369057997
Category : High definition video recording
Languages : en
Pages : 108

Get Book Here

Book Description
Recently, fine-grained action analysis has raised a lot of research interests due to its potential applications in smart home, medical surveillance, daily living assist and child/elderly care, where action videos are captured indoor with fixed camera. Although background motion (i.e. one of main challenges for general action recognition) is more controlled compared to general action recognition, it is widely acknowledged that fine-grained action recognition is very challenging due to large intra-class variability, small inter-class variability, large variety of action categories, complex motions and complicated interactions. Fine-Grained actions, especially the manipulation sequences involve a large amount of interactions between hands and objects, therefore how to model the interactions between human hands and objects (i.e., context) plays an important role in action representation and recognition. We propose to discover the manipulated objects by human by modeling which objects are being manipulated and how they are being operated. Firstly, we propose a representation and classification pipeline which seamlessly incorporates localized semantic information into every processing step for fine-grained action recognition. In the feature extraction stage, we explore the geometric information between local motion features and the surrounding objects. In the feature encoding stage, we develop a semantic-grouped locality-constrained linear coding (SG-LLC) method that captures the joint distributions between motion and object-in-use information. Finally, we propose a semantic-aware multiple kernel learning framework (SA-MKL) by utilizing the empirical joint distribution between action and object type for more discriminative action classification. This approach can discover and model the inter- actions between human and objects. However, discovering the detailed knowledge of pre-detected objects (e.g. drawer and refrigerator). Thus, the performance of action recognition is constrained by object recognition, not to mention detection of objects requires tedious human labor for object annotation. Secondly, we propose a mid-level video representation to be suitable for fine-grained action classification. Given an input video sequence, we densely sample a large amount of spatio-temporal motion parts by temporal segmentation with spatial segmentation, and represent them with local motion features. The dense mid-level candidate parts are rich in localized motion information, which is crucial to fine-grained action recognition. From the candidate spatio-temporal parts, we perform an unsupervised approach to discover and learn the representative part detectors for final video representation. By utilizing the dense spatio-temporal motion parts, we highlight the human-object interactions and localized delicate motion in the local spatio-temporal sub-volume of the video. Thirdly, we propose a novel fine-grained action recognition pipeline by interaction part proposal and discriminative mid-level part mining. Firstly, we generate a large number of candidate object regions using off-the-shelf object proposal tool, e.g., BING. Secondly, these object regions are matched and tracked across frames to form a large spatio-temporal graph based on the appearance matching and the dense motion trajectories through them. We then propose an efficient approximate graph segmentation algorithm to partition and filter the graph into consistent local dense sub-graphs. These sub-graphs, which are spatio-temporal sub-volumes, represent our candidate interaction parts. Finally, we mine discriminative mid-level part detectors from the features computed over the candidate interaction parts. Bag-of-detection scores based on a novel Max-N pooling scheme are computed as the action representation for a video sample. Finally, we also focus on the first-view (egocentric) action recognition problem, which contains lots of hand-object interactions. On one hand, we propose a novel end-to-end trainable semantic parsing network for hand segmentation. On the other hand, we propose a second end-to-end deep convolutional network to maximally utilize the contextual information among hand, foreground object, and motion for interactional foreground object detection.

Human Activity Recognition and Prediction

Human Activity Recognition and Prediction PDF Author: Yun Fu
Publisher: Springer
ISBN: 3319270044
Category : Technology & Engineering
Languages : en
Pages : 179

Get Book Here

Book Description
This book provides a unique view of human activity recognition, especially fine-grained human activity structure learning, human-interaction recognition, RGB-D data based action recognition, temporal decomposition, and causality learning in unconstrained human activity videos. The techniques discussed give readers tools that provide a significant improvement over existing methodologies of video content understanding by taking advantage of activity recognition. It links multiple popular research fields in computer vision, machine learning, human-centered computing, human-computer interaction, image classification, and pattern recognition. In addition, the book includes several key chapters covering multiple emerging topics in the field. Contributed by top experts and practitioners, the chapters present key topics from different angles and blend both methodology and application, composing a solid overview of the human activity recognition techniques.

Towards Action Recognition and Localization in Videos with Weakly Supervised Learning

Towards Action Recognition and Localization in Videos with Weakly Supervised Learning PDF Author: Nataliya Shapovalova
Publisher:
ISBN:
Category :
Languages : en
Pages : 102

Get Book Here

Book Description
Human behavior understanding is a fundamental problem of computer vision. It is an important component of numerous real-life applications, such as human-computer interaction, sports analysis, video search, and many others. In this thesis we work on the problem of action recognition and localization, which is a crucial part of human behavior understanding. Action recognition explains what a human is doing in the video, while action localization indicates where and when in the video the action is happening. We focus on two important aspects of the problem: (1) capturing intra-class variation of action categories and (2) inference of action location. Manual annotation of videos with fine-grained action labels and spatio-temporal action locations is a nontrivial task, thus employing weakly supervised learning approaches is of interest. Real-life actions are complex, and the same action can look different in different scenarios. A single template is not capable of capturing such data variability. Therefore, for each action category we automatically discover small clusters of examples that are visually similar to each other. A separate classifier is learnt for each cluster, so that more class variability is captured. In addition, we establish a direct association between a novel test example and examples from training data and demonstrate how metadata (e.g., attributes) can be transferred to test examples. Weakly supervised learning for action recognition and localization is another challenging task. It requires automatic inference of action location for all the training videos during learning. Initially, we simplify this problem and try to find discriminative regions in videos that lead to a better recognition performance. The regions are inferred in a manner such that they are visually similar across all the videos of the same category. Ideally, the regions should correspond to the action location; however, there is a gap between inferred discriminative regions and semantically meaningful regions representing action location. To fill the gap, we incorporate human eye gaze data to drive the inference of regions during learning. This allows inferring regions that are both discriminative and semantically meaningful. Furthermore, we use the inferred regions and learnt action model to assist top-down eye gaze prediction.

Computer Vision – ECCV 2022

Computer Vision – ECCV 2022 PDF Author: Shai Avidan
Publisher: Springer Nature
ISBN: 3031200802
Category : Computers
Languages : en
Pages : 815

Get Book Here

Book Description
The 39-volume set, comprising the LNCS books 13661 until 13699, constitutes the refereed proceedings of the 17th European Conference on Computer Vision, ECCV 2022, held in Tel Aviv, Israel, during October 23–27, 2022. The 1645 papers presented in these proceedings were carefully reviewed and selected from a total of 5804 submissions. The papers deal with topics such as computer vision; machine learning; deep neural networks; reinforcement learning; object recognition; image classification; image processing; object detection; semantic segmentation; human pose estimation; 3d reconstruction; stereo vision; computational photography; neural networks; image coding; image reconstruction; object recognition; motion estimation.

Gesture Recognition

Gesture Recognition PDF Author: Sergio Escalera
Publisher: Springer
ISBN: 3319570218
Category : Computers
Languages : en
Pages : 583

Get Book Here

Book Description
This book presents a selection of chapters, written by leading international researchers, related to the automatic analysis of gestures from still images and multi-modal RGB-Depth image sequences. It offers a comprehensive review of vision-based approaches for supervised gesture recognition methods that have been validated by various challenges. Several aspects of gesture recognition are reviewed, including data acquisition from different sources, feature extraction, learning, and recognition of gestures.

Advances in Computational Intelligence

Advances in Computational Intelligence PDF Author: Ignacio Rojas
Publisher: Springer
ISBN: 3030205185
Category : Computers
Languages : en
Pages : 926

Get Book Here

Book Description
This two-volume set LNCS 10305 and LNCS 10306 constitutes the refereed proceedings of the 15th International Work-Conference on Artificial Neural Networks, IWANN 2019, held at Gran Canaria, Spain, in June 2019. The 150 revised full papers presented in this two-volume set were carefully reviewed and selected from 210 submissions. The papers are organized in topical sections on machine learning in weather observation and forecasting; computational intelligence methods for time series; human activity recognition; new and future tendencies in brain-computer interface systems; random-weights neural networks; pattern recognition; deep learning and natural language processing; software testing and intelligent systems; data-driven intelligent transportation systems; deep learning models in healthcare and biomedicine; deep learning beyond convolution; artificial neural network for biomedical image processing; machine learning in vision and robotics; system identification, process control, and manufacturing; image and signal processing; soft computing; mathematics for neural networks; internet modeling, communication and networking; expert systems; evolutionary and genetic algorithms; advances in computational intelligence; computational biology and bioinformatics.

Computer Vision and Action Recognition

Computer Vision and Action Recognition PDF Author: Md. Atiqur Rahman Ahad
Publisher: Springer Science & Business Media
ISBN: 9491216201
Category : Computers
Languages : en
Pages : 228

Get Book Here

Book Description
Human action analyses and recognition are challenging problems due to large variations in human motion and appearance, camera viewpoint and environment settings. The field of action and activity representation and recognition is relatively old, yet not well-understood by the students and research community. Some important but common motion recognition problems are even now unsolved properly by the computer vision community. However, in the last decade, a number of good approaches are proposed and evaluated subsequently by many researchers. Among those methods, some methods get significant attention from many researchers in the computer vision field due to their better robustness and performance. This book will cover gap of information and materials on comprehensive outlook – through various strategies from the scratch to the state-of-the-art on computer vision regarding action recognition approaches. This book will target the students and researchers who have knowledge on image processing at a basic level and would like to explore more on this area and do research. The step by step methodologies will encourage one to move forward for a comprehensive knowledge on computer vision for recognizing various human actions.

Computer Vision – ECCV 2016

Computer Vision – ECCV 2016 PDF Author: Bastian Leibe
Publisher: Springer
ISBN: 3319464876
Category : Computers
Languages : en
Pages : 909

Get Book Here

Book Description
The eight-volume set comprising LNCS volumes 9905-9912 constitutes the refereed proceedings of the 14th European Conference on Computer Vision, ECCV 2016, held in Amsterdam, The Netherlands, in October 2016. The 415 revised papers presented were carefully reviewed and selected from 1480 submissions. The papers cover all aspects of computer vision and pattern recognition such as 3D computer vision; computational photography, sensing and display; face and gesture; low-level vision and image processing; motion and tracking; optimization methods; physics-based vision, photometry and shape-from-X; recognition: detection, categorization, indexing, matching; segmentation, grouping and shape representation; statistical methods and learning; video: events, activities and surveillance; applications. They are organized in topical sections on detection, recognition and retrieval; scene understanding; optimization; image and video processing; learning; action activity and tracking; 3D; and 9 poster sessions.

Contextual Visual Recognition from Images and Videos

Contextual Visual Recognition from Images and Videos PDF Author: Georgia Gkioxari
Publisher:
ISBN:
Category :
Languages : en
Pages : 77

Get Book Here

Book Description
Object recognition from images and videos has been a topic of great interest in the computer vision community. Its success directly impacts a wide variety of real-world applications; from surveillance and health care to self-driving cars and online shopping. Objects exhibit organizational structure in their real-world setting (Biederman et al., 1982). Contextual reasoning is part of human's visual understanding and has been modeled by various efforts in computer vision in the past (Torralba, 2001). Recently, object recognition has reached a new peak with the help of deep learning. State-of-the-art object recognition systems use convolutional neural networks (CNNs) to classify regions of interest in an image. The visual cues extracted for each region are limited to the content of the region and ignore the contextual information from the scene. So the question remains, how can we enhance convolutional neural networks with contextual reasoning to improve recognition? Work presented in this manuscript shows how contextual cues conditioned on the scene and the object can improve CNNs' ability to recognize difficult, highly contextual objects from images. Turning to the most interesting object of all, people, contextual reasoning is a key for the fine-grained tasks of action and attribute recognition. Here, we demonstrate the importance of extracting cues in an instance-specific and category-specific manner tied to the task in question. Finally, we study motion which captures the change in shape and appearance in time and is a way to extract dynamic contextual cues. We show that coupling motion with the complementary signal of static visual appearance leads to a very effective representation for action recognition from videos.

Motion History Images for Action Recognition and Understanding

Motion History Images for Action Recognition and Understanding PDF Author: Md. Atiqur Rahman Ahad
Publisher: Springer Science & Business Media
ISBN: 1447147308
Category : Computers
Languages : en
Pages : 132

Get Book Here

Book Description
Human action analysis and recognition is a relatively mature field, yet one which is often not well understood by students and researchers. The large number of possible variations in human motion and appearance, camera viewpoint, and environment, present considerable challenges. Some important and common problems remain unsolved by the computer vision community. However, many valuable approaches have been proposed over the past decade, including the motion history image (MHI) method. This method has received significant attention, as it offers greater robustness and performance than other techniques. This work presents a comprehensive review of these state-of-the-art approaches and their applications, with a particular focus on the MHI method and its variants.