Spatio-temporal Human Action Detection and Instance Segmentation in Videos

Author: Suman Saha
Publisher:
ISBN:
Category:
Languages: en
Pages: 194

Book Description

Human Action Detection, Tracking and Segmentation in Videos

Author: Yicong Tian
Publisher:
ISBN:
Category:
Languages: en
Pages: 94

Book Description
This dissertation addresses the problems of human action detection, human tracking and segmentation in videos. These are fundamental tasks in computer vision and are extremely challenging to solve in realistic videos. We first propose a novel approach for action detection by generalizing deformable part models from 2D images to 3D spatiotemporal volumes. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and are robust to clutter. This approach handles actions performed by a single person. When there are multiple humans in the scene, humans need to be segmented and tracked from frame to frame before action recognition can be performed. Next, we propose a novel approach for multiple object tracking (MOT) that formulates detection and data association in one framework. Our method overcomes a limitation of data-association-based MOT approaches, whose performance depends on the object detection results provided as input. We show that detecting and tracking targets in a single framework helps resolve the ambiguities caused by frequent occlusion and heavy articulation of targets. In this tracker, targets are represented by bounding boxes, which is a coarse representation. However, pixel-wise object segmentation provides fine-level information, which is desirable for later tasks. Finally, we propose a tracker that simultaneously solves three main problems: detection, data association and segmentation. This is especially important because the outputs of these three problems are highly correlated, and the solution of one can greatly help improve the others. The proposed approach achieves more accurate segmentation results and also helps resolve typical difficulties in multiple target tracking, such as occlusion, ID switches and track drift.
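To make the criticized limitation concrete, here is a minimal sketch (not the author's method) of a detection-first, data-association-based tracker: per-frame detections are fixed at input and greedily linked by overlap, so any detector miss propagates directly into the tracks. The box format, threshold, and greedy matching are illustrative assumptions.

```python
# Minimal sketch of the detection-first MOT baseline discussed above.
# Assumptions (not from the thesis): boxes are (x1, y1, x2, y2) tuples,
# tracks are lists of boxes, and association is greedy IoU matching.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, thresh=0.3):
    """Extend each track with its best-overlapping detection.
    Detections the detector missed can never be recovered here;
    this is exactly the dependence a joint formulation removes."""
    unmatched = list(detections)
    for track in tracks:
        if not unmatched:
            break
        best = max(unmatched, key=lambda d: iou(track[-1], d))
        if iou(track[-1], best) >= thresh:
            track.append(best)
            unmatched.remove(best)
    tracks.extend([[d] for d in unmatched])  # leftover boxes start new tracks
    return tracks
```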

Video Representation for Fine-grained Action Recognition

Author: Yang Zhou
Publisher:
ISBN: 9781369057997
Category: High definition video recording
Languages: en
Pages: 108

Book Description
Recently, fine-grained action analysis has attracted a lot of research interest due to its potential applications in smart homes, medical surveillance, daily-living assistance and child/elderly care, where action videos are captured indoors with a fixed camera. Although background motion (one of the main challenges in general action recognition) is more controlled in this setting, fine-grained action recognition is widely acknowledged to be very challenging due to large intra-class variability, small inter-class variability, a large variety of action categories, complex motions and complicated interactions. Fine-grained actions, especially manipulation sequences, involve a large amount of interaction between hands and objects, so modeling the interactions between human hands and objects (i.e., context) plays an important role in action representation and recognition. We propose to discover the objects manipulated by humans by modeling which objects are being manipulated and how they are being operated.

First, we propose a representation and classification pipeline which seamlessly incorporates localized semantic information into every processing step for fine-grained action recognition. In the feature extraction stage, we explore the geometric information between local motion features and the surrounding objects. In the feature encoding stage, we develop a semantic-grouped locality-constrained linear coding (SG-LLC) method that captures the joint distributions between motion and object-in-use information. Finally, we propose a semantic-aware multiple kernel learning framework (SA-MKL) that utilizes the empirical joint distribution between action and object type for more discriminative action classification. This approach can discover and model the interactions between humans and objects; however, it relies on detailed knowledge of pre-detected objects (e.g., drawer and refrigerator). Thus, the performance of action recognition is constrained by object recognition, not to mention that object detection requires tedious human labor for annotation.

Second, we propose a mid-level video representation suitable for fine-grained action classification. Given an input video sequence, we densely sample a large number of spatio-temporal motion parts by combining temporal and spatial segmentation, and represent them with local motion features. The dense mid-level candidate parts are rich in localized motion information, which is crucial to fine-grained action recognition. From these candidate spatio-temporal parts, we apply an unsupervised approach to discover and learn representative part detectors for the final video representation. By utilizing the dense spatio-temporal motion parts, we highlight the human-object interactions and the localized, delicate motion in local spatio-temporal sub-volumes of the video.

Third, we propose a novel fine-grained action recognition pipeline based on interaction part proposal and discriminative mid-level part mining. We first generate a large number of candidate object regions using an off-the-shelf object proposal tool, e.g., BING. These object regions are then matched and tracked across frames to form a large spatio-temporal graph, based on appearance matching and the dense motion trajectories passing through them. We then propose an efficient approximate graph segmentation algorithm to partition and filter the graph into consistent local dense sub-graphs. These sub-graphs, which are spatio-temporal sub-volumes, represent our candidate interaction parts. Finally, we mine discriminative mid-level part detectors from the features computed over the candidate interaction parts. Bag-of-detection scores based on a novel Max-N pooling scheme are computed as the action representation for a video sample.

Finally, we also address the first-person (egocentric) action recognition problem, which involves many hand-object interactions. On one hand, we propose a novel end-to-end trainable semantic parsing network for hand segmentation. On the other hand, we propose a second end-to-end deep convolutional network that maximally utilizes the contextual information among hand, foreground object, and motion for interactional foreground object detection.
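The Max-N pooling step admits a simple reading: each mined part detector is scored against every candidate interaction part in a video, and the detector's top-N responses are averaged into one dimension of the bag-of-detections representation. A minimal sketch under that reading (the array shapes and the value of N are illustrative assumptions, not taken from the thesis):

```python
import numpy as np

def max_n_pooling(scores, n=5):
    """Pool a (num_detectors, num_candidate_parts) score matrix into a
    single vector by averaging each detector's n highest responses.
    With n = 1 this reduces to plain max pooling; a larger n is more
    robust to one spuriously high detection. The n used in the thesis
    is not stated here, so 5 is a placeholder."""
    scores = np.asarray(scores, dtype=float)
    n = min(n, scores.shape[1])
    top_n = np.sort(scores, axis=1)[:, -n:]  # n best scores per detector
    return top_n.mean(axis=1)                # one entry per part detector

# Example: 3 part detectors scored on 6 candidate interaction parts.
video_representation = max_n_pooling(np.random.rand(3, 6), n=2)  # shape (3,)
```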

Spatio-temporal Modeling for Action Recognition in Videos

Author: Guoxi Huang
Publisher:
ISBN:
Category:
Languages: en
Pages: 0

Book Description


Action Recognition, Temporal Localization and Detection in Trimmed and Untrimmed Videos

Author: Rui Hou
Publisher:
ISBN:
Category:
Languages: en
Pages: 107

Book Description
Automatic understanding of videos is one of the most active areas of computer vision research. It has applications in video surveillance, human-computer interaction, sports video analysis, virtual and augmented reality, video retrieval, etc. In this dissertation, we address four important tasks in video understanding, namely action recognition, temporal action localization, spatio-temporal action detection and video object/action segmentation, and make the following contributions. First, for video action recognition, we propose a category-level feature learning method. Our method automatically identifies pairs of similar categories using a criterion of mutual pairwise proximity in the (kernelized) feature space, together with a category-level similarity matrix in which each entry corresponds to the one-vs-one SVM margin for a pair of categories. Second, for temporal action localization, we exploit the temporal structure of actions by modeling an action as a sequence of sub-actions and present a computationally efficient approach. Third, we propose a Tube Convolutional Neural Network (T-CNN) based pipeline for action detection. The proposed architecture is a unified deep network that recognizes and localizes actions based on 3D convolutional features, generalizing the popular Faster R-CNN framework from images to videos. Last, we propose an end-to-end encoder-decoder 3D convolutional neural network pipeline that segments the foreground objects from the background; the action label can then be obtained by passing the foreground object into an action classifier. Extensive experiments on several video datasets demonstrate the superior performance of the proposed approaches for video understanding compared to the state of the art.
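To illustrate the generalization from Faster R-CNN to videos: where Faster R-CNN pools each 2D region proposal to a fixed-size feature grid before classification, a tube-based detector must pool a spatio-temporal tube (one box per frame) of 3D convolutional features to a fixed size. The sketch below shows one such tube-pooling scheme; the grid sizes and max-pooling choices are illustrative assumptions, not the exact T-CNN operator.

```python
import numpy as np

def tube_pool(features, tube, out_t=4, out_hw=(4, 4)):
    """Max-pool the features inside a spatio-temporal tube to a fixed
    (C, out_t, oh, ow) size. `features` is a (C, T, H, W) volume of 3D
    conv activations; `tube` is one (x1, y1, x2, y2) box per frame.
    Assumes each box spans at least out_hw cells and the tube has at
    least out_t frames. A generic sketch, not the exact T-CNN operator."""
    oh, ow = out_hw

    def pool_plane(region):  # (C, h, w) -> (C, oh, ow)
        out = np.empty((region.shape[0], oh, ow))
        rows = np.array_split(np.arange(region.shape[1]), oh)
        cols = np.array_split(np.arange(region.shape[2]), ow)
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                out[:, i, j] = region[:, r[0]:r[-1] + 1, c[0]:c[-1] + 1].max(axis=(1, 2))
        return out

    planes = np.stack([pool_plane(features[:, t, y1:y2, x1:x2])
                       for t, (x1, y1, x2, y2) in enumerate(tube)], axis=1)
    chunks = np.array_split(np.arange(len(tube)), out_t)  # temporal bins
    return np.stack([planes[:, c].max(axis=1) for c in chunks], axis=1)

# Example: a 5-frame tube over a (64, 5, 32, 32) feature volume.
feats = np.random.rand(64, 5, 32, 32)
tube = [(4, 4, 20, 24)] * 5
pooled = tube_pool(feats, tube)  # shape (64, 4, 4, 4)
```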

Segmentation of Video Into Spatio-temporal Objects

Author: Gerrit Polder
Publisher:
ISBN: 9789057760631
Category:
Languages: en
Pages: 133

Book Description


Spatio-temporal Segmentation of Video Data

Author: John Yu An Wang
Publisher:
ISBN:
Category:
Languages: en
Pages: 12

Book Description


Comparative Evaluation of Local Spatio-temporal Features for Human Action Recognition

Author: Divye Kumar
Publisher:
ISBN:
Category: Human behavior
Languages: en
Pages: 150

Book Description


Analysis of Human-centric Activities in Video Via Qualitative Spatio-temporal Reasoning

Author: Hajar Sadeghi Sokeh
Publisher:
ISBN:
Category:
Languages: en
Pages: 0

Book Description
Applying qualitative spatio-temporal reasoning to video analysis is now a very active research topic in computer vision and artificial intelligence. Among all video analysis applications, monitoring and understanding human activities is of great interest. Many human activities can be understood by analysing the interactions between objects in space and time. Qualitative spatio-temporal reasoning encapsulates information that is useful for analysing human-centric videos; this information can be represented in a very compact form as qualitative spatio-temporal relationships between objects of interest. This thesis focuses on three aspects of interpreting human-centric videos: first, introducing a representation of interactions between objects of interest; second, determining which objects in the scene are relevant to the activity; and third, recognising human actions by applying the proposed representation model to human body joints and body parts. As a first contribution, we present an accurate and comprehensive model, called "AngledCORE-9", for representing several aspects of space over time in videos; it is a modified version of CORE-9 (proposed by Cohn et al. [2012]). This model is as efficient as CORE-9 and allows us to extract spatial information with much higher accuracy than previously possible. We evaluate our new knowledge representation on a real video dataset by performing action clustering. Our next contribution is a model for separating the objects relevant to a human action from the irrelevant ones. The chief issue in recognising different human actions in videos using spatio-temporal features is that there are usually many moving objects in the scene, and no existing method can reliably find the objects involved in the activity. The output of our system is a list of tracks for all possible objects in the video, with each track's probability of being involved in the activity; the track with the highest probability is most likely to be the object with which the person is interacting. Knowing the objects involved in an activity is very advantageous, since it can be used to improve the action recognition rate. Finally, instead of looking at human-object interactions, we consider skeleton joints as the points of interest. Working on joints provides more information about how a person moves to perform an activity. In this part of the thesis, we use videos with 3D human skeletons captured by Kinect (the MSR3D-action dataset). We use our proposed "AngledCORE-9" model to extract features and describe their temporal variation frame by frame, and we compare our results against some recent works on the same dataset.
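As a flavour of what a qualitative spatio-temporal relationship looks like in code, the sketch below reduces a pair of bounding boxes to a coarse topological label plus a quantized direction between their centres; tracking that pair of symbols per frame yields a compact temporal signature. This only illustrates the general CORE-9 style of representation, not the AngledCORE-9 model itself, and the 8-sector quantization is an assumption.

```python
import math

def qualitative_relation(a, b):
    """Reduce two boxes (x1, y1, x2, y2) to a (topology, direction) pair:
    a coarse topological label plus the angle from a's centre to b's
    centre, quantized into 8 sectors. Illustrative only; not AngledCORE-9."""
    disjoint = a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]
    inside = b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]
    topology = "disjoint" if disjoint else ("inside" if inside else "overlap")

    ax, ay = (a[0] + a[2]) / 2.0, (a[1] + a[3]) / 2.0
    bx, by = (b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0
    angle = math.degrees(math.atan2(by - ay, bx - ax)) % 360.0
    sector = int((angle + 22.5) // 45) % 8  # 0 = east, 2 = south in image coords
    return topology, sector

# The per-frame sequence of (topology, sector) symbols for a hand/object
# pair is the kind of compact feature such representations build on.
hand, cup = (10, 10, 30, 40), (25, 20, 60, 50)
print(qualitative_relation(hand, cup))  # ('overlap', 1)
```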

Spatiotemporal Graphs for Object Segmentation and Human Pose Estimation in Videos

Author: Dong Zhang
Publisher:
ISBN:
Category:
Languages: en
Pages: 128

Book Description
Images and videos can be naturally represented by graphs: spatial graphs for images and spatiotemporal graphs for videos. For different applications, however, there are different formulations of these graphs, and the algorithms for each formulation have different complexities. Wisely formulating a problem so as to ensure an accurate and efficient solution is therefore one of the core issues in computer vision research. We explore three problems in this domain to demonstrate how to formulate each of them in terms of spatiotemporal graphs and obtain good, efficient solutions. The first problem we explore is video object segmentation, where the goal is to segment the primary moving objects in a video. This problem is important for many applications, such as content-based video retrieval, video summarization, activity understanding and targeted content replacement. In our framework, we use object proposals, which are object-like regions obtained from low-level visual cues. Each object proposal has an associated objectness score, which indicates how likely the proposal is to correspond to an object. The problem is formulated as a directed acyclic graph in which nodes represent the object proposals and edges represent the spatiotemporal relationships between nodes. A dynamic programming solution is employed to select one object proposal from each video frame, while ensuring their consistency throughout the video. Gaussian mixture models (GMMs) are used for modeling the background and foreground, and Markov random fields (MRFs) are employed to smooth the pixel-level segmentation.
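The DAG formulation lends itself to a short dynamic-programming sketch: one node per proposal per frame, edge weights expressing spatio-temporal consistency between proposals in consecutive frames, and a single pass over the frames that selects the highest-scoring path. The `consistency` function below (e.g. appearance similarity plus overlap) is a stand-in for whatever edge weight the method actually uses; the sketch is illustrative, not the author's implementation.

```python
import numpy as np

def select_proposals(proposals, objectness, consistency):
    """Pick one object proposal per frame, maximizing the summed
    objectness scores plus pairwise consistency between consecutive
    selections. `proposals[t]` is a list of proposals for frame t,
    `objectness[t][j]` the score of proposal j in frame t, and
    `consistency(p, q)` a stand-in edge weight between proposals
    in consecutive frames."""
    dp = [np.asarray(objectness[0], dtype=float)]  # best path value per node
    back = []                                      # backpointers per frame
    for t in range(1, len(proposals)):
        prev, cur, ptr = dp[-1], [], []
        for q in proposals[t]:
            gains = [prev[i] + consistency(p, q)
                     for i, p in enumerate(proposals[t - 1])]
            i_best = int(np.argmax(gains))
            ptr.append(i_best)
            cur.append(gains[i_best])
        dp.append(np.asarray(cur) + np.asarray(objectness[t], dtype=float))
        back.append(ptr)
    # Trace the best path back through the DAG.
    j = int(np.argmax(dp[-1]))
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return list(reversed(path))  # index of the chosen proposal in each frame
```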