Towards Action Recognition and Localization in Videos with Weakly Supervised Learning

Author: Nataliya Shapovalova
Publisher:
ISBN:
Category :
Languages : en
Pages : 102

Book Description
Human behavior understanding is a fundamental problem of computer vision. It is an important component of numerous real-life applications, such as human-computer interaction, sports analysis, and video search. In this thesis we work on the problem of action recognition and localization, which is a crucial part of human behavior understanding. Action recognition explains what a human is doing in a video, while action localization indicates where and when in the video the action is happening. We focus on two important aspects of the problem: (1) capturing the intra-class variation of action categories and (2) inferring action location. Manual annotation of videos with fine-grained action labels and spatio-temporal action locations is a nontrivial task; employing weakly supervised learning approaches is therefore of interest. Real-life actions are complex, and the same action can look different in different scenarios. A single template cannot capture such data variability. Therefore, for each action category we automatically discover small clusters of examples that are visually similar to each other. A separate classifier is learnt for each cluster, so that more class variability is captured. In addition, we establish a direct association between a novel test example and examples from the training data, and demonstrate how metadata (e.g., attributes) can be transferred to test examples. Weakly supervised learning for action recognition and localization is another challenging task: it requires automatic inference of the action location in all training videos during learning. Initially, we simplify this problem and look for discriminative regions in videos that lead to better recognition performance. The regions are inferred so that they are visually similar across all videos of the same category.
Ideally, the regions should correspond to the action location; however, there is a gap between inferred discriminative regions and semantically meaningful regions representing the action location. To close this gap, we incorporate human eye gaze data to drive the inference of regions during learning. This allows inferring regions that are both discriminative and semantically meaningful. Furthermore, we use the inferred regions and the learnt action model to assist top-down eye gaze prediction.
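The multi-template idea in the description above can be sketched in a few lines. This is a hypothetical illustration rather than the thesis's actual model (the functions, features, and data below are all invented): instead of one template per action category, each category keeps several cluster templates, and a test example is scored by its best-matching template.

```python
import numpy as np

def cluster_templates(features, k, iters=20):
    """Group training examples of one action category into k visually
    similar clusters (plain k-means) and return one template -- the
    mean feature vector -- per cluster."""
    # deterministic init: k examples spread evenly over the data
    centroids = features[np.linspace(0, len(features) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign every example to its nearest template, then re-estimate
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = features[labels == j].mean(0)
    return centroids

def category_score(x, templates):
    """Score a test example against a category as the best cosine
    similarity over that category's cluster templates."""
    sims = templates @ x / (np.linalg.norm(templates, axis=1)
                            * np.linalg.norm(x) + 1e-8)
    return sims.max()

# toy data: one category whose examples fall into two visual "modes"
rng = np.random.default_rng(1)
mode_a = rng.normal(loc=[5.0, 0.0], scale=0.1, size=(20, 2))
mode_b = rng.normal(loc=[0.0, 5.0], scale=0.1, size=(20, 2))
templates = cluster_templates(np.vstack([mode_a, mode_b]), k=2)
```

A single mean template of this toy category would sit between the two modes and match neither well; taking the maximum over per-cluster templates scores examples from either mode highly, which is exactly the variability argument the description makes.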

Spatiotemporal Representation Learning For Human Action Recognition And Localization

Author: Alaaeldin Ali
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description
Human action understanding from videos is one of the foremost challenges in computer vision. It is the cornerstone of many applications, such as human-computer interaction and automatic surveillance. The current state-of-the-art methods for action recognition and localization mostly rely on deep learning. In spite of their strong performance, deep learning approaches require a huge amount of labeled training data. Furthermore, standard action recognition pipelines rely on independent optical flow estimators, which increase their computational cost. We propose two approaches to improve these aspects. First, we develop a novel method for efficient, real-time action localization in videos that achieves performance on par with or better than other, more computationally expensive methods. Second, we present a self-supervised learning approach for spatiotemporal feature learning that does not require any annotations. We demonstrate that features learned by our method provide a very strong prior for the downstream task of action recognition.

Computer Vision – ACCV 2020

Author: Hiroshi Ishikawa
Publisher: Springer Nature
ISBN: 3030695417
Category : Computers
Languages : en
Pages : 718

Book Description
The six-volume set of LNCS 12622-12627 constitutes the proceedings of the 15th Asian Conference on Computer Vision, ACCV 2020, held in Kyoto, Japan, in November/December 2020 (the conference was held virtually). The total of 254 contributions was carefully reviewed and selected from 768 submissions during two rounds of reviewing and improvement. The papers focus on the following topics. Part I: 3D computer vision; segmentation and grouping. Part II: low-level vision, image processing; motion and tracking. Part III: recognition and detection; optimization, statistical methods, and learning; robot vision. Part IV: deep learning for computer vision; generative models for computer vision. Part V: face, pose, action, and gesture; video analysis and event recognition; biomedical image analysis. Part VI: applications of computer vision; vision for X; datasets and performance analysis.

Towards Semi-supervised Video Action Recognition

Author: Zexi Chen
Publisher:
ISBN:
Category :
Languages : en
Pages : 66

Book Description


A Study of Localization and Latency Reduction for Action Recognition

Author: Syed Zain Masood
Publisher:
ISBN:
Category :
Languages : en
Pages : 99

Book Description
High latency causes a system's feedback to lag behind user actions and thus significantly degrades the interactivity of the user experience. With a slight modification to the weakly supervised probabilistic model we proposed for action localization, we show how it can be used to reduce latency when recognizing actions in Human-Computer Interaction (HCI) environments. This latency-aware learning formulation trains a logistic regression-based classifier that automatically determines distinctive canonical poses from the data and uses these to robustly recognize actions in the presence of ambiguous poses. We introduce a novel (publicly released) dataset for the purpose of our experiments. Comparisons of our method against both a Bag-of-Words and a Conditional Random Field (CRF) classifier show improved recognition performance for both pre-segmented and online classification tasks.
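The latency-reduction idea above (commit to a label from distinctive poses before the clip ends) can be sketched as follows. This is a hedged toy stand-in, not the authors' probabilistic model: frames are reduced to indices into a small canonical-pose vocabulary (all names and data invented), a logistic regression is trained on pose-count histograms, and an online rule answers as soon as the accumulated evidence clears a margin.

```python
import numpy as np

def train_pose_classifier(X, y, lr=0.2, epochs=500):
    """Binary logistic regression on pose-count histograms.
    X: (n, p) counts of each canonical pose per clip, y in {0, 1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        g = p - y                                # gradient of the log loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def online_decision(pose_stream, w, b, margin=2.0):
    """Accumulate pose evidence frame by frame and return (label,
    frames_seen) as soon as the logit clears the margin -- i.e., give a
    confident answer before the clip is over."""
    h = np.zeros(len(w))
    for t, pose in enumerate(pose_stream):
        h[pose] += 1
        z = h @ w + b
        if abs(z) > margin:
            return int(z > 0), t + 1
    return int((h @ w + b) > 0), len(pose_stream)

# toy data: 3 canonical poses; class 1 favors pose 0, class 0 favors pose 2
X = np.array([[5, 1, 0], [6, 2, 1], [4, 0, 1],
              [0, 1, 5], [1, 2, 6], [1, 0, 4]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])
w, b = train_pose_classifier(X, y)
```

Ambiguous poses (pose 1 here, which both classes exhibit equally) receive near-zero weight, so they neither trigger nor delay a decision, while streams dominated by a distinctive pose are classified after only a handful of frames.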

Computer Vision – ECCV 2020

Author: Andrea Vedaldi
Publisher: Springer Nature
ISBN: 3030585263
Category : Computers
Languages : en
Pages : 844

Book Description
The 30-volume set, comprising LNCS volumes 12346 to 12375, constitutes the refereed proceedings of the 16th European Conference on Computer Vision, ECCV 2020, which was planned to be held in Glasgow, UK, during August 23-28, 2020. The conference was held virtually due to the COVID-19 pandemic. The 1360 revised papers presented in these proceedings were carefully reviewed and selected from a total of 5025 submissions. The papers deal with topics such as computer vision; machine learning; deep neural networks; reinforcement learning; object recognition; image classification; image processing; object detection; semantic segmentation; human pose estimation; 3D reconstruction; stereo vision; computational photography; neural networks; image coding; image reconstruction; and motion estimation.

Computer Vision – ECCV 2018

Author: Vittorio Ferrari
Publisher: Springer
ISBN: 3030012700
Category : Computers
Languages : en
Pages : 871

Book Description
The sixteen-volume set comprising the LNCS volumes 11205-11220 constitutes the refereed proceedings of the 15th European Conference on Computer Vision, ECCV 2018, held in Munich, Germany, in September 2018. The 776 revised papers presented were carefully reviewed and selected from 2439 submissions. The papers are organized in topical sections on learning for vision; computational photography; human analysis; human sensing; stereo and reconstruction; optimization; matching and recognition; video attention; and poster sessions.

Computer Vision – ECCV 2022

Author: Shai Avidan
Publisher: Springer Nature
ISBN: 3031197720
Category : Computers
Languages : en
Pages : 801

Book Description
The 39-volume set, comprising LNCS volumes 13661 to 13699, constitutes the refereed proceedings of the 17th European Conference on Computer Vision, ECCV 2022, held in Tel Aviv, Israel, during October 23–27, 2022. The 1645 papers presented in these proceedings were carefully reviewed and selected from a total of 5804 submissions. The papers deal with topics such as computer vision; machine learning; deep neural networks; reinforcement learning; object recognition; image classification; image processing; object detection; semantic segmentation; human pose estimation; 3D reconstruction; stereo vision; computational photography; neural networks; image coding; image reconstruction; and motion estimation.

Action Recognition, Temporal Localization and Detection in Trimmed and Untrimmed Videos

Author: Rui Hou
Publisher:
ISBN:
Category :
Languages : en
Pages : 107

Book Description
Automatic understanding of videos is one of the most active areas of computer vision research. It has applications in video surveillance, human-computer interaction, sports video analysis, virtual and augmented reality, video retrieval, etc. In this dissertation, we address four important tasks in video understanding, namely action recognition, temporal action localization, spatio-temporal action detection, and video object/action segmentation, and make the following contributions. First, for video action recognition, we propose a category-level feature learning method. It automatically identifies closely related pairs of categories using a criterion of mutual pairwise proximity in the (kernelized) feature space, together with a category-level similarity matrix in which each entry corresponds to the one-vs-one SVM margin for a pair of categories. Second, for temporal action localization, we propose to exploit the temporal structure of actions by modeling an action as a sequence of sub-actions, and we present a computationally efficient approach. Third, we propose a 3D Tube Convolutional Neural Network (TCNN) based pipeline for action detection. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolutional features; it generalizes the popular Faster R-CNN framework from images to videos. Last, an end-to-end encoder-decoder 3D convolutional neural network pipeline is proposed that segments foreground objects from the background; the action label can then be obtained by passing the foreground object to an action classifier. Extensive experiments on several video datasets demonstrate the superior performance of the proposed approach for video understanding compared to the state of the art.
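The sub-action idea in the second contribution above can be illustrated with a small dynamic program. This is a generic sketch under invented scores, not the dissertation's actual formulation: given per-frame scores for K ordered sub-actions, it finds the contiguous segment boundaries that maximize the total score of the sequence.

```python
import numpy as np

def best_subaction_segmentation(frame_scores):
    """Split a window of T frames into K ordered, contiguous, nonempty
    sub-action segments maximizing the summed per-frame scores.
    frame_scores: (T, K) array, score of frame t under sub-action k.
    Returns (total_score, list of (start, end) frame ranges)."""
    T, K = frame_scores.shape
    # prefix sums so any segment's score under one sub-action is O(1)
    prefix = np.vstack([np.zeros(K), np.cumsum(frame_scores, axis=0)])
    best = np.full((T + 1, K + 1), -np.inf)
    back = np.zeros((T + 1, K + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, K + 1):
        for t in range(k, T + 1):
            for s in range(k - 1, t):  # sub-action k covers frames s..t-1
                cand = best[s, k - 1] + prefix[t, k - 1] - prefix[s, k - 1]
                if cand > best[t, k]:
                    best[t, k], back[t, k] = cand, s
    bounds, t = [], T
    for k in range(K, 0, -1):  # walk the backpointers to recover segments
        s = back[t, k]
        bounds.append((s, t))
        t = s
    return best[T, K], bounds[::-1]
```

The triple loop is O(T²K), which is fine for a sketch; sliding this scorer over candidate windows of a long video would give a simple temporal localizer in the spirit described above.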

Learning to Recognize Actions with Weak Supervision

Author: Nicolas Chesneau
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
With the rapid growth of digital video content, automatic video understanding has become an increasingly important task. Video understanding spans several applications, such as web-video content analysis, autonomous vehicles, and human-machine interfaces (e.g., Kinect). This thesis makes contributions addressing two major problems in video understanding: webly-supervised action detection and human action localization. Webly-supervised action recognition aims to learn actions from video content on the internet, with no additional supervision. We propose a novel approach in this context, which leverages the synergy between visual video data and the associated textual metadata to learn event classifiers with no manual annotations. Specifically, we first collect a video dataset with queries constructed automatically from textual descriptions of events, prune irrelevant videos using text and video data, and then learn the corresponding event classifiers. We show the importance of both main steps of our method, i.e., query generation and data pruning, with quantitative results. We evaluate this approach in the challenging setting where no manually annotated training set is available, i.e., EK0 in the TRECVID challenge, and show state-of-the-art results on the MED 2011 and 2013 datasets.
In the second part of the thesis, we focus on human action localization, which involves recognizing actions that occur in a video, such as "drinking" or "phoning", as well as their spatial and temporal extent. We propose a new person-centric framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations. The motivation is two-fold. First, it allows us to handle occlusions and camera viewpoint changes when localizing people, as it infers full-body localization. Second, it provides a better reference grid for extracting action information than standard human tubes, i.e., tubes which frame visible parts only. This is achieved by training a novel human part detector that scores visible parts while regressing full-body bounding boxes, even when they lie outside the frame. The core of our method is a convolutional neural network which learns part proposals specific to certain body parts. These are then combined to detect people robustly in each frame. Our tracking algorithm connects the image detections temporally to extract full-body human tubes. We evaluate our new tube extraction method on a recent challenging dataset, DALY, showing state-of-the-art results.
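The tube-extraction step described above (connecting per-frame person detections temporally into spatio-temporal tubes) can be sketched with a greedy linker. This is a toy stand-in for the thesis's tracking algorithm: per-frame boxes and scores are assumed to come from a detector, and temporal continuity is modeled as nothing more than overlap plus detector score.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def link_tube(detections_per_frame):
    """Greedily link per-frame (box, score) detections into one
    spatio-temporal tube: start from the highest-scoring box in the
    first frame, then in every later frame keep the box that best
    combines overlap with the previous box and its own score."""
    first = max(detections_per_frame[0], key=lambda d: d[1])
    tube = [first[0]]
    for dets in detections_per_frame[1:]:
        best = max(dets, key=lambda d: iou(tube[-1], d[0]) + d[1])
        tube.append(best[0])
    return tube
```

A real tracker would also handle track births, deaths, and occlusions (which is precisely what the full-body regression above is designed to help with); greedy IoU linking is only the simplest possible baseline for turning detections into tubes.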