Video imprint (computer vision)

Proposed as an extension of image epitomes to the field of video content analysis, a video imprint is obtained by recasting video content into a fixed-size tensor representation, [1] [2] regardless of the video's resolution or duration. Statistical characteristics of the video are retained to a degree that allows common video recognition tasks, such as event retrieval and temporal action localization, to be carried out directly on the imprints. [2] The computation of video imprints is claimed to account for spatio-temporal interdependencies while mitigating redundancies.

Computing video imprints with the epitome model [3] has the advantage of more flexible input feature formats and a more efficient training stage for video content analysis.
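
The published construction builds on the epitome model and is more involved than fixed-window pooling; purely to illustrate the fixed-size property (this is not the authors' algorithm), the sketch below averages a variable number of per-frame feature maps into a tensor of constant shape:

```python
import numpy as np

def naive_video_imprint(frame_features, out_t=8):
    """Illustration only: pool (T, H, W, C) per-frame features, where T
    varies per video, into a fixed-size (out_t, H, W, C) tensor by
    averaging frames within out_t equal temporal bins."""
    T = frame_features.shape[0]
    bins = np.array_split(np.arange(T), out_t)
    return np.stack([frame_features[idx].mean(axis=0) for idx in bins])

# Videos of different lengths map to tensors of identical shape.
short = np.random.rand(30, 7, 7, 512)    # 30 frames of CNN features
long_ = np.random.rand(300, 7, 7, 512)   # 300 frames
assert naive_video_imprint(short).shape == naive_video_imprint(long_).shape
```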

Related Research Articles

Automatic image annotation

Automatic image annotation is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.

Vasant G. Honavar is an Indian-born American computer scientist and professor whose research spans artificial intelligence, machine learning, big data, data science, causal inference, knowledge representation, bioinformatics, and health informatics.

Long short-term memory

Long short-term memory (LSTM) is an artificial neural network used in the fields of artificial intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. Such a recurrent neural network (RNN) can process not only single data points, but also entire sequences of data. For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition, speech recognition, machine translation, robot control, video games, and healthcare. LSTM has become the most cited neural network of the 20th century.
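
A minimal sketch of a single LSTM time step in NumPy may help make the feedback concrete (standard gate equations without peephole connections; the stacked weight layout is one common packing convention):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. x: (D,) input, h: (H,) hidden state, c: (H,) cell
    state. W: (4H, D) input weights, U: (4H, H) recurrent weights,
    b: (4H,) bias; gates stacked in the order input, forget, cell, output."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[0*H:1*H])   # input gate
    f = sigmoid(z[1*H:2*H])   # forget gate
    g = np.tanh(z[2*H:3*H])   # candidate cell update
    o = sigmoid(z[3*H:4*H])   # output gate
    c_new = f * c + i * g     # the feedback path: cell state carries memory
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Running this step over a sequence while carrying (h, c) forward is what lets the network process entire sequences rather than single data points.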

Activity recognition aims to recognize the actions and goals of one or more agents from a series of observations on the agents' actions and the environmental conditions. Since the 1980s, this research field has captured the attention of several computer science communities due to its strength in providing personalized support for many different applications and its connection to many different fields of study such as medicine, human-computer interaction, or sociology.

In machine learning the random subspace method, also called attribute bagging or feature bagging, is an ensemble learning method that attempts to reduce the correlation between estimators in an ensemble by training them on random samples of features instead of the entire feature set.
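
A minimal sketch with scikit-learn, where each estimator is trained on a random half of the features (parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=40, random_state=0)

clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    max_features=0.5,   # each tree sees a random 50% subset of features
    bootstrap=False,    # keep all samples: randomness comes from features only
    random_state=0,
).fit(X, y)
print(clf.score(X, y))
```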

Convolutional neural network

In deep learning, a convolutional neural network is a class of artificial neural network (ANN), most commonly applied to analyze visual imagery. CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN), based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps. Counter-intuitively, most convolutional neural networks are not invariant to translation, due to the downsampling operation they apply to the input. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain–computer interfaces, and financial time series.
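
The equivariance-versus-invariance point can be checked numerically; below is a small sketch using SciPy's 2-D cross-correlation as a stand-in for a convolution layer:

```python
import numpy as np
from scipy.signal import correlate2d

img = np.zeros((8, 8)); img[2, 2] = 1.0    # a single bright pixel
shifted = np.roll(img, 1, axis=1)          # the same image, moved right by 1
k = np.ones((3, 3)) / 9.0                  # a simple averaging filter

# Sliding filters are translation-equivariant: shifting the input
# shifts the feature map by the same amount.
a = correlate2d(img, k, mode="same")
b = correlate2d(shifted, k, mode="same")
assert np.allclose(np.roll(a, 1, axis=1), b)

# Stride-2 downsampling breaks exact invariance: a one-pixel shift
# changes which values survive the subsampling.
print(np.allclose(a[::2, ::2], b[::2, ::2]))   # False
```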

Social media mining is the process of obtaining big data from user-generated content on social media sites and mobile apps in order to extract actionable patterns, form conclusions about users, and act upon the information, often for the purpose of advertising to users or conducting research. The term is an analogy to the resource extraction process of mining for rare minerals. Resource extraction mining requires mining companies to sift through vast quantities of raw ore to find the precious minerals; likewise, social media mining requires human data analysts and automated software programs to sift through massive amounts of raw social media data in order to discern patterns and trends relating to social media usage, online behaviours, sharing of content, connections between individuals, online buying behaviour, and more. These patterns and trends are of interest to companies, governments and not-for-profit organizations, as these organizations can use them to design strategies or introduce new programs, products, processes or services.

Object co-segmentation

In computer vision, object co-segmentation is a special case of image segmentation, which is defined as jointly segmenting semantically similar objects in multiple images or video frames.

Visual temporal attention

Visual temporal attention is a special case of visual attention that involves directing attention to a specific instant in time. Similar to its spatial counterpart, visual spatial attention, such attention modules have been widely implemented in video analytics in computer vision to provide enhanced performance together with human-interpretable explanations of deep learning models.
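
One simple form of such a module scores each frame, normalizes the scores with a softmax over time, and pools features by the resulting weights; a minimal sketch (the dot-product scoring function is an illustrative choice):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def temporal_attention_pool(frame_feats, w):
    """frame_feats: (T, D) per-frame features; w: (D,) learned scoring
    vector. Returns the attention-weighted average feature and the
    per-frame weights, which are the human-interpretable part."""
    scores = frame_feats @ w      # one relevance score per frame
    alpha = softmax(scores)       # attention weights over time, summing to 1
    return alpha @ frame_feats, alpha

feats = np.random.rand(50, 128)   # 50 frames of 128-d features
pooled, alpha = temporal_attention_pool(feats, np.random.rand(128))
```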

Triplet loss is a loss function for machine learning algorithms where a reference input (the anchor) is compared to a matching input (the positive) and a non-matching input (the negative). The distance from the anchor to the positive is minimized, while the distance from the anchor to the negative is maximized. An early formulation equivalent to triplet loss was introduced for metric learning from relative comparisons by M. Schultz and T. Joachims in 2003.
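
A minimal sketch of the common hinge formulation (Euclidean distance and a fixed margin are one standard choice):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-form triplet loss: push the positive at least `margin`
    closer to the anchor than the negative; zero loss once that
    ordering holds."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```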

Dynamic texture is texture with motion, found in videos of sea waves, fire, smoke, wavy trees, etc. A dynamic texture has a spatially repetitive pattern with a time-varying visual appearance. Modeling and analyzing dynamic texture is a topic of image processing and pattern recognition in computer vision.
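
A classic formulation models a dynamic texture as a linear dynamical system, with a hidden state evolving as x(t+1) = A x(t) + noise and frames observed as y(t) = C x(t); the sketch below synthesizes frames from random, untrained parameters purely to illustrate the generative loop:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 10, 64, 100    # state dimension, pixels per frame, frame count

# Illustrative parameters; in practice A and C are fitted to a training
# clip (commonly via an SVD of the stacked frame matrix).
A = 0.95 * np.linalg.qr(rng.standard_normal((n, n)))[0]   # stable dynamics
C = rng.standard_normal((d, n))    # maps hidden state to pixel intensities

x = rng.standard_normal(n)
frames = []
for _ in range(T):
    x = A @ x + 0.1 * rng.standard_normal(n)   # state evolution with noise
    frames.append(C @ x)                       # one flattened output frame
frames = np.stack(frames)                      # (T, d) synthetic sequence
```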

An event camera, also known as a neuromorphic camera, silicon retina or dynamic vision sensor, is an imaging sensor that responds to local changes in brightness. Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.
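
A minimal sketch of how such an asynchronous stream is often accumulated into an image for downstream processing (the window and signed-count encoding are illustrative choices, not part of the sensor):

```python
import numpy as np

def events_to_frame(events, height, width, t0, t1):
    """events: iterable of (x, y, t, polarity) tuples with polarity in
    {-1, +1}. Sums the polarities of all events falling in the time
    window [t0, t1) into a signed brightness-change image."""
    frame = np.zeros((height, width))
    for x, y, t, p in events:
        if t0 <= t < t1:
            frame[int(y), int(x)] += p
    return frame

frame = events_to_frame([(3, 5, 0.01, +1), (3, 5, 0.02, -1)], 8, 8, 0.0, 0.1)
```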

Energy-based generative neural networks are a class of generative models that aim to learn explicit probability distributions of data in the form of energy-based models whose energy functions are parameterized by modern deep neural networks. The name reflects the fact that these models can be derived from discriminative neural networks. The parameters of the neural network are trained in a generative manner by Markov chain Monte Carlo (MCMC)-based maximum likelihood estimation. The learning process follows an "analysis by synthesis" scheme: within each learning iteration, the algorithm samples synthesized examples from the current model by a gradient-based MCMC method, e.g., Langevin dynamics, and then updates the model parameters based on the difference between the training examples and the synthesized ones. This process can be interpreted as alternating mode seeking and mode shifting, and also has an adversarial interpretation. The first energy-based generative neural network was the generative ConvNet, proposed in 2016 for image patterns, in which the neural network is a convolutional neural network. The model has since been generalized to other domains, such as distributions of videos and 3D voxels, and made more effective in later variants. These models have proven useful for data generation, recovery, and reconstruction.
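
The sampling half of that loop is easy to state compactly; below is a minimal sketch of Langevin dynamics, with a toy quadratic energy standing in for a deep network's energy function (whose gradient would come from backpropagation):

```python
import numpy as np

def langevin_sample(grad_energy, x0, steps=2000, eps=0.05, rng=None):
    """Langevin update: x <- x - (eps^2 / 2) * dE/dx + eps * noise.
    For small eps and enough steps, samples approximately follow the
    density proportional to exp(-E(x))."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for _ in range(steps):
        x -= 0.5 * eps**2 * grad_energy(x)       # gradient (mode-seeking) term
        x += eps * rng.standard_normal(x.shape)  # diffusion (exploration) term
    return x

# Toy check: with E(x) = ||x||^2 / 2 (gradient: x), the chain samples
# approximately from a standard Gaussian.
samples = langevin_sample(lambda x: x, np.zeros(1000))
print(samples.std())   # close to 1.0
```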

In the domain of physics and probability, the filters, random fields, and maximum entropy (FRAME) model is a Markov random field model of stationary spatial processes, in which the energy function is the sum of translation-invariant potential functions that are one-dimensional non-linear transformations of linear filter responses. The FRAME model was originally developed by Song-Chun Zhu, Ying Nian Wu, and David Mumford for modeling stochastic texture patterns, such as grasses, tree leaves, brick walls, water waves, etc. This model is the maximum entropy distribution that reproduces the observed marginal histograms of responses from a bank of filters, where for each filter, tuned to a specific scale and orientation, the marginal histogram is pooled over all pixels in the image domain. The FRAME model has also been proved equivalent to a micro-canonical ensemble, named the Julesz ensemble. A Gibbs sampler is adopted to synthesize texture images by drawing samples from the FRAME model.
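
Stated as a formula, the model has the standard Gibbs form below (the notation is the commonly used one and may differ across papers):

```latex
% FRAME density over images I, with filter bank {F^(k)}, k = 1..K:
%   H_k(I)    : marginal histogram of the responses F^(k) * I,
%               pooled over all pixels
%   \lambda_k : potential (Lagrange multiplier) vector for histogram k
%   Z(\Lambda): normalizing partition function
p(I;\Lambda) \;=\; \frac{1}{Z(\Lambda)}\,
  \exp\Big\{-\sum_{k=1}^{K} \big\langle \lambda_k,\, H_k(I)\big\rangle\Big\}
```

Maximum entropy enters as follows: among all densities matching the observed histograms, this exponential-family form is the one with the largest entropy.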

Spatial embedding

Spatial embedding is a feature learning technique used in spatial analysis, in which points, lines, polygons, or other spatial data types representing geographic locations are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with many dimensions per geographic object to a continuous vector space of much lower dimension.

Video super-resolution

Video super-resolution (VSR) is the process of generating high-resolution video frames from given low-resolution ones. Unlike single-image super-resolution (SISR), the goal is not only to restore fine details while preserving coarse ones, but also to maintain motion consistency across frames.

Video matting is a technique for separating a video into two or more layers, usually foreground and background, and generating alpha mattes which determine the blending of the layers. The technique is very popular in video editing because it allows the background to be substituted and the layers to be processed individually.
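
Underlying the layer model is the standard compositing equation C = αF + (1 − α)B; the sketch below applies it in the forward direction (matting itself is the harder inverse problem of recovering F, B, and α from C):

```python
import numpy as np

def composite(fg, bg, alpha):
    """Blend foreground over background with an alpha matte.
    fg, bg: (H, W, 3) float images; alpha: (H, W) matte in [0, 1]."""
    a = alpha[..., None]              # broadcast the matte over RGB channels
    return a * fg + (1.0 - a) * bg

fg = np.random.rand(4, 4, 3)
bg = np.random.rand(4, 4, 3)
alpha = np.random.rand(4, 4)
out = composite(fg, bg, alpha)        # same shape as fg and bg
```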

Self-supervised learning (SSL) is a method of machine learning that learns from unlabeled sample data. It can be regarded as an intermediate form between supervised and unsupervised learning. It is based on an artificial neural network or another model such as a decision list. The model learns in two steps. First, a pretext task is solved using pseudo-labels, which helps to initialize the model parameters. Second, the actual task is performed with supervised or unsupervised learning. Self-supervised learning has produced promising results in recent years and has found practical application in audio processing; it is used by Facebook and others for speech recognition. Its primary appeal is that training can occur with data of lower quality, rather than that it improves ultimate outcomes. Self-supervised learning more closely imitates the way humans learn to classify objects.

Jiaya Jia is a tenured professor in the Department of Computer Science and Engineering at The Chinese University of Hong Kong (CUHK). He is an IEEE Fellow, the associate editor-in-chief of one of IEEE's flagship journals, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), and a member of the editorial board of the International Journal of Computer Vision (IJCV).

References

  1. Gao, Zhanning; Wang, Le; Jojic, Nebojsa; Niu, Zhenxing; Zheng, Nanning; Hua, Gang (2019-12-01). "Video Imprint". IEEE Transactions on Pattern Analysis and Machine Intelligence. 41 (12): 3086–3099. arXiv:2106.03283. doi:10.1109/tpami.2018.2866114. ISSN 0162-8828. PMID 30130178. S2CID 52059105.
  2. Gao, Zhanning; Wang, Le; Zhang, Qilin; Niu, Zhenxing; Zheng, Nanning; Hua, Gang (2019-07-17). "Video Imprint Segmentation for Temporal Action Detection in Untrimmed Videos" (PDF). Proceedings of the AAAI Conference on Artificial Intelligence. 33 (1): 8328–8335. doi:10.1609/aaai.v33i01.33018328. ISSN 2374-3468.
  3. Jojic, N.; Frey, B.J.; Kannan, A. (2003). "Epitomic analysis of appearance and shape". Proceedings Ninth IEEE International Conference on Computer Vision. pp. 34–41, vol. 1. doi:10.1109/iccv.2003.1238311. ISBN 0-7695-1950-4.