Video matting is a technique for separating a video into two or more layers, usually foreground and background, and generating alpha mattes that determine how the layers are blended. The technique is widely used in video editing because it allows users to replace the background or process the layers individually.
When two images are combined, an alpha matte, also known as a transparency map, is used. In digital video, the alpha matte is a sequence of images. In the simplest case the matte serves as a binary mask, defining which parts of the image are visible. In the more general case it enables smooth blending of the images: the alpha matte is used as the transparency map of the top image. Alpha matting has been part of film production since the very beginning of filmmaking, when mattes were drawn by hand. Nowadays, the process can be automated with computer algorithms.
The basic matting problem is defined as follows: given an image I, compute the foreground F, the background B, and the alpha matte α such that the compositing equation I = αF + (1 − α)B holds for every pixel. This equation has the trivial solution α = 1, F = I, with B being any image. Thus, an additional trimap is usually provided as input. The trimap specifies background pixels, foreground pixels, and uncertain pixels; the uncertain pixels are decomposed into foreground and background by the matting method.
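To make the compositing equation concrete, here is a minimal NumPy sketch (the function and array names are illustrative, not taken from any particular library):

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend two layers according to I = alpha * F + (1 - alpha) * B.

    foreground, background: float arrays of shape (H, W, 3) in [0, 1]
    alpha: float array of shape (H, W, 1) in [0, 1]
    """
    return alpha * foreground + (1.0 - alpha) * background

# The inverse problem is under-constrained: alpha = 1, F = I reproduces
# any input I regardless of B, which is why a trimap is needed to
# constrain the decomposition.
```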
The main criteria for video matting methods from a user perspective are the following:
The first known video matting method [1] was developed in 2001. It uses optical flow for trimap propagation and applies a Bayesian image matting technique to each frame separately.
Video SnapCut, [2] which was later incorporated into Adobe After Effects as the Roto Brush tool, was developed in 2009. The method uses local classifiers for binary image segmentation near the target object's boundary. The segmentation results are propagated to the next frame using optical flow, and an image matting algorithm [3] is applied.
A method [4] from 2011 was also included in Adobe After Effects, as the Refine Edge tool. Trimap propagation with optical flow was enhanced with control points along the object's edge. The method performs matting per frame, but temporal coherence is improved with a temporal filter.
Finally, a deep learning method [5] was developed for image matting in 2017. It outperforms most traditional methods. [6]
Video matting is a rapidly evolving field with many practical applications. However, to compare the quality of the methods, they must be tested on a benchmark, which consists of a dataset of test sequences and a methodology for comparing results. Currently there is one major online video matting benchmark, [6] which uses chroma keying and stop motion for ground-truth estimation. After a method is submitted, its rating is derived from objective metrics. Since objective metrics do not fully represent human perception of quality, a subjective survey is necessary for an adequate comparison.
| Method | Year of development | Ranking place |
|---|---|---|
| Deep Image Matting [5] | 2016 | 1 |
| Self-Adaptive [7] | 2016 | 2 |
| Learning Based [8] | 2009 | 3 |
| Sparse Sampling [9] | 2016 | 4 |
| Closed Form [3] | 2008 | 5 |
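As an illustration of such objective metrics, the following minimal NumPy sketch computes two pixel-wise measures commonly used in matting evaluation, the sum of absolute differences (SAD) and the mean squared error (MSE), between a predicted matte and the ground truth (the exact metrics used by the benchmark may differ; function names are illustrative):

```python
import numpy as np

def sad(alpha_pred, alpha_gt):
    """Sum of absolute differences between predicted and ground-truth mattes."""
    return np.abs(alpha_pred - alpha_gt).sum()

def mse(alpha_pred, alpha_gt):
    """Mean squared error between predicted and ground-truth mattes."""
    return np.mean((alpha_pred - alpha_gt) ** 2)
```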
Video matting methods are required in video editing software. The most common application is cutting out an object and transferring it into another scene. Such tools allow users to cut out a moving object by interactively painting areas that must or must not belong to the object, or by specifying complete trimaps as input. There are several software implementations:
To enhance the speed and quality of matting, some methods use additional data. For example, time-of-flight cameras have been explored in real-time matting systems. [12]
Another application of video matting is background matting, which is very popular in online video calls. A Zoom plugin has been developed, [13] and Skype announced Background Replace in June 2020. [14] Video matting methods also make it possible to apply video effects only to the background or only to the foreground.
Video matting is crucial in 2D to 3D conversion, where the alpha matte is used to correctly process transparent objects. It is also employed in stereo to multiview conversion.
Closely related to matting is video completion, [15] used after removal of an object in a video. While matting separates the video into several layers, completion fills the resulting gaps with plausible content from the video after one of the layers is removed.
In computer graphics, alpha compositing or alpha blending is the process of combining one image with a background to create the appearance of partial or full transparency. It is often useful to render picture elements (pixels) in separate passes or layers and then combine the resulting 2D images into a single, final image called the composite. Compositing is used extensively in film when combining computer-rendered image elements with live footage. Alpha blending is also used in 2D computer graphics to put rasterized foreground elements over a background.
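For illustration, here is a minimal NumPy sketch of the standard "over" operator with straight (non-premultiplied) alpha; the function and variable names are illustrative:

```python
import numpy as np

def over(rgb_a, alpha_a, rgb_b, alpha_b):
    """Composite image A over image B using straight (non-premultiplied) alpha.

    rgb_*: float arrays of shape (H, W, 3) in [0, 1]
    alpha_*: float arrays of shape (H, W, 1) in [0, 1]
    """
    # Resulting coverage: A's alpha plus B's alpha where A is transparent.
    alpha_out = alpha_a + alpha_b * (1.0 - alpha_a)
    # Weighted colour blend, normalized by the output alpha.
    rgb_out = (rgb_a * alpha_a + rgb_b * alpha_b * (1.0 - alpha_a)) \
        / np.maximum(alpha_out, 1e-8)
    return rgb_out, alpha_out
```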
Digital compositing is the process of digitally assembling multiple images to make a final image, typically for print, motion pictures or screen display. It is the digital analogue of optical film compositing.
Chroma key compositing, or chroma keying, is a visual-effects and post-production technique for compositing (layering) two or more images or video streams together based on colour hues. The technique has been used in many fields to remove a background from the subject of a photo or video – particularly the newscasting, motion picture, and video game industries. A colour range in the foreground footage is made transparent, allowing separately filmed background footage or a static image to be inserted into the scene. The chroma keying technique is commonly used in video production and post-production. This technique is also referred to as colour keying, colour-separation overlay, or by various terms for specific colour-related variants such as green screen or blue screen; chroma keying can be done with backgrounds of any colour that are uniform and distinct, but green and blue backgrounds are more commonly used because they differ most distinctly in hue from any human skin colour. No part of the subject being filmed or photographed may duplicate the colour used as the backing, or the part may be erroneously identified as part of the backing.
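As a simplified illustration of the idea, the sketch below marks pixels near a backing colour as transparent (the names and threshold are illustrative; production keyers work in hue-based colour spaces and produce soft rather than binary mattes):

```python
import numpy as np

def chroma_key(rgb, key_rgb=(0.0, 1.0, 0.0), threshold=0.35):
    """Naive chroma key: pixels close to the backing colour become transparent.

    rgb: float array of shape (H, W, 3) in [0, 1]
    Returns an alpha matte of shape (H, W, 1): 0 where the backing
    colour is detected, 1 elsewhere.
    """
    distance = np.linalg.norm(rgb - np.asarray(key_rgb), axis=-1, keepdims=True)
    return (distance > threshold).astype(np.float32)
```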
Compositing is the process or technique of combining visual elements from separate sources into single images, often to create the illusion that all those elements are parts of the same scene. Live-action shooting for compositing is variously called "chroma key", "blue screen", "green screen" and other names. Today, most, though not all, compositing is achieved through digital image manipulation. Pre-digital compositing techniques, however, go back as far as the trick films of Georges Méliès in the late 19th century, and some are still in use.
Mattes are used in photography and special effects filmmaking to combine two or more image elements into a single, final image. Usually, mattes are used to combine a foreground image with a background image. In this case, the matte is the background painting. In film and stage, mattes can be physically huge sections of painted canvas, portraying large scenic expanses of landscapes.
In digital image processing, sub-pixel resolution can be obtained in images constructed from sources with information exceeding the nominal pixel resolution of said images.
Articulated body pose estimation in computer vision is the study of algorithms and systems that recover the pose of an articulated body, which consists of joints and rigid parts, using image-based observations. It is one of the longest-lasting problems in computer vision, because of the complexity of the models that relate observation with pose and because of the variety of situations in which it would be useful.
In computer vision, the bag-of-words model sometimes called bag-of-visual-words model can be applied to image classification or retrieval, by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.
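A minimal sketch of this pipeline, assuming scikit-learn is available for clustering (local descriptor extraction, e.g. SIFT, is left out; function names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available

def build_vocabulary(descriptors, n_words=256):
    """Cluster local descriptors from a training set into a visual vocabulary."""
    return KMeans(n_clusters=n_words).fit(descriptors)

def bag_of_words(image_descriptors, vocabulary):
    """Normalized histogram of visual-word occurrences for one image."""
    words = vocabulary.predict(image_descriptors)
    counts = np.bincount(words, minlength=vocabulary.n_clusters)
    return counts / max(counts.sum(), 1)
```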
2D to 3D video conversion is the process of transforming 2D ("flat") film to 3D form, which in almost all cases is stereo, so it is the process of creating imagery for each eye from one 2D image.
In computer vision, the term cuboid is used to describe a small spatiotemporal volume extracted for purposes of behavior recognition. The cuboid is regarded as a basic geometric primitive type and is used to depict three-dimensional objects within a three-dimensional representation of a flat, two-dimensional image.
Foreground detection is one of the major tasks in the field of computer vision and image processing whose aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing.
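A minimal sketch of background subtraction with a running-average background model (the parameter values are illustrative):

```python
import numpy as np

def foreground_mask(frame, background, threshold=0.1, learning_rate=0.05):
    """One step of simple background subtraction.

    frame, background: float grayscale arrays of shape (H, W) in [0, 1].
    Returns (mask, updated_background); mask is True on foreground pixels.
    """
    # Pixels that differ sufficiently from the background model are foreground.
    mask = np.abs(frame - background) > threshold
    # Slowly adapt the background model to gradual scene changes.
    background = (1 - learning_rate) * background + learning_rate * frame
    return mask, background
```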
Robust Principal Component Analysis (RPCA) is a modification of the widely used statistical procedure of principal component analysis (PCA) which works well with respect to grossly corrupted observations. A number of different approaches exist for Robust PCA, including an idealized version of Robust PCA, which aims to recover a low-rank matrix L0 from highly corrupted measurements M = L0 + S0. This decomposition into low-rank and sparse matrices can be achieved by techniques such as the Principal Component Pursuit method (PCP), Stable PCP, Quantized PCP, Block based PCP, and Local PCP. Then, optimization methods are used, such as the Augmented Lagrange Multiplier Method (ALM), Alternating Direction Method (ADM), Fast Alternating Minimization (FAM), Iteratively Reweighted Least Squares (IRLS) or alternating projections (AP).
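A compact NumPy sketch of the idealized decomposition via an inexact augmented-Lagrange-style iteration for Principal Component Pursuit (parameter choices follow common defaults from the literature; this is illustrative, not a tuned solver):

```python
import numpy as np

def shrink(x, tau):
    """Soft-thresholding (shrinkage) operator."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca_pcp(M, n_iter=200):
    """Decompose M into low-rank L and sparse S by approximately minimising
    ||L||_* + lam * ||S||_1 subject to M = L + S."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))          # standard PCP weight
    mu = m * n / (4.0 * np.abs(M).sum())    # common penalty initialization
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                    # Lagrange multipliers
    for _ in range(n_iter):
        # Singular value thresholding step for the low-rank component.
        U, sigma, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(shrink(sigma, 1.0 / mu)) @ Vt
        # Elementwise shrinkage step for the sparse component.
        S = shrink(M - L + Y / mu, lam / mu)
        # Dual update on the constraint residual.
        Y = Y + mu * (M - L - S)
    return L, S
```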
In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time. These subsets correspond to independent rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Image segmentation techniques label pixels that share certain characteristics at a particular time. Here, pixels are segmented depending on their relative movement over a period of time, i.e. over the duration of the video sequence.
Moving object detection is a technique used in computer vision and image processing. Multiple consecutive frames from a video are compared by various methods to determine whether any moving object is present.
In computer vision, object co-segmentation is a special case of image segmentation, which is defined as jointly segmenting semantically similar objects in multiple images or video frames.
Michael J. Black is an American-born computer scientist working in Tübingen, Germany. He is a founding director at the Max Planck Institute for Intelligent Systems where he leads the Perceiving Systems Department in research focused on computer vision, machine learning, and computer graphics. He is also an Honorary Professor at the University of Tübingen.
Video super-resolution (VSR) is the process of generating high-resolution video frames from the given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore more fine details while saving coarse ones, but also to preserve motion consistency.
Self-supervised learning (SSL) refers to a machine learning paradigm, and corresponding methods, for processing unlabeled data to obtain useful representations that can help with downstream learning tasks. The defining characteristic of SSL methods is that they do not need human-annotated labels: they are designed to take in datasets consisting entirely of unlabeled data samples. A typical SSL pipeline learns supervisory signals in a first stage, which are then used for a supervised learning task in the second and later stages. For this reason, SSL can be described as an intermediate form of unsupervised and supervised learning.
Jiaya Jia is a tenured professor of the Department of Computer Science and Engineering at The Chinese University of Hong Kong (CUHK). He is an IEEE Fellow, the associate editor-in-chief of one of IEEE's flagship journals, Transactions on Pattern Analysis and Machine Intelligence (TPAMI), and a member of the editorial board of the International Journal of Computer Vision (IJCV).
Wolfgang Heidrich is a German-Canadian computer scientist and Professor at the King Abdullah University of Science and Technology (KAUST), for which he served as the director of the Visual Computing Center from 2014 to 2021. He was previously a professor at the University of British Columbia (UBC), where he was a Dolby Research Chair (2008-2013). His research has combined methods from computer graphics, optics, machine vision, imaging, inverse methods, and perception to develop new computational imaging and display technologies. His more recent interest focuses on hardware-software co-design of the next generation of imaging systems, with applications such as high dynamic range (HDR) imaging, compact computational cameras, hyperspectral cameras, and wavefront sensors, to name just a few.