Video matting

Video matting is a technique for separating a video into two or more layers, usually foreground and background, and generating alpha mattes that determine the blending of the layers. The technique is very popular in video editing because it allows editors to substitute the background or to process the layers individually.

Video matting methods

Problem definition

When two images are combined, an alpha matte, also known as a transparency map, is used. In the case of digital video, the alpha matte is a sequence of images. The matte can serve as a binary mask that defines which parts of the image are visible. In the more general case it enables smooth blending of the images: the alpha matte is used as the transparency map of the top image. Alpha matting has been known in film production since the earliest days of filmmaking, when mattes were drawn by hand. Nowadays the process can be automated with computer algorithms.

Left to right: input image, background, foreground, and alpha matte.

The basic matting problem is defined as follows: given an image I, compute the foreground F, background B and alpha matte α such that the equation I = αF + (1 − α)B holds for every pixel. This equation has a trivial solution (α = 1, F = I, B an arbitrary image), so an additional trimap must usually be provided as input. The trimap specifies background, foreground, and uncertain pixels; the uncertain pixels are then decomposed into foreground and background by the matting method.
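
The compositing equation is easy to verify numerically. Below is a minimal sketch in Python with NumPy; the array shapes and values are illustrative, not from any particular dataset:

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend two layers with the matting equation I = alpha*F + (1 - alpha)*B.

    foreground, background: float images of shape (H, W, 3) in [0, 1]
    alpha: float matte of shape (H, W) in [0, 1]; 1 means fully foreground
    """
    a = alpha[..., np.newaxis]          # broadcast the matte over colour channels
    return a * foreground + (1.0 - a) * background

# Toy 2x2 frame: white foreground over black background.
F = np.ones((2, 2, 3))
B = np.zeros((2, 2, 3))
alpha = np.array([[1.0, 0.0],
                  [1.0, 0.5]])          # 0.5 marks a semi-transparent pixel
I = composite(F, B, alpha)
print(I[1, 1])                          # [0.5 0.5 0.5] -- an even blend
```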

From a user perspective, video matting methods are judged by criteria such as the quality and temporal coherence of the resulting mattes and the amount of user input they require.

The trimap (bottom) is used as a guide for estimating the alpha matte. White pixels are foreground, black pixels are background, and grey pixels are yet to be estimated. Matting algorithms take the complete frame (top) and the trimap as input to produce the alpha matte (middle).
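
In practice a trimap is often derived from a rough binary mask: eroding the mask yields pixels that are certainly foreground, dilating it bounds the background, and the band in between is left for the matting method to resolve. A sketch with OpenCV; the band width and the 0/128/255 label convention are assumptions, not a fixed standard:

```python
import cv2
import numpy as np

def mask_to_trimap(mask, band=10):
    """Turn a binary mask (255 = object, 0 = background) into a trimap."""
    kernel = np.ones((band, band), np.uint8)
    sure_fg = cv2.erode(mask, kernel)       # shrink: definitely foreground
    reachable = cv2.dilate(mask, kernel)    # grow: anything outside is background
    trimap = np.full(mask.shape, 128, np.uint8)  # start everything as unknown
    trimap[sure_fg == 255] = 255
    trimap[reachable == 0] = 0
    return trimap
```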

Methods description

The first known video matting method [1] was published in 2002. It uses optical flow to propagate the trimap between frames and applies a Bayesian image matting technique to each frame separately.
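
The propagation step can be sketched as warping the previous frame's trimap along dense optical flow. The sketch below uses OpenCV's Farnebäck estimator as a generic stand-in for the flow method of the original paper:

```python
import cv2
import numpy as np

def propagate_trimap(prev_gray, next_gray, prev_trimap):
    """Warp the trimap of frame t onto frame t+1 along dense optical flow."""
    # Backward flow (t+1 -> t): for each pixel of the new frame, where it
    # came from in the previous frame.
    flow = cv2.calcOpticalFlowFarneback(next_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # Nearest-neighbour sampling keeps the three trimap labels discrete.
    return cv2.remap(prev_trimap, map_x, map_y, cv2.INTER_NEAREST)
```

Flow errors accumulate over frames, which is why practical systems re-examine the propagated trimap near the object boundary.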

Video SnapCut, [2] which was later incorporated into Adobe After Effects as the Roto Brush tool, was developed in 2009. The method uses local classifiers for binary image segmentation near the target object's boundary. The segmentation results are propagated to the next frame using optical flow, and an image matting algorithm [3] is then applied.

A method [4] from 2011 was also included in Adobe After Effects, as the Refine Edge tool. It enhances optical-flow trimap propagation with control points along the object's edge. The method still performs matting per frame, but improves temporal coherence with a temporal filter.
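
As a hedged illustration of temporal filtering (not the filter from the paper, which accounts for motion), even a causal exponential moving average over the per-frame mattes suppresses much of the flicker:

```python
def smooth_alphas(alphas, strength=0.5):
    """Reduce frame-to-frame flicker in a sequence of alpha mattes
    (float arrays in [0, 1]) with an exponential moving average."""
    smoothed = [alphas[0]]
    for alpha in alphas[1:]:
        # Higher strength means stronger smoothing but more motion lag.
        smoothed.append(strength * smoothed[-1] + (1.0 - strength) * alpha)
    return smoothed
```

A naive average like this blurs the matte around fast-moving edges; published filters compensate for motion before smoothing.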

Finally, a deep learning method [5] was developed for image matting in 2017. It outperforms most traditional methods. [6]

Benchmarking

Video matting is a rapidly evolving field with many practical applications. To compare the quality of different methods, however, they must be tested on a benchmark: a dataset of test sequences together with a result-comparison methodology. Currently there is one major online video matting benchmark, [6] which uses chroma keying and stop motion to estimate ground truth. After a method is submitted, its rating is derived from objective metrics. Because objective metrics do not fully represent human perception of quality, a subjective survey is needed for an adequate comparison.
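
Two metrics that are standard in matting evaluation, the sum of absolute differences (SAD) and the mean squared error (MSE), can serve as a sketch of how such a rating might be computed; the benchmark's exact metric set and aggregation scheme are assumptions here:

```python
import numpy as np

def sad(alpha_pred, alpha_gt):
    """Sum of absolute differences between predicted and ground-truth mattes."""
    return np.abs(alpha_pred - alpha_gt).sum()

def mse(alpha_pred, alpha_gt):
    """Mean squared error between predicted and ground-truth mattes."""
    return np.mean((alpha_pred - alpha_gt) ** 2)

def rank_methods(predictions, ground_truth):
    """Order methods by average SAD over all test frames (lower is better)."""
    scores = {name: np.mean([sad(p, gt) for p, gt in zip(preds, ground_truth)])
              for name, preds in predictions.items()}
    return sorted(scores, key=scores.get)
```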

Top 5 video matting methods [6]
Method                   Year of development   Ranking place
Deep Image Matting [5]   2016                  1
Self-Adaptive [7]        2016                  2
Learning Based [8]       2009                  3
Sparse Sampling [9]      2016                  4
Closed Form [3]          2008                  5

Practical use

Object cutout

Video matting methods are needed in video editing software. The most common application is cutting out an object and transferring it into another scene. Such tools let users cut out a moving object by interactively painting areas that must or must not belong to the object, or by specifying complete trimaps as input. Software implementations include interactive video cutout systems [10] and matting plugins for Adobe After Effects. [11]

To enhance the speed and quality of matting, some methods use additional data. For example, time-of-flight cameras have been explored in real-time matting systems. [12]

Background replacement

Another application of video matting is background replacement, which is very popular in online video calls. A plugin for Zoom has been developed, [13] and Skype announced Background Replace in June 2020. [14] Video matting methods also make it possible to apply video effects only to the background or only to the foreground.
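
Applying an effect to one layer only is a direct use of the compositing equation from above. A sketch that blurs the background of a single frame while keeping the foreground sharp (OpenCV; the kernel size is an arbitrary choice):

```python
import cv2

def blur_background(frame, alpha):
    """Blur only the background layer of a frame.

    frame: float BGR image in [0, 1]; alpha: float matte in [0, 1].
    """
    blurred = cv2.GaussianBlur(frame, (31, 31), 0)
    a = alpha[..., None]
    # Composite the sharp foreground over the blurred background.
    return a * frame + (1.0 - a) * blurred
```

Replacing the background entirely amounts to substituting any other image for `blurred`.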

3D video editing

Video matting is crucial in 2D to 3D conversion, where the alpha matte is used to correctly process transparent objects. It is also employed in stereo to multiview conversion.

Video completion

Closely related to matting is video completion, [15] the filling of gaps after an object has been removed from the video. While matting separates a video into several layers, completion fills the holes left by removing one of the layers with plausible content drawn from the rest of the video.
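
As a minimal single-frame stand-in for video completion, classical image inpainting can fill the hole; true video completion additionally copies the occluded content from nearby frames where it is actually visible, which is what makes the result temporally plausible:

```python
import cv2

def complete_frame(frame, removed_mask):
    """Fill the hole left by a removed object in a single frame.

    frame: 8-bit BGR image; removed_mask: 8-bit mask, 255 where the
    removed layer used to be.
    """
    return cv2.inpaint(frame, removed_mask, inpaintRadius=5,
                       flags=cv2.INPAINT_TELEA)
```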

References

  1. Chuang, Yung-Yu; Agarwala, Aseem; Curless, Brian; Salesin, David H.; Szeliski, Richard (2002). "Video matting of complex scenes". ACM Transactions on Graphics. 21 (3): 243–248. doi:10.1145/566654.566572. ISSN 0730-0301.
  2. Bai, Xue; Wang, Jue; Simons, David; Sapiro, Guillermo (2009). "Video SnapCut". ACM Transactions on Graphics. 28 (3): 1–11. doi:10.1145/1531326.1531376. ISSN 0730-0301.
  3. Levin, A.; Lischinski, D.; Weiss, Y. (2008). "A Closed-Form Solution to Natural Image Matting". IEEE Transactions on Pattern Analysis and Machine Intelligence. 30 (2): 228–242. doi:10.1109/TPAMI.2007.1177. ISSN 0162-8828. PMID 18084055.
  4. Bai, Xue; Wang, Jue; Simons, David (2011). "Towards Temporally-Coherent Video Matting". Computer Vision/Computer Graphics Collaboration Techniques. Lecture Notes in Computer Science. Vol. 6930. pp. 63–74. doi:10.1007/978-3-642-24136-9_6. ISBN 978-3-642-24135-2. ISSN 0302-9743.
  5. Xu, Ning; Price, Brian; Cohen, Scott; Huang, Thomas (2017). "Deep Image Matting". 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 311–320. doi:10.1109/CVPR.2017.41. ISBN 978-1-5386-0457-1. S2CID 14061786.
  6. Erofeev, Mikhail; Gitman, Yury; Vatolin, Dmitriy; Fedorov, Alexey; Wang, Jue (2015). "Perceptually Motivated Benchmark for Video Matting". Proceedings of the British Machine Vision Conference 2015. pp. 99.1–99.12. doi:10.5244/C.29.99. ISBN 978-1-901725-53-7.
  7. Cao, Guangying; Li, Jianwei; Chen, Xiaowu; He, Zhiqiang (2017). "Patch-based self-adaptive matting for high-resolution image and video". The Visual Computer. 35 (1): 133–147. doi:10.1007/s00371-017-1424-3. ISSN 0178-2789. S2CID 24625947.
  8. Zheng, Yuanjie; Kambhamettu, Chandra (2009). "Learning based digital matting". 2009 IEEE 12th International Conference on Computer Vision. IEEE. pp. 889–896. doi:10.1109/iccv.2009.5459326. ISBN 978-1-4244-4420-5.
  9. Karacan, Levent; Erdem, Aykut; Erdem, Erkut (2015). "Image Matting with KL-Divergence Based Sparse Sampling". 2015 IEEE International Conference on Computer Vision (ICCV). pp. 424–432. doi:10.1109/ICCV.2015.56. ISBN 978-1-4673-8391-2. S2CID 2174306.
  10. Wang, Jue; Bhat, Pravin; Colburn, R. Alex; Agrawala, Maneesh; Cohen, Michael F. (2005). "Interactive video cutout". ACM Transactions on Graphics. 24 (3): 585–594. doi:10.1145/1073204.1073233. ISSN 0730-0301.
  11. "Matting plugin for Adobe After Effects". Retrieved 2021-03-02.
  12. Wang, Liang; Gong, Minglun; Zhang, Chenxi; Yang, Ruigang; Zhang, Cha; Yang, Yee-Hong (2011). "Automatic Real-Time Video Matting Using Time-of-Flight Camera and Multichannel Poisson Equations". International Journal of Computer Vision. 97 (1): 104–121. doi:10.1007/s11263-011-0471-x. ISSN 0920-5691. S2CID 255108880.
  13. "Real-Time High Resolution Background Matting". Retrieved 2021-03-02.
  14. "Introducing Background Replace in Skype". Retrieved 2021-03-02.
  15. "Video Completion Benchmark". Retrieved 2021-03-10.