MOVIE Index

The MOtion-tuned Video Integrity Evaluation (MOVIE) index is a model and set of algorithms for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos.

It was developed by Kalpana Seshadrinathan and Alan Bovik in the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin, and was described in the 2010 technical paper "Motion Tuned Spatio-Temporal Quality Assessment of Natural Videos". [1] The original MOVIE paper received an IEEE Signal Processing Society Best Journal Paper Award in 2013.

Model overview

The MOVIE index is a neuroscience-based model for predicting the perceptual quality of a (possibly compressed or otherwise distorted) motion picture or video against a pristine reference video. Thus, the MOVIE index is a full-reference metric. The MOVIE model is quite different from many other models since it uses neuroscience-based models of how the human brain processes visual signals at various stages along the visual pathway, including the lateral geniculate nucleus, primary visual cortex, and in the motion-sensitive extrastriate cortex visual area MT.

The MOVIE index processes spatial and temporal motion picture information in an approximately separable manner. Spatial MOVIE predicts the spatial (frame) quality of a video by computing a space-time frequency decomposition of both the reference and test (distorted) videos with a Gabor filter bank. Following divisive normalization, based on a model of cortical (area V1) processing in the brain, the processed reference and test responses are combined in a weighted difference to produce a prediction of spatial picture quality.
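
The following is a minimal, illustrative sketch of this spatial pathway in Python. The filter bank, normalization constant, and pooling are placeholders chosen for clarity; they are not the published MOVIE parameters, and the function names are hypothetical.

```python
# Illustrative sketch of the Spatial MOVIE idea: decompose reference and test
# frames with a Gabor filter bank, divisively normalize the responses, and
# pool a weighted difference into a per-frame spatial error value.
# Constants and filter choices are simplified, NOT the published parameters.
import numpy as np
from skimage.filters import gabor

def gabor_bank_responses(frame, frequencies=(0.1, 0.2, 0.4), n_orient=4):
    """Magnitude responses of a small 2-D Gabor bank (illustrative only)."""
    frame = np.asarray(frame, dtype=np.float64)
    responses = []
    for f in frequencies:
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            real, imag = gabor(frame, frequency=f, theta=theta)
            responses.append(np.hypot(real, imag))
    return np.stack(responses)          # shape: (n_filters, H, W)

def spatial_quality(ref_frame, test_frame, c=1e-3):
    r = gabor_bank_responses(ref_frame)
    t = gabor_bank_responses(test_frame)
    # Divisive normalization: scale by the locally pooled energy, loosely
    # mimicking contrast gain control in cortical area V1.
    norm = r.sum(axis=0, keepdims=True) + t.sum(axis=0, keepdims=True) + c
    return ((r - t) ** 2 / norm).mean()   # lower = better (an error value)
```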

At the same time, a prediction of the temporal (time-varying, or inter-frame) motion picture quality is computed from the responses of the same Gabor space-time frequency decomposition of the reference and test videos, but in a different manner. Temporal MOVIE applies an excitatory-inhibitory weighting to the Gabor responses, motion-tuning them in accordance with a local measurement of video motion. The motion measurements are themselves made with the same space-time filter bank, using a perceptually relevant, phase-based optical flow computation. These measurements on the reference and test videos are then differentially combined and divisively normalized to produce a prediction of temporal picture quality.
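
A rough sketch of the motion-tuning idea is shown below. It assumes the optical flow and the Gabor response magnitudes at a location are already available; the filter geometry, excitatory-inhibitory weighting, and normalization constants are simplified stand-ins rather than the published model.

```python
# Sketch: weight each spatio-temporal filter by how well its center frequency
# agrees with the locally measured motion (excitatory near the motion plane,
# inhibitory away from it), then compare motion-tuned reference/test responses.
import numpy as np

def motion_tuned_weights(filter_freqs, flow_uv):
    """Excitatory-inhibitory weight for each filter at one location.

    filter_freqs: (n_filters, 3) array of (fx, fy, ft) center frequencies.
    flow_uv: (u, v) optical flow at the location.
    """
    u, v = flow_uv
    # Filters consistent with translation at velocity (u, v) satisfy
    # fx*u + fy*v + ft = 0 (they lie on the motion plane in frequency space).
    distance = np.abs(np.asarray(filter_freqs) @ np.array([u, v, 1.0]))
    distance /= np.linalg.norm(filter_freqs, axis=1) + 1e-8
    return np.exp(-distance ** 2) - 0.5      # >0 excitatory, <0 inhibitory

def temporal_quality(ref_resp, test_resp, weights, c=1e-3):
    """Compare motion-tuned response energies at one location (lower = better)."""
    r = np.maximum(weights * ref_resp, 0).sum()
    t = np.maximum(weights * test_resp, 0).sum()
    return (r - t) ** 2 / (r + t + c)
```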

The overall MOVIE index is then defined as the simple product of the Spatial and Temporal MOVIE indices, pooled over time (frames).
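
A schematic combination, assuming per-frame spatial and temporal error values have already been computed as above (the pooling actually used by MOVIE is more involved than a simple mean), might look like:

```python
# Illustrative combination of the two components into an overall score,
# mirroring the structure (not the exact pooling) of the MOVIE index.
import numpy as np

def movie_index(spatial_per_frame, temporal_per_frame):
    spatial_movie = np.mean(spatial_per_frame)     # pooled over frames
    temporal_movie = np.mean(temporal_per_frame)
    return spatial_movie * temporal_movie
```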

Performance

According to the original paper, the MOVIE index delivers better perceptual motion picture quality predictions than traditional metrics such as the peak signal-to-noise ratio (PSNR) and mean squared error (MSE), which correlate poorly with human visual perception. [1] In the same paper, the authors also show that MOVIE outperforms other video quality models, such as the ANSI-standardized VQM model and the popular Structural Similarity (SSIM) index, in motion picture quality prediction.
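
For context, PSNR and MSE are simple pixel-wise statistics, which is one reason they can disagree with human judgments of video quality:

```python
# Pixel-wise MSE and PSNR between a reference and a test frame.
import numpy as np

def mse(ref, test):
    return np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)

def psnr(ref, test, max_val=255.0):
    m = mse(ref, test)
    return float('inf') if m == 0 else 10.0 * np.log10(max_val ** 2 / m)
```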

In another comparison, the MOVIE Index topped other models in terms of correlation with human judgments of motion picture quality on the LIVE Video Quality Database, which is a tool for assessing the accuracy of picture quality models. [2]

Usage

The MOVIE index is commercially marketed as part of the Video Clarity line of video quality measurement tools, which are used throughout the television and motion picture industries.

Related Research Articles

Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the forms of decisions. Understanding in this context means the transformation of visual images into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.

In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.
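
A toy illustration of the distinction, with a run-length encoder that is exactly invertible (lossless) and a quantizer that discards information (lossy):

```python
# Lossless: run-length encoding removes statistical redundancy and can be
# inverted exactly. Lossy: quantization coarsens values and cannot.
def rle_encode(data):
    out, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        out.append((data[i], j - i))   # (symbol, run length)
        i = j
    return out

def quantize(samples, step=16):
    return [round(s / step) * step for s in samples]   # information is lost
```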

Motion compensation, in computing, is an algorithmic technique used to predict a frame in a video, given the previous and/or future frames, by accounting for the motion of the camera and/or objects in the video. It is employed in the encoding of video data for video compression, for example in the generation of MPEG-2 files. Motion compensation describes a picture in terms of the transformation of a reference picture to the current picture. The reference picture may be earlier in time or even from the future. When images can be accurately synthesized from previously transmitted or stored images, compression efficiency can be improved.
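
A minimal sketch of block-based motion compensation, in which each block of the predicted frame is copied from the reference frame at an offset given by its motion vector (block size and boundary handling are simplified):

```python
# Predict a frame by copying blocks from a reference frame along motion vectors.
import numpy as np

def motion_compensate(reference, motion_vectors, block=16):
    """reference: (H, W) array; motion_vectors: dict {(by, bx): (dy, dx)}."""
    H, W = reference.shape
    predicted = np.zeros_like(reference)
    for (by, bx), (dy, dx) in motion_vectors.items():
        y, x = by * block, bx * block
        sy = np.clip(y + dy, 0, H - block)   # clamp displaced block to the frame
        sx = np.clip(x + dx, 0, W - block)
        predicted[y:y + block, x:x + block] = reference[sy:sy + block, sx:sx + block]
    return predicted
```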

H.261 is an ITU-T video compression standard, first ratified in November 1988. It is the first member of the H.26x family of video coding standards in the domain of the ITU-T Study Group 16 Video Coding Experts Group. It was the first video coding standard that was useful in practical terms.

Motion estimation is the process of determining motion vectors that describe the transformation from one 2D image to another; usually from adjacent frames in a video sequence. It is an ill-posed problem as the motion is in three dimensions but the images are a projection of the 3D scene onto a 2D plane. The motion vectors may relate to the whole image or specific parts, such as rectangular blocks, arbitrary shaped patches or even per pixel. The motion vectors may be represented by a translational model or many other models that can approximate the motion of a real video camera, such as rotation and translation in all three dimensions and zoom.
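
A minimal full-search block-matching sketch, in which the motion vector for each block of the current frame is the offset into the previous frame that minimizes the sum of absolute differences (SAD) within a search window:

```python
# Exhaustive block-matching motion estimation (simple, but slow).
import numpy as np

def estimate_motion(prev, curr, block=16, search=8):
    H, W = curr.shape
    vectors = {}
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            target = curr[y:y + block, x:x + block].astype(np.int32)
            best, best_sad = (0, 0), np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    sy, sx = y + dy, x + dx
                    if 0 <= sy <= H - block and 0 <= sx <= W - block:
                        cand = prev[sy:sy + block, sx:sx + block].astype(np.int32)
                        sad = np.abs(target - cand).sum()
                        if sad < best_sad:
                            best_sad, best = sad, (dy, dx)
            vectors[(y // block, x // block)] = best
    return vectors
```

The vectors returned here could be fed directly to the motion-compensation sketch above.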

Video quality is a characteristic of a video passed through a video transmission or processing system that describes perceived video degradation. Video processing systems may introduce some amount of distortion or artifacts in the video signal that negatively impacts the user's perception of a system. For many stakeholders in video production and distribution, assurance of video quality is an important task.

The structural similarity index measure (SSIM) is a method for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos. SSIM is used for measuring the similarity between two images. The SSIM index is a full-reference metric; in other words, the measurement or prediction of image quality is based on an initial uncompressed or distortion-free image as reference.
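
SSIM can be computed with an off-the-shelf implementation, for example scikit-image's structural_similarity (the synthetic frames below are stand-ins; data_range should match the image bit depth):

```python
# Compute SSIM between a reference frame and a noisy version of it.
import numpy as np
from skimage.metrics import structural_similarity

ref = np.random.randint(0, 256, (128, 128), dtype=np.uint8)            # stand-in reference
test = np.clip(ref + np.random.normal(0, 10, ref.shape), 0, 255).astype(np.uint8)

score = structural_similarity(ref, test, data_range=255)
print(f"SSIM: {score:.3f}")   # 1.0 means identical images
```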

A simple cell in the primary visual cortex is a cell that responds primarily to oriented edges and gratings. These cells were discovered by Torsten Wiesel and David Hubel in the late 1950s.

Image quality can refer to the level of accuracy with which different imaging systems capture, process, store, compress, transmit and display the signals that form an image. Another definition refers to image quality as "the weighted combination of all of the visually significant attributes of an image". The difference between the two definitions is that the former focuses on the characteristics of signal processing in different imaging systems, while the latter focuses on the perceptual assessments that make an image pleasant for human viewers.

Perceptual Evaluation of Video Quality (PEVQ) is an end-to-end (E2E) measurement algorithm that scores the picture quality of a video presentation by means of a 5-point mean opinion score (MOS). It is, therefore, a video quality model. PEVQ was benchmarked by the Video Quality Experts Group (VQEG) in the course of the Multimedia Test Phase 2007–2008. Based on the performance results, in which the accuracy of PEVQ was tested against ratings obtained by human viewers, PEVQ became part of a new international standard.

Alan Conrad Bovik is an American engineer, vision scientist, and educator. He is a professor at the University of Texas at Austin (UT-Austin), where he holds the Cockrell Family Regents Endowed Chair in the Cockrell School of Engineering and is Director of the Laboratory for Image and Video Engineering (LIVE). He is a faculty member in the UT-Austin Department of Electrical and Computer Engineering, the Machine Learning Laboratory, the Institute for Neuroscience, and the Wireless Networking and Communications Group.

Scene statistics is a discipline within the field of perception. It is concerned with the statistical regularities related to scenes. It is based on the premise that a perceptual system is designed to interpret scenes.

VQuad-HD (Objective perceptual multimedia video quality measurement of HDTV) is a video quality testing technology for high-definition video signals. It is a full-reference model, meaning that it requires access to both the original and the degraded signal to estimate quality.

Perceptual Objective Listening Quality Analysis (POLQA) was the working title of an ITU-T standard that covers a model to predict speech quality by means of analyzing digital speech signals. The model was standardized as Recommendation ITU-T P.863 in 2011. The second edition of the standard appeared in 2014, and the third, currently in-force edition was adopted in 2018 under the title Perceptual objective listening quality prediction.

ZPEG is a motion video technology that applies a human visual acuity model to a decorrelated transform-domain space, thereby optimally reducing the redundancies in motion video by removing the subjectively imperceptible. This technology is applicable to a wide range of video processing problems such as video optimization, real-time motion video compression, subjective quality monitoring, and format conversion.

Visual information fidelity (VIF) is a full-reference image quality assessment index based on natural scene statistics and the notion of image information extracted by the human visual system. It was developed by Hamid R. Sheikh and Alan Bovik at the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin in 2006. It is deployed in the core of the Netflix VMAF video quality monitoring system, which controls the picture quality of all encoded videos streamed by Netflix.

Video Multimethod Assessment Fusion (VMAF) is an objective full-reference video quality metric developed by Netflix in cooperation with the University of Southern California, the IPI/LS2N lab at Nantes Université, and the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin. It predicts subjective video quality from a reference and a distorted video sequence. The metric can be used to evaluate the quality of different video codecs, encoders, encoding settings, or transmission variants.

Dynamic texture is texture with motion, as found in videos of sea waves, fire, smoke, wavy trees, etc. A dynamic texture has a spatially repetitive pattern with a time-varying visual appearance. Modeling and analyzing dynamic texture is a topic of image processing and pattern recognition in computer vision.

Jan P. Allebach is an American engineer, educator and researcher known for contributions to imaging science including halftoning, digital image processing, color management, visual perception, and image quality. He is Hewlett-Packard Distinguished Professor of Electrical and Computer Engineering at Purdue University.

Video super-resolution (VSR) is the process of generating high-resolution video frames from given low-resolution video frames. Unlike single-image super-resolution (SISR), the goal is not only to restore fine details while preserving coarse structure, but also to preserve motion consistency.

References

  1. Seshadrinathan, K.; Bovik, A. C. (February 2010). "Motion Tuned Spatio-Temporal Quality Assessment of Natural Videos". IEEE Transactions on Image Processing. 19 (2): 335–350. CiteSeerX 10.1.1.153.9018. doi:10.1109/TIP.2009.2034992. ISSN 1057-7149. PMID 19846374.
  2. Seshadrinathan, K.; Soundararajan, R.; Bovik, A. C.; Cormack, L. K. (June 2010). "Study of Subjective and Objective Quality Assessment of Video". IEEE Transactions on Image Processing. 19 (6): 1427–1441. doi:10.1109/TIP.2010.2042111. ISSN 1057-7149. PMID 20129861.