Developed by | ZPEG, Inc.
---|---
Initial release | 2017
Website | zpeg
ZPEG is a motion video technology that applies a human visual acuity model to a decorrelated transform-domain space, reducing the redundancies in motion video by removing subjectively imperceptible content. This technology is applicable to a wide range of video processing problems, such as video optimization, real-time motion video compression, subjective quality monitoring, and format conversion.
The ZPEG company produces modified versions of x264, x265, AV1, and FFmpeg under the name ZPEG Engine (see § Video optimization).
Pixel distributions are well modeled as a stochastic process, and a transformation to their ideal decorrelated representation is given by the Karhunen–Loève transform (KLT), defined by the Karhunen–Loève theorem. The discrete cosine transform (DCT) is often used as a computationally efficient approximation to the Karhunen–Loève transform for video data, owing to the strong correlation in pixel space typical of video frames. [1] Because correlation in the temporal direction is as high as that in the spatial directions, a three-dimensional DCT may be used to decorrelate motion video. [2]
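As a concrete illustration, the following sketch applies a three-dimensional DCT to a video block using SciPy's separable transform. The 8×8×8 block size (frames × rows × columns) is an assumption for illustration; the transform by itself is lossless, and the decorrelation only pays off once quantization is applied.

```python
# Three-dimensional decorrelation of a video block with a separable DCT.
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
block = rng.random((8, 8, 8)).astype(np.float32)  # temporal, vertical, horizontal

# Forward 3-D DCT concentrates the block's energy into few coefficients.
coeffs = dctn(block, type=2, norm="ortho")

# The inverse transform recovers the original block exactly (to rounding).
recon = idctn(coeffs, type=2, norm="ortho")
assert np.allclose(block, recon, atol=1e-5)
```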
A Human Visual Model may be formulated based on the contrast sensitivity of the visual perception system. [3] A time-varying contrast sensitivity model may be specified that is applicable to the three-dimensional discrete cosine transform (DCT). [4] This three-dimensional contrast sensitivity model is used to generate quantizers for each of the three-dimensional basis vectors, yielding a near-optimal, visually lossless removal of imperceptible motion video content. [5]
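ZPEG's actual contrast sensitivity model is not published; the sketch below is a hypothetical stand-in showing only the general shape of the idea: one quantizer per three-dimensional basis vector, with coarser steps at higher spatio-temporal frequencies and weaker quantization at closer (more negative vB) viewing distances.

```python
# Hypothetical quantizer generation from a contrast-sensitivity falloff.
import numpy as np

def quantizer_matrix(shape=(8, 8, 8), strength_vb=-12.0):
    """Larger quantizers for higher spatio-temporal frequencies (illustrative)."""
    t, v, h = np.meshgrid(*[np.arange(n) for n in shape], indexing="ij")
    radial = np.sqrt(t**2 + v**2 + h**2)   # combined frequency index
    gain = 2.0 ** (strength_vb / 6.0)      # hypothetical: closer viewing removes less
    return 1.0 + radial * gain             # coarser steps where detail is less visible

Q = quantizer_matrix()
print(Q[0, 0, 0], Q[7, 7, 7])  # DC kept fine, highest band quantized hardest
```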
The perceptual strength of the Human Visual Model quantizer generation process is calibrated in visiBels (vB), a logarithmic scale roughly corresponding to perceptibility as measured in screen heights. As the eye moves further from the screen, it becomes less able to perceive detail in the image. The ZPEG model also includes a temporal component, and thus is not fully described by viewing distance alone. In terms of viewing distance, the visiBel value decreases by 6 each time the viewing distance halves. The standard viewing distance for standard-definition television, about 7 screen heights, is defined as 0 vB; the normal viewing distance for high-definition (HD) video, about 3.5 screen heights, is therefore about −6 vB.
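From the calibration above (0 vB at 7 screen heights, falling 6 vB per halving of distance), the spatial part of the scale reduces to a simple conversion. The temporal component of the model is not captured here.

```python
# Conversion between viewing distance (in screen heights) and visiBels,
# following the calibration stated in the text.
import math

def screen_heights_to_vb(d):
    return -6.0 * math.log2(7.0 / d)

def vb_to_screen_heights(vb):
    return 7.0 * 2.0 ** (vb / 6.0)

print(screen_heights_to_vb(7.0))   # 0.0  (standard-definition viewing)
print(screen_heights_to_vb(3.5))   # -6.0 (HD viewing)
```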
The ZPEG pre-processor optimizes motion video sequences for compression by existing motion estimation-based video compressors such as Advanced Video Coding (AVC, H.264) and High Efficiency Video Coding (HEVC, H.265). The human visual acuity model is converted into quantizers that are applied directly to a three-dimensionally transformed block of the motion video sequence, followed by inverse quantization with the same quantizers. The motion video sequence returned from this process is then used as input to the existing compressor.
Applying Human Visual System-generated quantizers to a block-based discrete cosine transform increases the compressibility of a motion video stream by removing imperceptible content. The result is a curated stream, stripped of the fine spatial and temporal detail that the compressor would otherwise be required to reproduce, and one that also yields better matches for motion estimation algorithms. The quantizers are generated to be imperceptible at a specified viewing distance, expressed in visiBels; typical pre-processing strengths in common use are −12 vB and −18 vB (immersive viewing).
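A minimal sketch of this round trip, assuming 8×8×8 blocks and a placeholder quantizer array standing in for the Human Visual Model quantizers, might look as follows:

```python
# Quantize/dequantize round trip in the 3-D DCT domain (pre-processing sketch).
import numpy as np
from scipy.fft import dctn, idctn

def preprocess_block(block, Q):
    coeffs = dctn(block, type=2, norm="ortho")
    quantized = np.round(coeffs / Q)   # discard imperceptible detail
    restored = quantized * Q           # inverse quantization
    return idctn(restored, type=2, norm="ortho")

rng = np.random.default_rng(0)
block = rng.random((8, 8, 8))
Q = 1.0 + np.indices((8, 8, 8)).sum(axis=0)   # placeholder quantizers
smoothed = preprocess_block(block, Q)
```

The output is an ordinary video block, unchanged in format, which is then handed to a conventional encoder such as x264.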
Average compression savings for 6 Mbit/s HD video using the x264 encoder, when processed at −12 vB, is 21.88%. Average compression savings for 16 Mbit/s Netflix 4K test suite video using the x264 encoder, processed at −12 vB, is 29.81%. The same Netflix test suite, compressed for immersive viewing (−18 vB), yields a 25.72% savings. These results are reproducible through a publicly accessible test bed. [6]
While the effects of ZPEG pre-processing are imperceptible to the average viewer at the specified viewing distance, edge effects introduced by block-based transform processing still limit the performance advantage of the video optimization process. Existing deblocking filters may be applied to improve performance, but optimal results are obtained with a multi-plane deblocking algorithm. Each plane is offset by one-half the block size in each of four directions, so that with 8×8 blocks and four planes the plane offsets are (0,0), (0,4), (4,0), and (4,4). [7] Pixel values are then chosen according to their distance from the block edge, with interior pixel values preferred over boundary pixel values. The resulting deblocked video yields substantially better optimization over a wide range of pre-processing strengths.
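The following sketch illustrates the plane-selection idea, with `process_plane` as a hypothetical stand-in for the block-based filter:

```python
# Multi-plane deblocking sketch: run the block filter on four half-block-offset
# grids and keep, for each pixel, the value from the plane where it is most
# interior to its block.
import numpy as np

def multiplane_deblock(frame, process_plane, block=8):
    half = block // 2
    out = np.empty_like(frame)
    best = np.full(frame.shape, -1)
    for dy, dx in [(0, 0), (0, half), (half, 0), (half, half)]:
        shifted = np.roll(frame, (-dy, -dx), axis=(0, 1))
        plane = np.roll(process_plane(shifted), (dy, dx), axis=(0, 1))
        ys = (np.arange(frame.shape[0]) - dy) % block
        xs = (np.arange(frame.shape[1]) - dx) % block
        # Distance of each pixel from the nearest block edge on this grid.
        dist = np.minimum.outer(np.minimum(ys, block - 1 - ys),
                                np.minimum(xs, block - 1 - xs))
        better = dist > best
        out[better], best[better] = plane[better], dist[better]
    return out
```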
Conventional motion compression solutions are based on motion estimation technology. [8] While some transform-domain video codec technologies exist, ZPEG is based on the three-dimensional discrete cosine transform (DCT), [9] where the three dimensions are pixel within line, line within frame, and temporal sequence of frames. The extraction of redundant visual data is performed by the computationally efficient process of quantizing the transform-domain representation of the video, rather than the far more computationally expensive process of searching for object matches between blocks. Quantizer values are derived by applying a Human Visual Model to the basis set of DCT coefficients at a pre-determined perceptual processing strength. All perceptually redundant information is thereby removed from the transform-domain representation of the video. Compression is then completed by an entropy removal process. [10]
Once the viewing conditions under which the compressed content is to be viewed have been chosen, a Human Visual Model generates quantizers for application to the three-dimensional DCT. [11] These quantizers are tuned to remove all imperceptible content from the motion video stream, greatly reducing the entropy of the representation. The viewing conditions, expressed in visiBels, and the correlation of pixels before transformation are recorded for reference by the entropy encoder.
While quantized DCT coefficients have traditionally been modeled as Laplace distributions, [12] more recent work suggests that the Cauchy distribution better models the quantized coefficient distributions. [13] The ZPEG entropy encoder encodes quantized three-dimensional DCT values according to a distribution that is completely characterized by the quantization matrix and the pixel correlations. This side-band information, carried in the compressed stream, enables the decoder to synchronize its internal state with the encoder. [14]
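The exact encoder is proprietary, but the modeling claim can be illustrated by fitting both distributions to a band of quantized coefficients and comparing log-likelihoods. The synthetic data below is an assumption for demonstration.

```python
# Compare Laplace and Cauchy fits to a band of quantized DCT coefficients.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
coeffs = np.round(stats.cauchy.rvs(scale=2.0, size=10_000, random_state=rng))

for dist in (stats.laplace, stats.cauchy):
    params = dist.fit(coeffs, floc=0)        # fix the location at zero
    loglik = dist.logpdf(coeffs, *params).sum()
    print(dist.name, loglik)                 # higher log-likelihood = better fit
```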
Each DCT band is entropy coded independently of all other bands. Coefficients are transmitted in band-wise order, starting with the DC component and proceeding through the successive bands from low resolution to high, similar to wavelet packet decomposition. [15] This convention ensures that the receiver always receives the maximum possible resolution over any bandwidth-limited channel, enabling a no-buffering transmission protocol.
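A minimal sketch of this ordering for an 8×8×8 coefficient block, taking the band index to be the maximum of the three frequency indices (an assumption; the exact grouping is not specified here):

```python
# Band-wise transmission order: DC first, then successive bands low to high,
# so a truncated stream still yields the best available resolution.
import numpy as np

def bandwise_order(coeffs):
    t, v, h = np.indices(coeffs.shape)
    band = np.maximum(np.maximum(t, v), h)   # band 0 is the DC component
    order = np.argsort(band.ravel(), kind="stable")
    return coeffs.ravel()[order]

stream = bandwise_order(np.arange(8 * 8 * 8).reshape(8, 8, 8))
```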
The gold standard for measuring the perceived quality difference between a reference video and its degraded representation is defined in ITU-R Recommendation BT.500. [16] The double-stimulus continuous quality-scale (DSCQS) method rates the perceived difference between the reference and distorted videos, producing an overall difference score derived from individual scores ranging from −3 to 3.
By analogy with the single-stimulus continuous quality-scale (SSCQS) normalized metric, the Mean Opinion Score (MOS), [17] the overall DSCQS score is normalized to the range (−100, 100) and is termed the Differential Mean Opinion Score (DMOS), a measure of subjective video quality. An ideal objective measure correlates strongly with the DMOS score when applied to a reference/impaired video pair. A survey of existing techniques and their relative merits may be found on the Netflix blog. [18] ZPEG extends the list of available techniques with a subjective quality metric generated by comparing the mean squared error of the difference between the reference and impaired videos after pre-processing at various perceptual strengths (in visiBels). The effective viewing distance at which the impairment difference is no longer perceptible is reported as the impairment metric.
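A hedged sketch of this metric, with `zpeg_preprocess` as a hypothetical stand-in for the pre-processor and an arbitrary matching threshold:

```python
# Sweep perceptual strengths from closest viewing outward and report the
# first strength at which the pre-processed reference and impaired clips match.
import numpy as np

def impairment_vb(reference, impaired, zpeg_preprocess, threshold=1e-3):
    for strength_vb in range(-30, 1):          # closest viewing distance first
        a = zpeg_preprocess(reference, strength_vb)
        b = zpeg_preprocess(impaired, strength_vb)
        if np.mean((a - b) ** 2) <= threshold:  # difference no longer perceptible
            return strength_vb
    return None
```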
Statistically ideal format conversion is performed by interpolating video content in discrete cosine transform space. [19] The conversion process, particularly in the case of up-sampling, must account for the ringing artifacts that occur when abrupt discontinuities appear in a sequence of pixels being re-sampled. The resulting algorithm can down-sample or up-sample video formats, changing the frame dimensions, pixel aspect ratio, and frame rate.
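One classical way to interpolate in DCT space is zero-padding of the coefficient array. The sketch below shows 2-D spatial up-sampling in this style; the article's algorithm also handles frame rate and pixel aspect ratio, and its treatment of ringing is omitted here.

```python
# DCT-domain up-sampling: transform, zero-pad the coefficients to the new
# dimensions, and inverse-transform with a gain correction.
import numpy as np
from scipy.fft import dctn, idctn

def dct_upsample(frame, new_shape):
    coeffs = dctn(frame, type=2, norm="ortho")
    padded = np.zeros(new_shape, dtype=coeffs.dtype)
    padded[:coeffs.shape[0], :coeffs.shape[1]] = coeffs
    scale = np.sqrt(np.prod(new_shape) / np.prod(frame.shape))
    return idctn(padded * scale, type=2, norm="ortho")

small = np.random.default_rng(2).random((8, 8))
big = dct_upsample(small, (16, 16))            # 2x spatial up-sampling
```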
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.
JPEG is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography. The degree of compression can be adjusted, allowing a selectable tradeoff between storage size and image quality. JPEG typically achieves 10:1 compression with little perceptible loss in image quality. Since its introduction in 1992, JPEG has been the most widely used image compression standard in the world, and the most widely used digital image format, with several billion JPEG images produced every day as of 2015.
In information technology, lossy compression or irreversible compression is the class of data compression methods that uses inexact approximations and partial data discarding to represent the content. These techniques are used to reduce data size for storing, handling, and transmitting content. Higher degrees of approximation produce coarser images as more detail is removed. This is opposed to lossless data compression, which does not degrade the data. The amount of data reduction possible using lossy compression is much higher than with lossless techniques.
MPEG-1 is a standard for lossy compression of video and audio. It is designed to compress VHS-quality raw digital video and CD audio down to about 1.5 Mbit/s without excessive quality loss, making video CDs, digital cable/satellite TV and digital audio broadcasting (DAB) practical.
Image compression is a type of data compression applied to digital images, to reduce their cost for storage or transmission. Algorithms may take advantage of visual perception and the statistical properties of image data to provide superior results compared with generic data compression methods which are used for other digital data.
Transform coding is a type of data compression for "natural" data like audio signals or photographic images. The transformation is typically lossless on its own but is used to enable better quantization, which then results in a lower quality copy of the original input.
Motion compensation in computing is an algorithmic technique used to predict a frame in a video given the previous and/or future frames by accounting for motion of the camera and/or objects in the video. It is employed in the encoding of video data for video compression, for example in the generation of MPEG-2 files. Motion compensation describes a picture in terms of the transformation of a reference picture to the current picture. The reference picture may be previous in time or even from the future. When images can be accurately synthesized from previously transmitted/stored images, the compression efficiency can be improved.
A video codec is software or hardware that compresses and decompresses digital video. In the context of video compression, codec is a portmanteau of encoder and decoder, while a device that only compresses is typically called an encoder, and one that only decompresses is a decoder.
A discrete cosine transform (DCT) expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. The DCT, first proposed by Nasir Ahmed in 1972, is a widely used transformation technique in signal processing and data compression. It is used in most digital media, including digital images, digital video, digital audio, digital television, digital radio, and speech coding. DCTs are also important to numerous other applications in science and engineering, such as digital signal processing, telecommunication devices, reducing network bandwidth usage, and spectral methods for the numerical solution of partial differential equations.
A compression artifact is a noticeable distortion of media caused by the application of lossy compression. Lossy data compression involves discarding some of the media's data so that it becomes small enough to be stored within the desired disk space or transmitted (streamed) within the available bandwidth. If the compressor cannot store enough data in the compressed version, the result is a loss of quality, or introduction of artifacts. The compression algorithm may not be intelligent enough to discriminate between distortions of little subjective importance and those objectionable to the user.
H.261 is an ITU-T video compression standard, first ratified in November 1988. It is the first member of the H.26x family of video coding standards in the domain of the ITU-T Study Group 16 Video Coding Experts Group. It was the first video coding standard that was useful in practical terms.
H.262 or MPEG-2 Part 2 is a video coding format standardised and jointly maintained by ITU-T Study Group 16 Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG), and developed with the involvement of many companies. It is the second part of the ISO/IEC MPEG-2 standard. The ITU-T Recommendation H.262 and ISO/IEC 13818-2 documents are identical.
In computer vision and image processing, motion estimation is the process of determining motion vectors that describe the transformation from one 2D image to another; usually from adjacent frames in a video sequence. It is an ill-posed problem as the motion happens in three dimensions (3D) but the images are a projection of the 3D scene onto a 2D plane. The motion vectors may relate to the whole image or specific parts, such as rectangular blocks, arbitrary shaped patches or even per pixel. The motion vectors may be represented by a translational model or many other models that can approximate the motion of a real video camera, such as rotation and translation in all three dimensions and zoom.
JPEG XR is an image compression standard for continuous tone photographic images, based on the HD Photo specifications that Microsoft originally developed and patented. It supports both lossy and lossless compression, and is the preferred image format for Ecma-388 Open XML Paper Specification documents.
The macroblock is a processing unit in image and video compression formats based on linear block transforms, typically the discrete cosine transform (DCT). A macroblock typically consists of 16×16 samples, and is further subdivided into transform blocks, and may be further subdivided into prediction blocks. Formats which are based on macroblocks include JPEG, where they are called MCU blocks, H.261, MPEG-1 Part 2, H.262/MPEG-2 Part 2, H.263, MPEG-4 Part 2, and H.264/MPEG-4 AVC. In H.265/HEVC, the macroblock as a basic processing unit has been replaced by the coding tree unit.
PGF is a wavelet-based bitmapped image format that employs lossless and lossy data compression. PGF was created to improve upon and replace the JPEG format. It was developed at the same time as JPEG 2000 but with a focus on speed over compression ratio.
H.120 was the first digital video compression standard. It was developed by the COST 211 European research project and published by the CCITT in 1984, with a revision in 1988 that included contributions proposed by other organizations. The video turned out not to be of adequate quality, there were few implementations, and there are no existing codecs for the format, but it provided important knowledge leading directly to its practical successors, such as H.261. The latest revision was published in March 1993.
In digital image and video processing, a color layout descriptor (CLD) is designed to capture the spatial distribution of color in an image. The feature extraction process consists of two parts: grid based representative color selection and discrete cosine transform with quantization.
A video coding format is a content representation format of digital video content, such as in a data file or bitstream. It typically uses a standardized video compression algorithm, most commonly based on discrete cosine transform (DCT) coding and motion compensation. A specific software, firmware, or hardware implementation capable of compression or decompression in a specific video coding format is called a video codec.
Nasir Ahmed is an Indian-American electrical engineer and computer scientist. He is Professor Emeritus of Electrical and Computer Engineering at University of New Mexico (UNM). He is best known for inventing the discrete cosine transform (DCT) in the early 1970s. The DCT is the most widely used data compression transformation, the basis for most digital media standards and commonly used in digital signal processing. He also described the discrete sine transform (DST), which is related to the DCT.