The structural similarityindex measure (SSIM) is a method for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos. It is also used for measuring the similarity between two images. The SSIM index is a full reference metric; in other words, the measurement or prediction of image quality is based on an initial uncompressed or distortion-free image as reference.
SSIM is a perception-based model that considers image degradation as perceived change in structural information, while also incorporating important perceptual phenomena, including both luminance masking and contrast masking terms. This distinguishes SSIM from other techniques such as mean squared error (MSE) or Peak signal-to-noise ratio (PSNR) that instead estimate absolute errors. Structural information is the idea that the pixels have strong inter-dependencies especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene. Luminance masking is a phenomenon whereby image distortions (in this context) tend to be less visible in bright regions, while contrast masking is a phenomenon whereby distortions become less visible where there is significant activity or "texture" in the image.
The predecessor of SSIM was called Universal Quality Index (UQI), or Wang–Bovik Index, which was developed by Zhou Wang and Alan Bovik in 2001. This evolved, through their collaboration with Hamid Sheikh and Eero Simoncelli, into the current version of SSIM, which was published in April 2004 in the IEEE Transactions on Image Processing . [1] In addition to defining the SSIM quality index, the paper provides a general context for developing and evaluating perceptual quality measures, including connections to human visual neurobiology and perception, and direct validation of the index against human subject ratings.
The basic model was developed in the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin and further developed jointly with the Laboratory for Computational Vision (LCV) at New York University. Further variants of the model have been developed in the Image and Visual Computing Laboratory at University of Waterloo and have been commercially marketed.
SSIM subsequently found strong adoption in the image processing community and in the television and social media industries. The 2004 SSIM paper has been cited over 50,000 times according to Google Scholar, [2] making it one of the highest cited papers in the image processing and video engineering fields. It was recognized with the IEEE Signal Processing Society Best Paper Award for 2009. [3] It also received the IEEE Signal Processing Society Sustained Impact Award for 2016, indicative of a paper having an unusually high impact for at least 10 years following its publication. Because of its high adoption by the television industry, the authors of the original SSIM paper were each accorded a Primetime Engineering Emmy Award in 2015 by the Television Academy.
The SSIM index is calculated on various windows of an image. The measure between two windows and of common size is: [4]
with:
The SSIM formula is based on three comparison measurements between the samples of and : luminance (), contrast () and structure (). The individual comparison functions are: [4]
with, in addition to above definitions:
SSIM is then a weighted combination of those comparative measures:
Setting the weights to 1, the formula can be reduced to the form shown above.
SSIM satisfies the identity of indiscernibles, and symmetry properties, but not the triangle inequality or non-negativity, and thus is not a distance function. However, under certain conditions, SSIM may be converted to a normalized root MSE measure, which is a distance function. [5] The square of such a function is not convex, but is locally convex and quasiconvex, [5] making SSIM a feasible target for optimization.
In order to evaluate the image quality, this formula is usually applied only on luma, although it may also be applied on color (e.g., RGB) values or chromatic (e.g. YCbCr) values. The resultant SSIM index is a decimal value between -1 and 1, where 1 indicates perfect similarity, 0 indicates no similarity, and -1 indicates perfect anti-correlation. For an image, it is typically calculated using a sliding Gaussian window of size 11x11 or a block window of size 8×8. The window can be displaced pixel-by-pixel on the image to create an SSIM quality map of the image. In the case of video quality assessment, [6] the authors propose to use only a subgroup of the possible windows to reduce the complexity of the calculation.
A more advanced form of SSIM, called Multiscale SSIM (MS-SSIM) [4] is conducted over multiple scales through a process of multiple stages of sub-sampling, reminiscent of multiscale processing in the early vision system. It has been shown to perform equally well or better than SSIM on different subjective image and video databases. [4] [7] [8]
Three-component SSIM (3-SSIM) is a form of SSIM that takes into account the fact that the human eye can see differences more precisely on textured or edge regions than on smooth regions. [9] The resulting metric is calculated as a weighted average of SSIM for three categories of regions: edges, textures, and smooth regions. The proposed weighting is 0.5 for edges, 0.25 for the textured and smooth regions. The authors mention that a 1/0/0 weighting (ignoring anything but edge distortions) leads to results that are closer to subjective ratings. This suggests that edge regions play a dominant role in image quality perception.
The authors of 3-SSIM have also extended the model into four-component SSIM (4-SSIM). The edge types are further subdivided into preserved and changed edges by their distortion status. The proposed weighting is 0.25 for all four components. [10]
Structural dissimilarity (DSSIM) may be derived from SSIM, though it does not constitute a distance function as the triangle inequality is not necessarily satisfied.
It is worth noting that the original version SSIM was designed to measure the quality of still images. It does not contain any parameters directly related to temporal effects of human perception and human judgment. [7] A common practice is to calculate the average SSIM value over all frames in the video sequence. However, several temporal variants of SSIM have been developed. [11] [6] [12]
The complex wavelet transform variant of the SSIM (CW-SSIM) is designed to deal with issues of image scaling, translation and rotation. Instead of giving low scores to images with such conditions, the CW-SSIM takes advantage of the complex wavelet transform and therefore yields higher scores to said images. The CW-SSIM is defined as follows:
Where is the complex wavelet transform of the signal and is the complex wavelet transform for the signal . Additionally, is a small positive number used for the purposes of function stability. Ideally, it should be zero. Like the SSIM, the CW-SSIM has a maximum value of 1. The maximum value of 1 indicates that the two signals are perfectly structurally similar while a value of 0 indicates no structural similarity. [13]
The SSIMPLUS index is based on SSIM and is a commercially available tool. [14] It extends SSIM's capabilities, mainly to target video applications. It provides scores in the range of 0–100, linearly matched to human subjective ratings. It also allows adapting the scores to the intended viewing device, comparing video across different resolutions and contents.
According to its authors, SSIMPLUS achieves higher accuracy and higher speed than other image and video quality metrics. However, no independent evaluation of SSIMPLUS has been performed, as the algorithm itself is not publicly available.
In order to further investigate the standard discrete SSIM from a theoretical perspective, the continuous SSIM (cSSIM) [15] has been introduced and studied in the context of Radial basis function interpolation.
SSIMULACRA and SSIMULACRA2 are variants of SSIM developed by Cloudinary with the goal of fitted to subjective opinion data. The variants operate in XYB color space and combine MS-SSIM with two types of asymmetric error maps for blockiness/ringing and smoothing/blur, common compression artifacts. SSIMULACRA2 is part of libjxl, the reference implementation of JPEG XL. [16] [17]
The r* cross-correlation metric is based on the variance metrics of SSIM. It's defined as r*(x, y) = σxy/σxσy when σxσy ≠ 0, 1 when both standard deviations are zero, and 0 when only one is zero. It has found use in analyzing human response to contrast-detail phantoms. [18]
SSIM has also been used on the gradient of images, making it "G-SSIM". G-SSIM is especially useful on blurred images. [19]
The modifications above can be combined. For example, 4-G-r* is a combination of 4-SSIM, G-SSIM, and r*. It is able to reflect radiologist preference for images much better than other SSIM variants tested. [20]
SSIM has applications in a variety of different problems. Some examples are:
Due to its popularity, SSIM is often compared to other metrics, including more simple metrics such as MSE and PSNR, and other perceptual image and video quality metrics. SSIM has been repeatedly shown to significantly outperform MSE and its derivates in accuracy, including research by its own authors and others. [7] [22] [23] [24] [25] [26]
A paper by Dosselmann and Yang claims that the performance of SSIM is "much closer to that of the MSE" than usually assumed. While they do not dispute the advantage of SSIM over MSE, they state an analytical and functional dependency between the two metrics. [8] According to their research, SSIM has been found to correlate as well as MSE-based methods on subjective databases other than the databases from SSIM's creators. As an example, they cite Reibman and Poole, who found that MSE outperformed SSIM on a database containing packet-loss–impaired video. [27] In another paper, an analytical link between PSNR and SSIM was identified. [28]
In signal processing, cross-correlation is a measure of similarity of two series as a function of the displacement of one relative to the other. This is also known as a sliding dot product or sliding inner-product. It is commonly used for searching a long signal for a shorter, known feature. It has applications in pattern recognition, single particle analysis, electron tomography, averaging, cryptanalysis, and neurophysiology. The cross-correlation is similar in nature to the convolution of two functions. In an autocorrelation, which is the cross-correlation of a signal with itself, there will always be a peak at a lag of zero, and its size will be the signal energy.
In statistics, the Bhattacharyya distance is a quantity which represents a notion of similarity between two probability distributions. It is closely related to the Bhattacharyya coefficient, which is a measure of the amount of overlap between two statistical samples or populations.
Peak signal-to-noise ratio (PSNR) is an engineering term for the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Because many signals have a very wide dynamic range, PSNR is usually expressed as a logarithmic quantity using the decibel scale.
Sensor fusion is a process of combining sensor data or data derived from disparate sources so that the resulting information has less uncertainty than would be possible if these sources were used individually. For instance, one could potentially obtain a more accurate location estimate of an indoor object by combining multiple data sources such as video cameras and WiFi localization signals. The term uncertainty reduction in this case can mean more accurate, more complete, or more dependable, or refer to the result of an emerging view, such as stereoscopic vision.
Linear discriminant analysis (LDA), normal discriminant analysis (NDA), canonical variates analysis (CVA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
In probability theory, the Rice distribution or Rician distribution is the probability distribution of the magnitude of a circularly-symmetric bivariate normal random variable, possibly with non-zero mean (noncentral). It was named after Stephen O. Rice (1907–1986).
Video quality is a characteristic of a video passed through a video transmission or processing system that describes perceived video degradation. Video processing systems may introduce some amount of distortion or artifacts in the video signal that negatively impact the user's perception of the system. For many stakeholders in video production and distribution, ensuring video quality is an important task.
Affine shape adaptation is a methodology for iteratively adapting the shape of the smoothing kernels in an affine group of smoothing kernels to the local image structure in neighbourhood region of a specific image point. Equivalently, affine shape adaptation can be accomplished by iteratively warping a local image patch with affine transformations while applying a rotationally symmetric filter to the warped image patches. Provided that this iterative process converges, the resulting fixed point will be affine invariant. In the area of computer vision, this idea has been used for defining affine invariant interest point operators as well as affine invariant texture analysis methods.
In objective video quality assessment, the outliers ratio (OR) is a measure of the performance of an objective video quality metric. It is the ratio of "false" scores given by the objective metric to the total number of scores. The "false" scores are the scores that lie outside the interval
Compressed sensing is a signal processing technique for efficiently acquiring and reconstructing a signal, by finding solutions to underdetermined linear systems. This is based on the principle that, through optimization, the sparsity of a signal can be exploited to recover it from far fewer samples than required by the Nyquist–Shannon sampling theorem. There are two conditions under which recovery is possible. The first one is sparsity, which requires the signal to be sparse in some domain. The second one is incoherence, which is applied through the isometric property, which is sufficient for sparse signals. Compressed sensing has applications in, for example, magnetic resonance imaging (MRI) where the incoherence condition is typically satisfied.
Image quality can refer to the level of accuracy with which different imaging systems capture, process, store, compress, transmit and display the signals that form an image. Another definition refers to image quality as "the weighted combination of all of the visually significant attributes of an image". The difference between the two definitions is that one focuses on the characteristics of signal processing in different imaging systems and the latter on the perceptual assessments that make an image pleasant for human viewers.
Alan Conrad Bovik is an American engineer, vision scientist, and educator. He is a professor at the University of Texas at Austin (UT-Austin), where he holds the Cockrell Family Regents Endowed Chair in the Cockrell School of Engineering and is Director of the Laboratory for Image and Video Engineering (LIVE). He is a faculty member in the UT-Austin Department of Electrical and Computer Engineering, the Machine Learning Laboratory, the Institute for Neuroscience, and the Wireless Networking and Communications Group.
Foreground detection is one of the major tasks in the field of computer vision and image processing whose aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing.
In target tracking, the multi-fractional order estimator (MFOE) is an alternative to the Kalman filter. The MFOE is focused strictly on simple and pragmatic fundamentals along with the integrity of mathematical modeling. Like the KF, the MFOE is based on the least squares method (LSM) invented by Gauss and the orthogonality principle at the center of Kalman's derivation. Optimized, the MFOE yields better accuracy than the KF and subsequent algorithms such as the extended KF and the interacting multiple model (IMM). The MFOE is an expanded form of the LSM, which effectively includes the KF and ordinary least squares (OLS) as subsets. OLS is revolutionized in for application in econometrics. The MFOE also intersects with signal processing, estimation theory, economics, finance, statistics, and the method of moments. The MFOE offers two major advances: (1) minimizing the mean squared error (MSE) with fractions of estimated coefficients and (2) describing the effect of deterministic OLS processing of statistical inputs
The MOtion-tuned Video Integrity Evaluation (MOVIE) index is a model and set of algorithms for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos.
Diagnostically acceptable irreversible compression (DAIC) is the amount of lossy compression which can be used on a medical image to produce a result that does not prevent the reader from using the image to make a medical diagnosis.
Visual information fidelity (VIF) is a full reference image quality assessment index based on natural scene statistics and the notion of image information extracted by the human visual system. It was developed by Hamid R Sheikh and Alan Bovik at the Laboratory for Image and Video Engineering (LIVE) at the University of Texas at Austin in 2006. It is deployed in the core of the Netflix VMAF video quality monitoring system, which controls the picture quality of all encoded videos streamed by Netflix.
The Fréchet inception distance (FID) is a metric used to assess the quality of images created by a generative model, like a generative adversarial network (GAN) or a diffusion model.
Video super-resolution (VSR) is the process of generating high-resolution video frames from the given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore more fine details while saving coarse ones, but also to preserve motion consistency.
Similarity between two different signals is important in the field of signal processing. Below are some common methods for calculating similarity.