Subjective video quality

Subjective video quality is video quality as experienced by humans. It is concerned with how video is perceived by a viewer (also called "observer" or "subject") and designates their opinion on a particular video sequence. It is related to the field of Quality of Experience. Measuring subjective video quality is necessary because objective quality assessment algorithms such as PSNR have been shown to correlate poorly with subjective ratings. Subjective ratings may also be used as ground truth to develop new algorithms.

Subjective video quality tests are psychophysical experiments in which a number of viewers rate a given set of stimuli. These tests are quite expensive in terms of time (preparation and running) and human resources and must therefore be carefully designed.

In subjective video quality tests, typically, SRCs ("Sources", i.e. original video sequences) are treated with various conditions (HRCs for "Hypothetical Reference Circuits") to generate PVSs ("Processed Video Sequences"). [1]

Measurement

The main idea of measuring subjective video quality is similar to the mean opinion score (MOS) evaluation for audio. To evaluate the subjective video quality of a video processing system, the following steps are typically taken: source sequences are selected, test conditions are defined, viewers are recruited and screened, the rating sessions are run under controlled viewing conditions, and the collected scores are analyzed. These steps are described in more detail in the sections below.

Many parameters of the viewing conditions may influence the results, such as room illumination, display type, brightness, contrast, resolution, viewing distance, and the age and educational level of viewers. It is therefore advised to report this information along with the obtained ratings.

Source selection

Typically, a system should be tested with a representative number of different contents and content characteristics. For example, one may select excerpts from contents of different genres, such as action movies, news shows, and cartoons. The length of the source video depends on the purpose of the test, but typically, sequences of no less than 10 seconds are used.

The amount of motion and spatial detail should also cover a broad range. This ensures that the test contains sequences which are of different complexity.
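The amount of spatial detail and motion is commonly quantified with the Spatial Information (SI) and Temporal Information (TI) measures defined in ITU-T P.910: SI is the maximum over time of the standard deviation of the Sobel-filtered luma frame, and TI is the maximum over time of the standard deviation of successive luma frame differences. A minimal sketch using NumPy and SciPy (grayscale luma frames as 2-D arrays are assumed):

```python
import numpy as np
from scipy import ndimage

def spatial_information(frames):
    # SI: max over time of the per-frame std. dev. of the Sobel gradient magnitude
    si_values = []
    for frame in frames:
        f = frame.astype(float)
        sobel_h = ndimage.sobel(f, axis=0)
        sobel_v = ndimage.sobel(f, axis=1)
        si_values.append(np.hypot(sobel_h, sobel_v).std())
    return max(si_values)

def temporal_information(frames):
    # TI: max over time of the std. dev. of successive frame differences
    ti_values = [
        (frames[i].astype(float) - frames[i - 1].astype(float)).std()
        for i in range(1, len(frames))
    ]
    return max(ti_values)
```

Plotting the selected sources in the SI/TI plane is a common way to check that a test set covers a broad range of complexity.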

Sources should be of pristine quality. There should be no visible coding artifacts or other properties that would lower the quality of the original sequence.

Settings

The design of the HRCs depends on the system under study. Typically, multiple independent variables are introduced at this stage, and they are varied with a number of levels. For example, to test the quality of a video codec, independent variables may be the video encoding software, a target bitrate, and the target resolution of the processed sequence.

It is advised to select settings that result in ratings which cover the full quality range. In other words, assuming an Absolute Category Rating scale, the test should show sequences that viewers would rate from bad to excellent.

Viewers

Number of viewers

Viewers are also called "observers" or "subjects". A certain minimum number of viewers should be invited to a study, since a larger number of subjects increases the reliability of the experiment outcome, for example by reducing the standard deviation of averaged ratings. Furthermore, there is a risk of having to exclude subjects for unreliable behavior during rating.

The minimum number of subjects that are required for a subjective video quality study is not strictly defined. According to ITU-T, any number between 4 and 40 is possible, where 4 is the absolute minimum for statistical reasons, and inviting more than 40 subjects has no added value. In general, at least 15 observers should participate in the experiment. They should not be directly involved in picture quality evaluation as part of their work and should not be experienced assessors. [2] In other documents, it is also claimed that at minimum 10 subjects are needed to obtain meaningful averaged ratings. [3]

However, most recommendations for the number of subjects have been designed for measuring video quality encountered by a home television or PC user, where the range and diversity of distortions tend to be limited (e.g., to encoding artifacts only). Given the large ranges and diversity of impairments that may occur on videos captured with mobile devices and/or transmitted over wireless networks, generally, a larger number of human subjects may be required.

Brunnström and Barkowsky have provided calculations for estimating the minimum number of subjects necessary based on existing subjective tests. [4] They claim that in order to ensure statistically significant differences when comparing ratings, a larger number of subjects than usually recommended may be needed.
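The kind of sample-size reasoning discussed above can be illustrated with a standard normal-approximation power calculation for comparing the mean ratings of two conditions. This is a generic statistical sketch, not the procedure from the cited paper; the effect size and rating standard deviation below are illustrative assumptions:

```python
from math import ceil
from statistics import NormalDist

def min_subjects(delta, sigma, alpha=0.05, power=0.8):
    """Approximate number of subjects per condition needed to detect a
    MOS difference `delta` given a rating std. dev. `sigma`, using the
    two-sample normal approximation: n >= 2 * ((z_{a/2} + z_b) * sigma / delta)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)
```

With these assumed values, detecting a half-point MOS difference requires noticeably more subjects than detecting a full-point difference, which is the core of the argument for larger panels when fine distinctions matter.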

Viewer selection

Viewers should be non-experts in the sense of not being professionals in the field of video coding or related domains. This requirement is introduced to avoid potential subject bias. [2]

Typically, viewers are screened for normal vision or corrected-to-normal vision using Snellen charts. Color blindness is often tested with Ishihara plates. [2]

There is an ongoing discussion in the QoE community as to whether a viewer's cultural, social, or economic background has a significant impact on the obtained subjective video quality results. A systematic study involving six laboratories in four countries found no statistically significant impact of subject's language and culture / country of origin on video quality ratings. [5]

Test environment

Subjective quality tests can be done in any environment. However, due to possible influence factors from heterogeneous contexts, it is typically advised to perform tests in a neutral environment, such as a dedicated laboratory room. Such a room may be sound-proofed, with walls painted in neutral grey, and equipped with properly calibrated light sources. Several recommendations specify these conditions. [6] [7] Controlled environments have been shown to result in lower variability in the obtained scores. [5]

Crowdsourcing

Crowdsourcing has recently been used for subjective video quality evaluation, and more generally, in the context of Quality of Experience. [8] Here, viewers give ratings using their own computer, at home, rather than taking part in a subjective quality test in laboratory rooms. While this method allows for obtaining more results than in traditional subjective tests at lower costs, the validity and reliability of the gathered responses must be carefully checked. [9]

Analysis of results

Opinions of viewers are typically averaged into the mean opinion score (MOS). To this aim, the labels of categorical scales may be translated into numbers. For example, the responses "bad" to "excellent" can be mapped to the values 1 to 5, and then averaged. MOS values should always be reported with their statistical confidence intervals so that the general agreement between observers can be evaluated.
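As a sketch, mapping categorical labels to numbers and computing a MOS with a confidence interval might look as follows (the t-based interval is one common choice; the default `t_value` is a placeholder rather than an exact quantile):

```python
from math import sqrt
from statistics import mean, stdev

# Mapping from ACR labels to numeric values (5-point scale)
ACR = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}

def mos_with_ci(labels, t_value=2.0):
    """Mean opinion score with an approximate 95% confidence interval.
    In practice, look up the Student-t quantile for len(labels) - 1
    degrees of freedom instead of the rough default of 2.0."""
    scores = [ACR[label] for label in labels]
    m = mean(scores)
    half_width = t_value * stdev(scores) / sqrt(len(scores))
    return m, (m - half_width, m + half_width)
```

A wide interval signals substantial disagreement between observers, which is why reporting the MOS alone is discouraged.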

Subject screening

Often, additional measures are taken before evaluating the results. Subject screening is a process in which viewers whose ratings are considered invalid or unreliable are rejected from further analysis. Invalid ratings are hard to detect, as subjects may have rated without looking at a video, or cheat during the test. The overall reliability of a subject can be determined by various procedures, some of which are outlined in ITU-R and ITU-T recommendations. [2] [7] For example, the correlation between a person's individual scores and the overall MOS, evaluated for all sequences, is a good indicator of their reliability in comparison with the remaining test participants.

Advanced models

While rating stimuli, humans are subject to biases. These may lead to different and inaccurate scoring behavior and consequently result in MOS values that are not representative of the “true quality” of a stimulus. In recent years, advanced models have been proposed that formally describe the rating process and then attempt to recover true quality scores from noisy subjective ratings. According to Janowski et al., subjects may have an opinion bias that generally shifts their scores, as well as a scoring imprecision that depends on the subject and the stimulus being rated. [10] Li et al. have proposed to differentiate between subject inconsistency and content ambiguity. [11]

Standardized testing methods

There are many ways to select proper sequences, system settings, and test methodologies. A few of them have been standardized. They are thoroughly described in several ITU-R and ITU-T recommendations, among those ITU-R BT.500 [7] and ITU-T P.910. [2] While there is an overlap in certain aspects, the BT.500 recommendation has its roots in broadcasting, whereas P.910 focuses on multimedia content.

A standardized testing method usually describes aspects such as the viewing conditions, the selection and presentation of source material and test conditions, the rating procedure and scale, and the analysis of the collected scores.

Another recommendation, ITU-T P.913, [6] gives researchers more freedom to conduct subjective quality tests in environments different from a typical testing laboratory, while still requiring them to report all details necessary to make such tests reproducible.

Examples

Below, some examples of standardized testing procedures are explained.

Single-Stimulus

  • ACR (Absolute Category Rating): [2] each sequence is rated individually on the ACR scale. The labels on the scale are "bad", "poor", "fair", "good", and "excellent", and they are translated to the values 1, 2, 3, 4 and 5 when calculating the MOS.
  • ACR-HR (Absolute Category Rating with Hidden Reference): a variation of ACR in which the original, unimpaired source sequence is shown in addition to the impaired sequences, without informing the subjects of its presence (hence, "hidden"). The ratings are calculated as differential scores between the reference and the impaired versions. The differential score is defined as the score of the PVS minus the score given to the hidden reference, plus the number of points on the scale. For example, if a PVS is rated as "poor" (2) and its corresponding hidden reference as "good" (4), then the differential score is 2 − 4 + 5 = 3. When these ratings are averaged, the result is not a MOS, but a differential MOS ("DMOS").
  • SSCQE (Single Stimulus Continuous Quality Rating): [7] a longer sequence is rated continuously over time using a slider device (a variation of a fader), on which subjects rate the current quality. Samples are taken in regular intervals, resulting in a quality curve over time rather than a single quality rating.
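The ACR-HR differential score described above can be written out as a short helper (the 5-point ACR scale is assumed):

```python
def differential_score(pvs_score, hidden_ref_score, scale_points=5):
    # ACR-HR: PVS score minus hidden-reference score, plus the number of
    # points on the scale, so an undistorted PVS scores about scale_points
    return pvs_score - hidden_ref_score + scale_points

# Example from the text: PVS rated "poor" (2), hidden reference "good" (4)
assert differential_score(2, 4) == 3
```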

Double-stimulus or multiple stimulus

  • DSCQS (Double Stimulus Continuous Quality Scale): [7] the viewer sees an unimpaired reference and the impaired sequence in a random order. They are allowed to re-view the sequences, and then rate the quality for both on a continuous scale labeled with the ACR categories.
  • DSIS (Double Stimulus Impairment Scale) [7] and DCR (Degradation Category Rating): [2] both refer to the same method. The viewer sees an unimpaired reference video, then the same video impaired, and after that they are asked to vote on the second video using a so-called impairment scale (from "impairments are imperceptible" to "impairments are very annoying").
  • PC (Pair Comparison): [2] instead of comparing an unimpaired and impaired sequence, different impairment types (HRCs) are compared. All possible combinations of HRCs should be evaluated.
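For the PC method, the number of required comparisons grows quadratically with the number of HRCs, which is why PC sessions are comparatively long. Enumerating the pairs is straightforward (the HRC names below are hypothetical):

```python
from itertools import combinations, permutations

# Hypothetical test conditions (HRCs)
hrcs = ["codec_A_1Mbps", "codec_A_2Mbps", "codec_B_1Mbps"]

# Unordered pairs: n * (n - 1) / 2 comparisons
pairs = list(combinations(hrcs, 2))

# If presentation order matters, each pair is shown in both orders
ordered_pairs = list(permutations(hrcs, 2))
```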

Choice of methodology

Which method to choose largely depends on the purpose of the test and possible constraints in time and other resources. Some methods may have fewer context effects (i.e. where the order of stimuli influences the results), which are unwanted test biases. [12] In ITU-T P.910, it is noted that methods such as DCR should be used for testing the fidelity of transmission, especially in high quality systems. ACR and ACR-HR are better suited for qualification tests and – due to giving absolute results – comparison of systems. The PC method has a high discriminatory power, but it requires longer test sessions.

Databases

The results of subjective quality tests, including the stimuli used, are called databases. A number of subjective picture and video quality databases based on such studies have been made publicly available by research institutes. These databases – some of which have become de facto standards – are used by television, cinema, and video engineers around the world to design and test objective quality models, since the developed models can be trained against the obtained subjective data. An overview of publicly available databases has been compiled by the Video Quality Experts Group, and video assets have been made available in the Consumer Digital Video Library.

Related Research Articles

Quality of service (QoS) is the description or measurement of the overall performance of a service, such as a telephony or computer network, or a cloud computing service, particularly the performance seen by the users of the network. To quantitatively measure quality of service, several related aspects of the network service are often considered, such as packet loss, bit rate, throughput, transmission delay, availability, jitter, etc.

Psychophysics quantitatively investigates the relationship between physical stimuli and the sensations and perceptions they produce. Psychophysics has been described as "the scientific study of the relation between stimulus and sensation" or, more completely, as "the analysis of perceptual processes by studying the effect on a subject's experience or behaviour of systematically varying the properties of a stimulus along one or more physical dimensions".


In acoustics, loudness is the subjective perception of sound pressure. More formally, it is defined as the "attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud". The relation of physical attributes of sound to perceived loudness consists of physical, physiological and psychological components. The study of apparent loudness is included in the topic of psychoacoustics and employs methods of psychophysics.

Mean opinion score (MOS) is a measure used in the domain of Quality of Experience and telecommunications engineering, representing overall quality of a stimulus or system. It is the arithmetic mean over all individual "values on a predefined scale that a subject assigns to his opinion of the performance of a system quality". Such ratings are usually gathered in a subjective quality evaluation test, but they can also be algorithmically estimated.

Perceptual Speech Quality Measure (PSQM) is a computational and modeling algorithm defined in Recommendation ITU-T P.861 that objectively evaluates and quantifies voice quality of voice-band speech codecs. It may be used to rank the performance of these speech codecs with differing speech input levels, talkers, bit rates and transcodings. P.861 was withdrawn and replaced by Recommendation ITU-T P.862 (PESQ), which contains an improved speech assessment algorithm.

Video quality is a characteristic of a video passed through a video transmission or processing system that describes perceived video degradation. Video processing systems may introduce some amount of distortion or artifacts in the video signal that negatively impact the user's perception of the system. For many stakeholders in video production and distribution, ensuring video quality is an important task.

A video codec is software or a device that provides encoding and decoding for digital video, and which may or may not include the use of video compression and/or decompression. Most codecs are typically implementations of video coding formats.

Quality of experience (QoE) is a measure of the delight or annoyance of a customer's experiences with a service. QoE focuses on the entire service experience; it is a holistic concept, similar to the field of user experience, but with its roots in telecommunication. QoE is an emerging multidisciplinary field based on social psychology, cognitive science, economics, and engineering science, focused on understanding overall human quality requirements.

MUSHRA stands for Multiple Stimuli with Hidden Reference and Anchor and is a methodology for conducting a codec listening test to evaluate the perceived quality of the output from lossy audio compression algorithms. It is defined by ITU-R recommendation BS.1534-3. The MUSHRA methodology is recommended for assessing "intermediate audio quality". For very small audio impairments, Recommendation ITU-R BS.1116-3 (ABC/HR) is recommended instead.

An ABX test is a method of comparing two choices of sensory stimuli to identify detectable differences between them. A subject is presented with two known samples followed by one unknown sample X that is randomly selected from either A or B. The subject is then required to identify X as either A or B. If X cannot be identified reliably with a low p-value in a predetermined number of trials, then the null hypothesis cannot be rejected and it cannot be proven that there is a perceptible difference between A and B.

Perceptual Evaluation of Audio Quality (PEAQ) is a standardized algorithm for objectively measuring perceived audio quality, developed in 1994–1998 by a joint venture of experts within Task Group 6Q of the International Telecommunication Union's Radiocommunication Sector (ITU-R). It was originally released as ITU-R Recommendation BS.1387 in 1998 and last updated in 2023. It utilizes software to simulate perceptual properties of the human ear and then integrates multiple model output variables into a single metric.

Genista Corporation was a company that used computational models of human visual and auditory systems to measure what human viewers see and hear. The company offered quality measurement technology that estimated the experienced quality that would be measured by a mean opinion score (MOS) resulting from subjective tests using actual human test subjects.

Perceptual Evaluation of Speech Quality (PESQ) is a family of standards comprising a test methodology for automated assessment of the speech quality as experienced by a user of a telephony system. It was standardized as Recommendation ITU-T P.862 in 2001. PESQ is used for objective voice quality testing by phone manufacturers, network equipment vendors and telecom operators. Its usage requires a license. The first edition of PESQ's successor POLQA entered into force in 2011.

Image quality can refer to the level of accuracy with which different imaging systems capture, process, store, compress, transmit and display the signals that form an image. Another definition refers to image quality as "the weighted combination of all of the visually significant attributes of an image". The difference between the two definitions is that the former focuses on the characteristics of signal processing in different imaging systems, and the latter on the perceptual assessments that make an image pleasant for human viewers.

Perceptual Evaluation of Video Quality (PEVQ) is an end-to-end (E2E) measurement algorithm to score the picture quality of a video presentation by means of a 5-point mean opinion score (MOS). It is, therefore, a video quality model. PEVQ was benchmarked by the Video Quality Experts Group (VQEG) in the course of the Multimedia Test Phase 2007–2008. Based on the performance results, in which the accuracy of PEVQ was tested against ratings obtained by human viewers, PEVQ became part of the new International Standard.

Absolute Category Rating (ACR) is a test method used in subjective quality tests.

VQuad-HD (Objective perceptual multimedia video quality measurement of HDTV) is a video quality testing technology for high definition video signals. It is a full-reference model, meaning that it requires access to the original and the degraded signal to estimate the quality.

Perceptual Objective Listening Quality Analysis (POLQA) was the working title of an ITU-T standard that covers a model to predict speech quality by means of analyzing digital speech signals. The model was standardized as Recommendation ITU-T P.863 in 2011. The second edition of the standard appeared in 2014, and the third, currently in-force edition was adopted in 2018 under the title Perceptual objective listening quality prediction.

ZPEG is a motion video technology that applies a human visual acuity model to a decorrelated transform-domain space, thereby optimally reducing the redundancies in motion video by removing the subjectively imperceptible. This technology is applicable to a wide range of video processing problems such as video optimization, real-time motion video compression, subjective quality monitoring, and format conversion.

Video Multimethod Assessment Fusion (VMAF) is an objective full-reference video quality metric developed by Netflix in cooperation with the University of Southern California, The IPI/LS2N lab Nantes Université, and the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin. It predicts subjective video quality based on a reference and distorted video sequence. The metric can be used to evaluate the quality of different video codecs, encoders, encoding settings, or transmission variants.

References

  1. ITU-T Tutorial: Objective perceptual assessment of video quality: Full reference television, 2004.
  2. ITU-T Rec. P.910: Subjective video quality assessment methods for multimedia applications, 2008.
  3. Winkler, Stefan. "On the properties of subjective ratings in video quality experiments". Proc. Quality of Multimedia Experience, 2009.
  4. Brunnström, Kjell; Barkowsky, Marcus (2018-09-25). "Statistical quality of experience analysis: on planning the sample size and statistical significance testing". Journal of Electronic Imaging. 27 (5): 053013. Bibcode:2018JEI....27e3013B. doi:10.1117/1.jei.27.5.053013. ISSN 1017-9909. S2CID 53058660.
  5. Pinson, M. H.; Janowski, L.; Pepion, R.; Huynh-Thu, Q.; Schmidmer, C.; Corriveau, P.; Younkin, A.; Callet, P. Le; Barkowsky, M. (October 2012). "The Influence of Subjects and Environment on Audiovisual Subjective Tests: An International Study" (PDF). IEEE Journal of Selected Topics in Signal Processing. 6 (6): 640–651. Bibcode:2012ISTSP...6..640P. doi:10.1109/jstsp.2012.2215306. ISSN 1932-4553. S2CID 10667847.
  6. ITU-T P.913: Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment, 2014.
  7. ITU-R BT.500: Methodology for the subjective assessment of the quality of television pictures, 2012.
  8. Hossfeld, Tobias (2014-01-15). "Best Practices for QoE Crowdtesting: QoE Assessment With Crowdsourcing". IEEE Transactions on Multimedia. 16 (2): 541–558. doi:10.1109/TMM.2013.2291663. S2CID 16862362.
  9. Hossfeld, Tobias; Hirth, Matthias; Redi, Judith; Mazza, Filippo; Korshunov, Pavel; Naderi, Babak; Seufert, Michael; Gardlo, Bruno; Egger, Sebastian (October 2014). "Best Practices and Recommendations for Crowdsourced QoE – Lessons learned from the Qualinet Task Force "Crowdsourcing"". hal-01078761.
  10. Janowski, Lucjan; Pinson, Margaret (2015). "The Accuracy of Subjects in a Quality Experiment: A Theoretical Subject Model". IEEE Transactions on Multimedia. 17 (12): 2210–2224. doi:10.1109/tmm.2015.2484963. ISSN 1520-9210. S2CID 22343847.
  11. Li, Zhi; Bampis, Christos G. (2017). "Recover Subjective Quality Scores from Noisy Measurements". 2017 Data Compression Conference (DCC). IEEE. pp. 52–61. arXiv:1611.01715. doi:10.1109/dcc.2017.26. ISBN 9781509067213. S2CID 14251604.
  12. Pinson, Margaret; Wolf, Stephen. "Comparing Subjective Video Quality Testing Methodologies". SPIE Video Communications and Image Processing Conference, Lugano, Switzerland, July 2003.