Mean opinion score

Mean opinion score (MOS) is a measure used in the domain of Quality of Experience and telecommunications engineering, representing overall quality of a stimulus or system. It is the arithmetic mean over all individual "values on a predefined scale that a subject assigns to his opinion of the performance of a system quality". [1] Such ratings are usually gathered in a subjective quality evaluation test, but they can also be algorithmically estimated.

MOS is a commonly used measure for video, audio, and audiovisual quality evaluation, but not restricted to those modalities. ITU-T has defined several ways of referring to a MOS in Recommendation ITU-T P.800.1, depending on whether the score was obtained from audiovisual, conversational, listening, talking, or video quality tests.

Rating scales and mathematical definition

The MOS is expressed as a single rational number, typically in the range 1–5, where 1 is the lowest perceived quality and 5 is the highest. Other MOS ranges are possible, depending on the rating scale used in the underlying test. The Absolute Category Rating (ACR) scale is very commonly used; it maps ratings between Bad and Excellent to numbers between 1 and 5, as shown in the table below.

Rating  Label
5       Excellent
4       Good
3       Fair
2       Poor
1       Bad
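
Where ratings are collected or reported programmatically, this mapping is a simple lookup. A minimal sketch in Python follows; the dictionary and helper names are illustrative, not from any standard library.

```python
# The five-point ACR mapping from the table above.
# ACR_LABELS and label_for are illustrative names, not a standard API.
ACR_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}

def label_for(score: int) -> str:
    """Return the ACR label for an integer rating from 1 to 5."""
    return ACR_LABELS[score]

print(label_for(4))  # Good
```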

Other standardized quality rating scales exist in ITU-T Recommendations (such as ITU-T P.800 or ITU-T P.910); for example, a continuous scale ranging from 1 to 100 may be used. Which scale is used depends on the purpose of the test. In certain contexts, there are no statistically significant differences between ratings for the same stimuli when they are obtained using different scales. [2]

The MOS is calculated as the arithmetic mean over single ratings performed by human subjects for a given stimulus in a subjective quality evaluation test. Thus:

$$\mathrm{MOS} = \frac{\sum_{n=1}^{N} R_n}{N}$$

where $R_n$ are the individual ratings for a given stimulus by $N$ subjects.
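As a concrete illustration of this definition, here is a minimal Python sketch using made-up ratings from six hypothetical subjects.

```python
# A minimal sketch of the MOS definition above: the arithmetic mean of the
# individual ratings R_n given by N subjects for one stimulus.
def mean_opinion_score(ratings):
    """MOS = (1/N) * sum(R_n)."""
    return sum(ratings) / len(ratings)

ratings = [5, 4, 4, 3, 5, 4]  # hypothetical ratings on the 1-5 ACR scale
print(round(mean_opinion_score(ratings), 2))  # 4.17
```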

Properties of the MOS

The MOS has certain mathematical properties and is subject to certain biases. In general, there is an ongoing debate on the usefulness of the MOS for quantifying Quality of Experience in a single scalar value. [3]

When the MOS is acquired using a categorical rating scale, it is based on an ordinal scale, similar to Likert scales. In this case, the ranking of the scale items is known, but the intervals between them are not. It is therefore mathematically incorrect to calculate a mean over individual ratings in order to obtain the central tendency; the median should be used instead. [4] In practice, however, and in the definition of the MOS, calculating the arithmetic mean is considered acceptable.
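The difference between the two measures of central tendency can be seen in a short sketch using Python's standard statistics module and made-up ratings.

```python
import statistics

# Illustration of the ordinal-scale caveat: on a categorical scale the
# median is the formally correct central tendency, while the MOS
# definition still uses the arithmetic mean.
ratings = [5, 4, 4, 2, 5, 3, 4]  # made-up example ratings

print(statistics.median(ratings))          # 4
print(round(statistics.mean(ratings), 2))  # 3.86
```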

It has been shown that for categorical rating scales (such as ACR), the individual items are not perceived as equidistant by subjects. For example, there may be a larger "gap" between Good and Fair than between Good and Excellent. The perceived distance may also depend on the language into which the scale is translated. [5] However, some studies have found no significant impact of scale translation on the obtained results. [6]

Several other biases are present in the way MOS ratings are typically acquired. [7] In addition to the above-mentioned issues with scales that are perceived non-linearly, there is a so-called "range-equalization bias": subjects, over the course of a subjective experiment, tend to give scores that span the entire rating scale. This makes it impossible to compare two different subjective tests if the range of presented quality differs. In other words, the MOS is never an absolute measure of quality, but only relative to the test in which it has been acquired.

For the above reasons – and due to several other contextual factors influencing the perceived quality in a subjective test – a MOS value should only be reported if the context in which the values were collected is known and reported as well. MOS values gathered from different contexts and test designs should therefore not be compared directly. Recommendation ITU-T P.800.2 prescribes how MOS values should be reported. Specifically, P.800.2 says:

it is not meaningful to directly compare MOS values produced from separate experiments, unless those experiments were explicitly designed to be compared, and even then the data should be statistically analysed to ensure that such a comparison is valid.
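
As one hedged example of the kind of statistical analysis P.800.2 alludes to, a two-sample significance test can check whether two conditions within the same experiment actually differ. The sketch below uses Welch's t-test via SciPy (assumed to be installed) on made-up ratings.

```python
from scipy import stats  # assumes SciPy is available

# Before claiming one condition outperforms another, test whether the
# difference between the two sets of ratings is statistically significant.
# Both rating lists are made-up example data from the same experiment.
ratings_a = [4, 5, 4, 4, 3, 5, 4, 4]
ratings_b = [3, 4, 3, 4, 3, 3, 4, 3]

t_stat, p_value = stats.ttest_ind(ratings_a, ratings_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```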

MOS for speech and audio quality estimation

MOS historically originates from subjective measurements in which listeners would sit in a "quiet room" and score the quality of a telephone call as they perceived it. This kind of test methodology had been in use in the telephony industry for decades and was standardized in Recommendation ITU-T P.800, which specifies that "the talker should be seated in a quiet room with volume between 30 and 120 m³ and a reverberation time less than 500 ms (preferably in the range 200–300 ms). The room noise level must be below 30 dBA with no dominant peaks in the spectrum." Requirements for other modalities were similarly specified in later ITU-T Recommendations.

MOS estimation using quality models

Obtaining MOS ratings can be time-consuming and expensive, as it requires the recruitment of human assessors. For use cases such as codec development or service quality monitoring, where quality must be estimated repeatedly and automatically, MOS scores can instead be predicted by objective quality models, which are typically developed and trained using human MOS ratings. A question that arises from using such models is whether the MOS differences they produce are noticeable to users. For example, when rating images on a five-point MOS scale, an image with a MOS of 5 is expected to be noticeably better in quality than one with a MOS of 1. In contrast, it is not evident whether an image with a MOS of 3.8 is noticeably better in quality than one with a MOS of 3.6. Research on determining the smallest MOS difference perceptible to users for digital photographs showed that a difference of approximately 0.46 is required for 75% of users to be able to detect the higher-quality image. [8] Nevertheless, quality expectations, and hence MOS, change over time as user expectations evolve. As a result, minimum noticeable MOS differences determined using analytical methods such as in [8] may also change over time.
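
A minimal sketch of how the threshold reported in [8] might be applied in practice follows; the function name is illustrative, and the 0.46 value is specific to that study's conditions (digital photographs, 75% detection), so it should be treated as an example rather than a general constant.

```python
# The ~0.46 MOS difference reported in [8] as the threshold at which 75%
# of users can pick out the higher-quality photograph. Illustrative only.
JND_75 = 0.46

def likely_noticeable(mos_a: float, mos_b: float) -> bool:
    """True if the MOS gap exceeds the 75%-detection threshold from [8]."""
    return abs(mos_a - mos_b) >= JND_75

print(likely_noticeable(3.8, 3.6))  # False: a 0.2 gap is likely imperceptible
print(likely_noticeable(4.3, 3.6))  # True: a 0.7 gap is likely noticeable
```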

Related Research Articles

Psychophysics – Branch of knowledge relating physical stimuli and psychological perception

Psychophysics quantitatively investigates the relationship between physical stimuli and the sensations and perceptions they produce. Psychophysics has been described as "the scientific study of the relation between stimulus and sensation" or, more completely, as "the analysis of perceptual processes by studying the effect on a subject's experience or behaviour of systematically varying the properties of a stimulus along one or more physical dimensions".

Likert scale – Psychometric measurement scale

A Likert scale is a psychometric scale commonly involved in research that employs questionnaires. It is the most widely used approach to scaling responses in survey research, such that the term is often used interchangeably with rating scale, although there are other types of rating scales.

Perceptual Speech Quality Measure (PSQM) is a computational and modeling algorithm defined in Recommendation ITU-T P.861 that objectively evaluates and quantifies voice quality of voice-band speech codecs. It may be used to rank the performance of these speech codecs with differing speech input levels, talkers, bit rates and transcodings. P.861 was withdrawn and replaced by Recommendation ITU-T P.862 (PESQ), which contains an improved speech assessment algorithm.

Video quality is a characteristic of a video passed through a video transmission or processing system that describes perceived video degradation. Video processing systems may introduce some amount of distortion or artifacts in the video signal that negatively impacts the user's perception of a system. For many stakeholders in video production and distribution, assurance of video quality is an important task.

Subjective video quality is video quality as experienced by humans. It is concerned with how video is perceived by a viewer and designates their opinion on a particular video sequence. It is related to the field of Quality of Experience. Measuring subjective video quality is necessary because objective quality assessment algorithms such as PSNR have been shown to correlate poorly with subjective ratings. Subjective ratings may also be used as ground truth to develop new algorithms.

Quality of experience (QoE) is a measure of the delight or annoyance of a customer's experiences with a service. QoE focuses on the entire service experience; it is a holistic concept, similar to the field of user experience, but with its roots in telecommunication. QoE is an emerging multidisciplinary field based on social psychology, cognitive science, economics, and engineering science, focused on understanding overall human quality requirements.

A rating scale is a set of categories designed to elicit information about a quantitative or a qualitative attribute. In the social sciences, particularly psychology, common examples are the Likert response scale and 1–10 rating scales, in which a person selects the number that best reflects the perceived quality of a product.

MUSHRA stands for MUltiple Stimuli with Hidden Reference and Anchor and is a methodology for conducting a codec listening test to evaluate the perceived quality of the output from lossy audio compression algorithms. It is defined by ITU-R recommendation BS.1534-3. The MUSHRA methodology is recommended for assessing "intermediate audio quality". For very small audio impairments, Recommendation ITU-R BS.1116-3 (ABC/HR) is recommended instead.

In statistics, inter-rater reliability is the degree of agreement among independent observers who rate, code, or assess the same phenomenon.

Consensus-based assessment expands on the common practice of consensus decision-making and the theoretical observation that expertise can be closely approximated by large numbers of novices or journeymen. It creates a method for determining measurement standards for very ambiguous domains of knowledge, such as emotional intelligence, politics, religion, values and culture in general. From this perspective, the shared knowledge that forms cultural consensus can be assessed in much the same way as expertise or general intelligence.

An ABX test is a method of comparing two choices of sensory stimuli to identify detectable differences between them. A subject is presented with two known samples followed by one unknown sample X that is randomly selected from either A or B. The subject is then required to identify X as either A or B. If X cannot be identified reliably with a low p-value in a predetermined number of trials, then the null hypothesis cannot be rejected and it cannot be proven that there is a perceptible difference between A and B.

Perceptual Evaluation of Audio Quality (PEAQ) is a standardized algorithm for objectively measuring perceived audio quality, developed in 1994–1998 by a joint venture of experts within Task Group 6Q of the International Telecommunication Union's Radiocommunication Sector (ITU-R). It was originally released as ITU-R Recommendation BS.1387 in 1998 and last updated in 2001. It utilizes software to simulate perceptual properties of the human ear and then integrates multiple model output variables into a single metric. PEAQ characterizes the perceived audio quality as subjects would do in a listening test according to ITU-R BS.1116. PEAQ results principally model mean opinion scores that cover a scale from 1 (bad) to 5 (excellent).

Genista Corporation was a company that used computational models of human visual and auditory systems to measure what human viewers see and hear. The company offered quality measurement technology that estimated the experienced quality that would be measured by a mean opinion score (MOS) resulting from subjective tests using actual human test subjects.

Perceptual Evaluation of Speech Quality (PESQ) is a family of standards comprising a test methodology for automated assessment of the speech quality as experienced by a user of a telephony system. It was standardized as Recommendation ITU-T P.862 in 2001. PESQ is used for objective voice quality testing by phone manufacturers, network equipment vendors and telecom operators. Its usage requires a license. The first edition of PESQ's successor POLQA entered into force in 2011.

Perceptual Evaluation of Video Quality (PEVQ) is an end-to-end (E2E) measurement algorithm to score the picture quality of a video presentation by means of a 5-point mean opinion score (MOS). It is, therefore, a video quality model. PEVQ was benchmarked by the Video Quality Experts Group (VQEG) in the course of the Multimedia Test Phase 2007–2008. Based on the performance results, in which the accuracy of PEVQ was tested against ratings obtained by human viewers, PEVQ became part of a new international standard (ITU-T J.247).

Absolute Category Rating (ACR) is a test method used in quality tests.

VQuad-HD (Objective perceptual multimedia video quality measurement of HDTV) is a video quality testing technology for high definition video signals. It is a full-reference model, meaning that it requires access to the original and the degraded signal to estimate the quality.

Perceptual Objective Listening Quality Analysis (POLQA) was the working title of an ITU-T standard that covers a model to predict speech quality by means of analyzing digital speech signals. The model was standardized as Recommendation ITU-T P.863 in 2011. The second edition of the standard appeared in 2014, and the third, currently in-force edition was adopted in 2018 under the title Perceptual objective listening quality prediction.

The Circuit Merit system is a measurement process designed to assess the voice-to-noise ratio in wired and wireless telephone circuits, especially the AMPS system. Although its reporting scale is sometimes used as input for calculating a mean opinion score, the rating system is officially defined relative to given ranges of voice-to-noise ratios.

ZPEG is a motion video technology that applies a human visual acuity model to a decorrelated transform-domain space, thereby optimally reducing the redundancies in motion video by removing the subjectively imperceptible. This technology is applicable to a wide range of video processing problems such as video optimization, real-time motion video compression, subjective quality monitoring, and format conversion.

References

  1. ITU-T Recommendation P.10/G.100 (2017). Vocabulary for performance, quality of service and quality of experience.
  2. Huynh-Thu, Q.; Garcia, M. N.; Speranza, F.; Corriveau, P.; Raake, A. (2011). "Study of Rating Scales for Subjective Quality Assessment of High-Definition Video". IEEE Transactions on Broadcasting. 57 (1): 1–14. doi:10.1109/TBC.2010.2086750. ISSN 0018-9316.
  3. Hoßfeld, Tobias; Heegaard, Poul E.; Varela, Martín; Möller, Sebastian (2016). "QoE beyond the MOS: an in-depth look at QoE via better metrics and their relation to MOS". Quality and User Experience. 1 (1): 2. arXiv:1607.00321. doi:10.1007/s41233-016-0002-1. ISSN 2366-0139.
  4. Jamieson, Susan (2004). "Likert scales: how to (ab)use them". Medical Education. 38 (12): 1217–1218.
  5. Streijl, Robert C.; Winkler, Stefan; Hands, David S. (2016). "Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives". Multimedia Systems. 22 (2): 213–227.
  6. Pinson, M. H.; Janowski, L.; Pepion, R.; Huynh-Thu, Q.; Schmidmer, C.; Corriveau, P.; Younkin, A.; Le Callet, P.; Barkowsky, M. (2012). "The Influence of Subjects and Environment on Audiovisual Subjective Tests: An International Study". IEEE Journal of Selected Topics in Signal Processing. 6 (6): 640–651. doi:10.1109/jstsp.2012.2215306. ISSN 1932-4553.
  7. Zielinski, Slawomir; Rumsey, Francis; Bech, Søren (2008). "On some biases encountered in modern audio quality listening tests – a review". Journal of the Audio Engineering Society. 56 (6): 427–451.
  8. Katsigiannis, S.; Scovell, J. N.; Ramzan, N.; Janowski, L.; Corriveau, P.; Saad, M.; Van Wallendael, G. (2018). "Interpreting MOS scores, when can users see a difference? Understanding user experience differences for photo quality". Quality and User Experience. 3 (1): 6. doi:10.1007/s41233-018-0019-8. hdl:1854/LU-8581457. ISSN 2366-0139.