MUSHRA

MUSHRA stands for Multiple Stimuli with Hidden Reference and Anchor and is a methodology for conducting a codec listening test to evaluate the perceived quality of the output from lossy audio compression algorithms. It is defined by ITU-R recommendation BS.1534-3. [1] The MUSHRA methodology is recommended for assessing "intermediate audio quality". For very small audio impairments, Recommendation ITU-R BS.1116-3 (ABC/HR) is recommended instead.

The main advantage over the mean opinion score (MOS) methodology, which serves a similar purpose, is that MUSHRA requires fewer participants to obtain statistically significant results. This is because all codecs are presented at the same time, on the same samples, so that a paired t-test or a repeated-measures analysis of variance can be used for statistical analysis. In addition, the 0–100 scale used by MUSHRA makes it possible to rate very small differences.
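
As a rough illustration of the paired analysis mentioned above, the following sketch computes a paired t statistic for two codecs rated by the same listeners on the same item. The scores are invented for illustration, and the helper name `paired_t` is hypothetical; a real analysis would normally use a statistics package and report a p-value or confidence interval.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 ratings")
    # Each listener rated both codecs, so only the per-listener differences matter.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Invented MUSHRA scores (0-100) from eight listeners for one test item.
codec_a = [82, 75, 90, 68, 77, 85, 80, 73]
codec_b = [74, 70, 81, 66, 70, 79, 75, 69]

t_stat, dof = paired_t(codec_a, codec_b)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

Because the listener-to-listener offset cancels in the differences, the paired statistic can be large even when the raw score distributions of the two codecs overlap heavily, which is why fewer participants can suffice.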

In MUSHRA, the listener is presented with the reference (labeled as such), a number of test samples, a hidden version of the reference and one or more anchors. The recommendation specifies that a low-range and a mid-range anchor should be included in the test signals. These are typically a 3.5 kHz and a 7 kHz low-pass version of the reference, respectively. The purpose of the anchors is to calibrate the scale so that minor artifacts are not unduly penalized. This is particularly important when comparing or pooling results from different labs.
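
The idea of deriving anchors by low-pass filtering the reference can be sketched as follows. This is a minimal illustration, not the filter specified by the recommendation: the Hamming-windowed sinc design, the 101-tap length, the 48 kHz sample rate and all function names are assumptions made for the example.

```python
import math

SAMPLE_RATE = 48000  # assumed sample rate in Hz


def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Design a Hamming-windowed sinc low-pass FIR filter with unity DC gain."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * fc
        else:
            ideal = math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    dc_gain = sum(taps)
    return [t / dc_gain for t in taps]


def apply_fir(signal, taps):
    """Causal FIR filtering; the output has the same length as the input."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k in range(min(i + 1, len(taps))):
            acc += taps[k] * signal[i - k]
        out.append(acc)
    return out


def steady_state_rms(signal, taps):
    """RMS of the filtered signal after discarding the start-up transient."""
    settled = apply_fir(signal, taps)[len(taps):]
    return math.sqrt(sum(v * v for v in settled) / len(settled))


def tone(freq_hz, num_samples=2048):
    """Unit-amplitude sine tone standing in for reference audio."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(num_samples)]


low_anchor = lowpass_fir(3500, SAMPLE_RATE)  # low-range anchor filter
mid_anchor = lowpass_fir(7000, SAMPLE_RATE)  # mid-range anchor filter

# A 5 kHz tone lies above the low anchor's cutoff but below the mid anchor's.
rms_5k_low = steady_state_rms(tone(5000), low_anchor)
rms_5k_mid = steady_state_rms(tone(5000), mid_anchor)
rms_1k_low = steady_state_rms(tone(1000), low_anchor)
print(rms_5k_low, rms_5k_mid, rms_1k_low)
```

In this sketch the 5 kHz tone is strongly attenuated by the 3.5 kHz filter and passes the 7 kHz filter almost unchanged, which is the kind of clearly audible, reproducible degradation that makes anchors useful for calibrating the scale.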

Listener behavior

Both MUSHRA and ITU-R BS.1116 tests [2] call for trained expert listeners who know what typical artifacts sound like and where they are likely to occur. Expert listeners have also internalized the rating scale better, which leads to more repeatable results than with untrained listeners. Thus, with trained listeners, fewer listeners are needed to achieve statistically significant results.

It is assumed that preferences are similar for expert listeners and naive listeners, and thus that results from expert listeners are also predictive for consumers. In agreement with this assumption, Schinkel-Bielefeld et al. [3] found no differences in rank order between expert listeners and untrained listeners when using test signals containing only timbre artifacts and no spatial artifacts. However, Rumsey et al. [4] showed that for signals containing spatial artifacts, expert listeners weight spatial artifacts slightly more heavily than untrained listeners, who primarily focus on timbre artifacts.

In addition, it has been shown that expert listeners make more use of the option to listen repeatedly to smaller sections of the signals under test, and that they perform more comparisons between the signals under test and the reference. [3] In contrast to the naive listener, who produces a preference rating, expert listeners therefore produce an audio quality rating: they rate the differences between the signal under test and the uncompressed original, which is the actual goal of a MUSHRA test.

Pre- or post-screening

The MUSHRA guideline mentions several possibilities to assess the reliability of a listener.

The easiest and most common method is to disqualify listeners who rate the hidden reference below 90 MUSHRA points for more than 15 percent of all test items. Because the hidden reference is identical to the reference, it should be rated at 100 MUSHRA points, so a lower rating is a mistake. While the hidden reference and a high-quality signal can occasionally be confused, a rating below 90 should only be given when the listener is certain that the rated signal differs from the original reference.
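
The hidden-reference rule described above can be expressed in a few lines. This is a minimal sketch: the function name `is_reliable`, the data layout and the example ratings are assumptions for illustration, while the thresholds (90 points, 15 percent) are the ones stated in the text.

```python
def is_reliable(hidden_ref_scores, threshold=90, max_fraction=0.15):
    """Return True if the listener passes the hidden-reference post-screening.

    hidden_ref_scores holds one rating of the hidden reference per test item.
    A listener is excluded when more than max_fraction of these ratings fall
    below threshold.
    """
    if not hidden_ref_scores:
        raise ValueError("need at least one hidden-reference rating")
    low = sum(1 for score in hidden_ref_scores if score < threshold)
    return low / len(hidden_ref_scores) <= max_fraction


# Invented hidden-reference ratings for ten test items per listener.
ratings = {
    "listener_a": [100, 100, 95, 100, 92, 100, 100, 98, 100, 100],
    "listener_b": [100, 80, 95, 70, 100, 100, 85, 100, 100, 100],
}

kept = [name for name, scores in ratings.items() if is_reliable(scores)]
print(kept)  # listener_b rated 3 of 10 items below 90 and is excluded
```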

The other possibility for assessing a listener's performance is eGauge, [5] a framework based on the analysis of variance. It computes agreement, repeatability and discriminability, though only the latter two are recommended for pre- or post-screening. Agreement measures how well a listener agrees with the rest of the listeners. Repeatability compares the variance when the same test signal is rated again with the variance across the other test signals, and discriminability measures whether listeners can distinguish between test signals of different conditions. As eGauge requires listening to every test signal twice, it takes more effort to apply than post-screening listeners based on their ratings of the hidden reference. However, a listener who has proven reliable using eGauge can also be considered reliable for future listening tests, provided the character of the test does not change; a reliable listener for stereo listening tests is not necessarily equally good at perceiving artifacts in 5.1 or 22.2 format test items.

Test items

It is important to choose critical test items, that is, items that are difficult to encode and likely to produce artifacts. At the same time, the test items should be ecologically valid: they should be representative of broadcast material rather than synthetic signals specially designed to be difficult to encode. A method for choosing critical material is presented by Ekeroot et al., who propose a ranking-by-elimination procedure. [6] While this is a good way to choose the most critical test items, it does not ensure the inclusion of a variety of test items prone to different artifacts.

Ideally, a MUSHRA test item should have similar characteristics throughout its duration. Otherwise, it can be difficult for the listener to decide on a rating if some parts of the item display different or stronger artifacts than others. [7] Shorter items often lead to less variability than longer ones, as they are more stationary. [8] However, even when trying to choose stationary items, ecologically valid stimuli will very often have sections that are slightly more critical than the rest of the signal. Thus, listeners who focus on different sections of the signal may evaluate it differently. In this case, more critical listeners seem to be better at identifying the most critical regions of a stimulus than less critical listeners. [9]

Language of test items

In ITU-T P.800 tests, [10] which are commonly used to evaluate telephone-quality codecs, the tested speech items should always be in the native language of the listeners. This is not necessary in MUSHRA tests. A study with Mandarin Chinese and German listeners found no significant difference between ratings of foreign-language and native-language test items. However, listeners needed more time and more comparison opportunities when evaluating the foreign-language items. [11] Such compensation is not possible in ITU-T P.800 ACR tests, where items are heard only once and no comparison to the reference is possible. There, foreign-language items are rated as being of lower quality when listeners' language proficiency is low. [12]

References

  1. ITU-R recommendation BS.1534
  2. ITU-R BS.1116 (February 2015). "Methods for the subjective assessment of small impairments in audio systems".
  3. Schinkel-Bielefeld, N.; Lotze, N.; Nagel, F. (May 2013). "Audio quality evaluation by experienced and inexperienced listeners". The Journal of the Acoustical Society of America. 133 (5): 3246. Bibcode:2013ASAJ..133.3246S. doi:10.1121/1.4805210.
  4. Rumsey, Francis; Zielinski, Slawomir; Kassier, Rafael; Bech, Søren (2005-05-31). "Relationships between experienced listener ratings of multichannel audio quality and naïve listener preferences". The Journal of the Acoustical Society of America. 117 (6): 3832–3840. Bibcode:2005ASAJ..117.3832R. doi:10.1121/1.1904305. ISSN 0001-4966. PMID 16018485.
  5. Lorho, Gaëtan; Le Ray, Guillaume; Zacharov, Nick (2010-06-13). "eGauge—A Measure of Assessor Expertise in Audio Quality Evaluations". Proceedings of the Audio Engineering Society. 38th International Conference on Sound Quality Evaluation.
  6. Ekeroot, Jonas; Berg, Jan; Nykänen, Arne (2014-04-25). "Criticality of Audio Stimuli for Listening Tests – Listening Durations During a Ranking Task". 136th Convention of the Audio Engineering Society.
  7. Neuendorf, Max; Nagel, Frederik (2011-10-19). "Exploratory Studies on Perceptual Stationarity in Listening Test - Part I: Real World Signals from Custom Listening Tests".
  8. Nagel, Frederik; Neuendorf, Max (2011-10-19). "Exploratory Studies on Perceptual Stationarity in Listening Test - Part II: Synthetic Signals with Time Varying Artifacts".
  9. Schinkel-Bielefeld, Nadja (2017-05-11). "Audio Quality Evaluation in MUSHRA Tests–Influences between Loop Setting and a Listeners' Ratings". 142nd Convention of the Audio Engineering Society.
  10. ITU-T P.800 (August 1996). "P.800 : Methods for subjective determination of transmission quality".
  11. Schinkel-Bielefeld, Nadja; Zhang, Jiandong; Qin, Yili; Leschanowsky, Anna Katharina; Fu, Shanshan (2017-05-11). "Is it Harder to Perceive Coding Artifact in Foreign Language Items? – A Study with Mandarin Chinese and German Speaking Listeners".
  12. Blašková, Lubica; Holub, Jan (2008). "How do Non-native Listeners Perceive Quality of Transmitted Voice?" (PDF). Communications. 10 (4): 11–15. doi:10.26552/com.C.2008.4.11-14. S2CID 196699038.