Perceptual Objective Listening Quality Analysis

Perceptual Objective Listening Quality Analysis (POLQA) was the working title of an ITU-T standard that covers a model to predict speech quality by means of analyzing digital speech signals. [1] The model was standardized as Recommendation ITU-T P.863 (Perceptual objective listening quality assessment) in 2011. The second edition of the standard appeared in 2014, and the third, currently in-force edition was adopted in 2018 under the title Perceptual objective listening quality prediction. [2]

Measurement scope

POLQA covers a model that predicts speech quality [3] [4] by analysing digital speech signals. The predictions of such objective measures should come as close as possible to the quality scores obtained in subjective listening tests. Usually, a Mean Opinion Score (MOS) is predicted. POLQA uses real speech as a test stimulus for assessing telephony networks.

Technology capabilities

POLQA is the successor of PESQ (Recommendation ITU-T P.862). [5] POLQA avoids weaknesses of the P.862 model and is extended towards the handling of higher-bandwidth audio signals. Further improvements target the handling of time-scaled signals and signals with many delay variations. Like P.862, POLQA supports measurements in the common telephony band (300–3400 Hz), but in addition it has a second operational mode for assessing HD-Voice in wideband and super-wideband speech signals (50–14000 Hz). POLQA also targets the assessment of speech signals recorded acoustically by an artificial head with mouth and ear simulators.

Development history

The POLQA activities started in ITU-T in early 2006 under the working title P.OLQA. In mid-2009, a competition was started to evaluate several candidate models. In May 2010, ITU-T selected candidate models from three companies: OPTICOM, SwissQual / Rohde & Schwarz, and TNO (Netherlands Organisation for Applied Scientific Research). The three companies merged their approaches into a single model, which was adopted as Recommendation ITU-T P.863. [2]

ITU-T’s family of full reference objective voice quality measurements started in 1997 with Recommendation ITU-T P.861 (PSQM), which was superseded by ITU-T P.862 (PESQ) [5] in 2001. P.862 was later complemented with Recommendations ITU-T P.862.1 [6] (mapping of PESQ scores to a MOS scale), ITU-T P.862.2 [7] (wideband measurements) and ITU-T P.862.3 [8] (application guide). The first edition of ITU-T P.863 (POLQA) [2] entered into force in 2011. An Application guide for Recommendation ITU-T P.863 was approved in 2019 and published as ITU-T P.863.1. [9]

In addition to the above listed full reference methods, the list of ITU-T’s objective voice quality measurement standards also includes ITU-T P.563 [10] (no-reference algorithm).

Testing typology

POLQA, like P.862 PESQ, is a Full Reference (FR) algorithm that rates a degraded or processed speech signal in relation to the original signal. It compares each sample of the reference signal (talker side) to the corresponding sample of the degraded signal (listener side). Perceptual differences between the two signals are scored as differences. The psycho-acoustic model is based on models of human perception similar to those used in MP3 or AAC. Basically, the signals are analysed in the frequency domain (in critical bands) after applying masking functions. Unmasked differences between the two signal representations are counted as distortions. Finally, the accumulated distortions in the speech file are mapped onto a 1 to 5 quality scale, as is usual for MOS tests. FR measurements deliver the highest accuracy and repeatability but can only be applied for dedicated tests in live networks (e.g. drive test tools for mobile network benchmarks).

POLQA is a full-reference algorithm and analyzes the speech signal sample-by-sample after a temporal alignment of corresponding excerpts of reference and test signal. POLQA can be applied to provide an end-to-end (E2E) quality assessment for a network, or characterize individual network components.

POLQA results principally model mean opinion scores (MOS) that cover a scale from 1 (bad) to 5 (excellent).

Description of the POLQA algorithm

The inputs to the algorithm are two waveforms, represented by two data vectors containing 16-bit PCM samples. The first vector contains the samples of the (undistorted) reference signal, whereas the second vector contains the samples of the degraded signal. The POLQA algorithm consists of a temporal alignment block, a sample rate estimator with a sample rate converter, which is used to compensate for differences in the sample rate of the input signals, and the actual core model, which performs the MOS calculation.

In a first step, the delay between the two input signals is determined and the sample rate of the two signals relative to each other is estimated. The sample rate estimation is based on the delay information calculated by the temporal alignment. If the sample rates differ by more than approximately 1%, the signal with the higher sample rate is downsampled. After each step, the results are stored together with an average delay reliability indicator, which is a measure of the quality of the delay estimation. The result from the resampling step that yielded the highest overall reliability is finally chosen.

Once the correct delay has been determined and the sample rate differences have been compensated, the signals and the delay information are passed on to the core model, which calculates the perceptibility as well as the annoyance of the distortions and maps them to a MOS scale. A much more detailed and comprehensive description of the algorithm can be found in [2]. The next few sections are only intended to give an overview of the basics of POLQA’s internal structure.
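The idea of deriving both a delay and a relative sample rate from temporal alignment can be illustrated with a minimal Python sketch. It is not the P.863 alignment procedure: it assumes a plain brute-force cross-correlation for the delay, and it assumes that a sample rate mismatch shows up as a delay that drifts linearly over the file. The function names `best_lag`, `relative_rate` and `needs_resampling` are hypothetical and chosen only for this illustration.

```python
import math
import random


def best_lag(ref, deg, max_lag):
    """Return the lag (in samples) at which `deg` best matches `ref`.

    A positive lag means the degraded signal is delayed relative to the
    reference. Brute-force cross-correlation, adequate for short excerpts.
    """
    best, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(deg):
                score += r * deg[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def relative_rate(delay_start, delay_end, span):
    """Estimate the degraded/reference sample rate ratio.

    Assumption for this sketch: a rate mismatch appears as a delay that
    drifts linearly across `span` reference samples.
    """
    return 1.0 + (delay_end - delay_start) / span


def needs_resampling(ratio, tolerance=0.01):
    """Apply the 'more than approximately 1%' rule from the text."""
    return abs(ratio - 1.0) > tolerance


# Toy data: white noise as reference, delayed by 7 samples.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(200)]
deg = [0.0] * 7 + ref
print("estimated delay:", best_lag(ref, deg, max_lag=20))

# A delay growing from 10 to 106 samples over 4800 samples is a 2% mismatch.
ratio = relative_rate(10, 106, 4800)
print("rate ratio:", ratio, "resample:", needs_resampling(ratio))
```

In this toy setup the reliability indicator mentioned above is not modelled; a real implementation would need some measure of how sharply the correlation peaks before trusting the estimated delay.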

The core model

The main element of the core model is the perceptual model, which is calculated four times using different parameters in order to cope with different major distortion types. Those distortion types can be split into additive distortions and subtracted distortions. For both types, a further distinction is made between very strong and weaker effects. The inputs to the perceptual models are the waveforms and the delay information. The output is the Disturbance Density, which is a measure of the perceptibility of distortions in the signals. The perceptual model for the main branch also produces indicators for frequency distortions, noise and reverberation distortions. A subsequent switch, triggered by a detector for very strong distortions, reduces the four Disturbance Density values to two, one for added and one for subtracted distortions.

So far the Disturbance Density is an indicator of the perceptibility of distortions only, and cognitive effects are not yet taken into account. Cognitive aspects are, however, important when human beings are asked to score the quality of what they perceive. Essentially, they convert the perceptibility measure Disturbance Density into an annoyance measure. This conversion is performed by correcting the Disturbance Density values for situations with:

Two further indicators, one for spectral flatness and one for level variations are also calculated in this step.

So far, all operations have been performed on frames with a duration of between approximately 32 ms and 43 ms (depending on the sample rate, and using an overlap of 50%) and for each Bark band separately. In a final step, all indicators are integrated over time and frequency in order to compute the final MOS-LQO value.
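The 32 ms and 43 ms figures follow from simple arithmetic if power-of-two frame lengths are assumed. The short Python sketch below works this out; the frame lengths in `FRAME_LENGTHS` (256, 512 and 2048 samples at 8, 16 and 48 kHz) are an assumption chosen to reproduce the durations quoted above and are not taken from the text.

```python
# Assumed frame lengths per sample rate (illustrative; see lead-in).
FRAME_LENGTHS = {8000: 256, 16000: 512, 48000: 2048}


def frame_timing(sample_rate, frame_len, overlap=0.5):
    """Return (frame duration in ms, hop in samples, hop in ms)."""
    duration_ms = 1000.0 * frame_len / sample_rate
    hop = int(frame_len * (1.0 - overlap))
    hop_ms = 1000.0 * hop / sample_rate
    return duration_ms, hop, hop_ms


for rate, n in FRAME_LENGTHS.items():
    duration_ms, hop, hop_ms = frame_timing(rate, n)
    print(f"{rate} Hz: {n} samples = {duration_ms:.1f} ms, "
          f"hop {hop} samples ({hop_ms:.1f} ms)")
```

With 50% overlap a new frame starts every half frame, so under these assumptions the analysis advances in steps of 16 ms at 8 and 16 kHz and about 21.3 ms at 48 kHz.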

The perceptual model

The key concept inside the perceptual model is idealization. The idea behind this is that POLQA is supposed to simulate Absolute Category Rating (ACR) tests. In an ACR test, however, subjects have no comparison to the actual reference signal when they score a speech signal. Instead, it is assumed that subjects have an understanding of what an ideal signal sounds like and use this as their own reference. Consequently, if they are asked to score a reference signal which is not absolutely perfect (e.g. it has the wrong volume or contains too much timbre, noise or reverberation), it will be scored worse than perfect. In its idealization step, POLQA therefore corrects small imperfections of the reference signal in order to derive the same ideal reference for the comparison to the degraded signal as human subjects would use in their minds. Similar to the idealization of the reference signal, some distortions present in the degraded signal which are hardly perceptible in an ACR test are partially compensated (e.g. small pitch shifts, linear frequency distortions).

The perceptual model starts by scaling the reference signal to an ideal average active speech level of approximately -26 dBov. No such scaling is performed on the degraded signal. It is assumed that any deviation of the level of the degraded signal from the ideal -26 dBov is to be scored as a degradation of the signal. Next, the spectra of both signals are computed using an FFT with 50% overlapping frames with a duration of between 32 ms and 43 ms (depending on the sample rate). Subsequently, small pitch shifts of the degraded signal are eliminated (Frequency Dewarping). The spectra are then transformed to a psychoacoustically motivated pitch scale by combining individual spectral lines (FFT bins) into so-called critical bands. The pitch scale used is similar to the Bark scale, with an average resolution of 0.3 Bark per band. The result is the Pitch Power Density.

At this stage, the first three distortion indicators, for frequency response distortions, additive noise and room reverberation, are calculated. After this, the excitation of each band is derived. This includes the modelling of masking effects in the frequency domain as well as in the temporal domain. The result, for each frame of each signal, is a head-internal representation which indicates roughly how loud each frequency component would be perceived. A further idealization step of the reference signal then takes place by filtering out excessive timbre and low-level stationary noise. At the same time, linear frequency distortions and stationary noise are partially removed from the degraded signal. A subtraction of the idealized excitations finally leads to the Disturbance Density, which is a measure of the audibility of distortions.
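Two of the steps described above, scaling the reference to about -26 dBov and grouping FFT bins into bands on a Bark-like pitch scale, can be illustrated with a short Python sketch. It makes several simplifying assumptions that are not part of the text: the level is a plain RMS over all samples rather than an active speech level, 0 dBov is taken as an RMS of 32768 for 16-bit PCM, the pitch scale is Traunmüller's Bark approximation, and bands are exactly 0.3 Bark wide. All function names are hypothetical.

```python
import math

FULL_SCALE = 32768.0   # assumed overload point for 16-bit PCM
TARGET_DBOV = -26.0    # ideal level named in the text


def level_dbov(samples):
    """RMS level in dBov (simplification: pauses are not excluded)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("silent signal has no defined level")
    return 20.0 * math.log10(rms / FULL_SCALE)


def scale_to_target(samples, target=TARGET_DBOV):
    """Apply one gain so that the signal sits at the target level."""
    gain = 10.0 ** ((target - level_dbov(samples)) / 20.0)
    return [s * gain for s in samples]


def hz_to_bark(f):
    """Traunmueller's approximation of the Bark scale (assumed here)."""
    return 26.81 * f / (1960.0 + f) - 0.53


def group_into_bands(bin_powers, sample_rate, frame_len, width=0.3):
    """Sum FFT bin powers into bands that are `width` Bark wide.

    With short frames some low-frequency bands receive no bin and stay
    at 0.0; this sketch does not attempt to smooth over that.
    """
    bands = {}
    for k, power in enumerate(bin_powers):
        f = k * sample_rate / frame_len
        idx = int(max(hz_to_bark(f), 0.0) // width)
        bands[idx] = bands.get(idx, 0.0) + power
    return [bands.get(i, 0.0) for i in range(max(bands) + 1)]


# Usage: a 1 kHz tone at 8 kHz sampling, scaled to the target level.
tone = [10000.0 * math.sin(2.0 * math.pi * 1000.0 * n / 8000.0)
        for n in range(800)]
scaled = scale_to_target(tone)
print(round(level_dbov(scaled), 3))

# A flat spectrum of 129 bins (256-point frame) grouped into bands.
pitch_power = group_into_bands([1.0] * 129, 8000, 256)
print(len(pitch_power), sum(pitch_power))
```

Grouping by summation keeps the total power unchanged, which is why the band powers in the usage example add up to the number of input bins.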

POLQA in research

POLQA has been used to investigate the impact of tone languages and non-native listening on speech quality measurement. [11]

References

  1. "POLQA - The Next-Generation Mobile Voice Quality Testing Standard". www.polqa.info. Retrieved 2021-04-11.
  2. "P.863 : Perceptual objective listening quality prediction". www.itu.int. Retrieved 2021-04-11.
  3. Beerends, John G.; Schmidmer, Christian; Berger, Jens; Obermann, Matthias; Ullmann, Raphael; Pomy, Joachim; Keyhl, Michael (2013-07-08). "Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part I—Temporal Alignment". Journal of the Audio Engineering Society. 61 (6): 366–384.
  4. Beerends, John G.; Schmidmer, Christian; Berger, Jens; Obermann, Matthias; Ullmann, Raphael; Pomy, Joachim; Keyhl, Michael (2013-07-08). "Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part II—Perceptual Model". Journal of the Audio Engineering Society. 61 (6): 385–402.
  5. "P.862 : Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs". www.itu.int. Retrieved 2021-04-11.
  6. "P.862.1 : Mapping function for transforming P.862 raw result scores to MOS-LQO". www.itu.int. Retrieved 2021-04-11.
  7. "P.862.2 : Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs". www.itu.int. Retrieved 2021-04-11.
  8. "P.862.3 : Application guide for objective quality measurement based on Recommendations P.862, P.862.1 and P.862.2". www.itu.int. Retrieved 2021-04-11.
  9. "P.863.1 : Application guide for Recommendation ITU-T P.863". www.itu.int. Retrieved 2021-04-11.
  10. "P.563 : Single-ended method for objective speech quality assessment in narrow-band telephony applications". www.itu.int. Retrieved 2021-04-11.
  11. D. Ebem (University of Nigeria); et al. (2011). "The Impact of Tone language and Non-Native Language Listening on Measuring Speech Quality" (PDF). Journal of the Audio Engineering Society. 59 (9, 2011 September): 9.