ABX test

Last updated December 12, 2023

An ABX test is a method of comparing two choices of sensory stimuli to identify detectable differences between them. A subject is presented with two known samples (sample A, the first reference, and sample B, the second reference) followed by one unknown sample X that is randomly selected from either A or B. The subject is then required to identify X as either A or B. If X cannot be identified reliably with a low p-value in a predetermined number of trials, then the null hypothesis cannot be rejected and it cannot be proven that there is a perceptible difference between A and B.

ABX tests can easily be performed as double-blind trials, eliminating any possible unconscious influence from the researcher or the test supervisor. Because samples A and B are provided just prior to sample X, the difference does not have to be discerned using long-term memory or past experience. Thus, the ABX test answers whether or not, under the test circumstances, a perceptual difference can be found.

ABX tests are commonly used in evaluations of digital audio data compression methods; sample A is typically an uncompressed sample, and sample B is a compressed version of A. Audible compression artifacts that indicate a shortcoming in the compression algorithm can be identified with subsequent testing. ABX tests can also be used to compare the different degrees of fidelity loss between two different audio formats at a given bitrate.

ABX tests can be used to audition input, processing, and output components as well as cabling: virtually any audio product or prototype design.

History

The history of ABX testing and naming dates back to 1950 in a paper published by two Bell Labs researchers, W. A. Munson and Mark B. Gardner, titled Standardizing Auditory Tests.^[1]

The purpose of the present paper is to describe a test procedure which has shown promise in this direction and to give descriptions of equipment which have been found helpful in minimizing the variability of the test results. The procedure, which we have called the "ABX" test, is a modification of the method of paired comparisons. An observer is presented with a time sequence of three signals for each judgment he is asked to make. During the first time interval he hears signal A, during the second, signal B, and finally signal X. His task is to indicate whether the sound heard during the X interval was more like that during the A interval or more like that during the B interval. For a threshold test, the A interval is quiet, the B interval is signal, and the X interval is either quiet or signal.

The test has evolved to other variations such as subject control over duration and sequence of testing. One such example was the hardware ABX comparator in 1977, built by the ABX company in Troy, Michigan, and documented by one of its founders, David Clark.^[2]

Refinements to the A/B test
The author's first experience with double-blind audibility testing was as a member of the SMWTMS Audio Club in early 1977. A button was provided which would select at random component A or B. Identifying one of these, the X component was greatly hampered by not having the known A and B available for reference.
This was corrected by using three interlocked pushbuttons, A, B, and X. Once an X was selected, it would remain that particular A or B until it was decided to move on to another random selection.
However, another problem quickly became obvious. There was always an audible relay transition time delay when switching from A to B. When switching from A to X, however, the time delay would be missing if X was really A and present if X was really B. This extraneous cue was removed by inserting a fixed length dropout time when any change was made. The dropout time was selected to be 50 ms which produces a slight consistent click while allowing subjectively instant comparison.

The ABX company is now defunct and hardware comparators in general as commercial offerings extinct. Myriad of software tools exist such as Foobar ABX plug-in for performing file comparisons. But hardware equipment testing requires building custom implementations.

Hardware tests

ABX test equipment utilizing relays to switch between two different hardware paths can help determine if there are perceptual differences in cables and components. Video, audio and digital transmission paths can be compared. If the switching is microprocessor controlled, double-blind tests are possible.

Loudspeaker level and line level audio comparisons could be performed on an ABX test device offered for sale as the ABX Comparator by QSC Audio Products from 1998 to 2004. Other hardware solutions have been fabricated privately by individuals or organizations for internal testing.

Confidence

If only one ABX trial were performed, random guessing would incur a 50% chance of choosing the correct answer, the same as flipping a coin. In order to make a statement having some degree of confidence, many trials must be performed. By increasing the number of trials, the likelihood of statistically asserting a person's ability to distinguish A and B is enhanced for a given confidence level. A 95% confidence level is commonly considered statistically significant.^[2] The company QSC, in the ABX Comparator user manual, recommended a minimum of ten listening trials in each round of tests.^[3]

Results required for a 95% confidence level^[4] (see: P-value)
Number of trials	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25
Minimum number correct	9	9	10	10	11	12	12	13	13	14	15	15	16	16	17	18

QSC recommended that no more than 25 trials be performed, as subject fatigue can set in, making the test less sensitive (less likely to reveal one's actual ability to discern the difference between A and B).^[3] However, a more sensitive test can be obtained by pooling the results from a number of such tests using separate individuals or tests from the same subject conducted in between rest breaks. For a large number of total trials N, a significant result (one with 95% confidence) can be claimed if the number of correct responses exceeds $N/2+{\sqrt {N}}$ . Important decisions are normally based on a higher level of confidence, since an erroneous significant result would be claimed in one of 20 such tests simply by chance.

Software tests

The foobar2000 and the Amarok audio players support software-based ABX testing, the latter using a third-party script. Lacinato ABX is a cross-platform audio testing tool for Linux, Windows, and 64-bit Mac. Lacinato WebABX is a web-based cross-browser audio ABX tool. Open source aveX was mainly developed for Linux which also provides test-monitoring from a remote computer. ABX patcher is an ABX implementation for Max/MSP. More ABX software can be found at the archived PCABX website.

Codec listening tests

A codec listening test is a scientific study designed to compare two or more lossy audio codecs, usually with respect to perceived fidelity or compression efficiency.

Potential flaws

ABX is a type of forced choice testing. A subject's choices can be on merit, i.e. the subject indeed honestly tried to identify whether X seemed closer to A or B. But uninterested or tired subjects might choose randomly without even trying. If not caught, this may dilute the results of other subjects who intently took the test and subject the outcome to Simpson's paradox, resulting in false summary results. Simply looking at the outcome totals of the test (m out of n answers correct) cannot reveal occurrences of this problem.

This problem becomes more acute if the differences are small. The user may get frustrated and simply aim to finish the test by voting randomly. In this regard, forced-choice tests such as ABX tend to favor negative outcomes when differences are small if proper protocols are not used to guard against this problem.

Best practices call for both the inclusion of controls and the screening of subjects:^[5]

A major consideration is the inclusion of appropriate control conditions. Typically, control conditions include the presentation of unimpaired audio materials, introduced in ways that are unpredictable to the subjects. It is the differences between judgement of these control stimuli and the potentially impaired ones that allows one to conclude that the grades are actual assessments of the impairments.

3.2.2 Post-screening of subjects
Post-screening methods can be roughly separated into at least two classes; one is based on inconsistencies compared with the mean result and another relies on the ability of the subject to make correct identifications. The first class is never justifiable. Whenever a subjective listening test is performed with the test method recommended here, the required information for the second class of post-screening is automatically available. A suggested statistical method for doing this is described in Attachment 1.'
The methods are primarily used to eliminate subjects who cannot make the appropriate discriminations. The application of a post-screening method may clarify the tendencies in a test result. However, bearing in mind the variability of subjects’ sensitivities to different artefacts, caution should be exercised.

Other flaws include lack of subject training and familiarization with the test and content selected:

4.1 Familiarization or training phase
Prior to formal grading, subjects must be allowed to become thoroughly familiar with the test facilities, the test environment, the grading process, the grading scales and the methods of their use. Subjects should also become thoroughly familiar with the artefacts under study. For the most sensitive tests they should be exposed to all the material they will be grading later in the formal grading sessions. During familiarization or training, subjects should be preferably together in groups (say, consisting of three subjects), so that they can interact freely and discuss the artefacts they detect with each other.

Other problems might arise from the ABX equipment itself, as outlined by Clark,^[2] where the equipment provides a tell, allowing the subject to identify the source. Lack of transparency of the ABX fixture creates similar problems.

Since auditory tests and many other sensory tests rely on short-term memory, which only lasts a few seconds, it is critical that the test fixture allows the subject to identify short segments that can be compared quickly. Pops and glitches in switching apparatus likewise must be eliminated, as they may dominate or otherwise interfere with the stimuli being tested in what is stored in the subject's short-term memory.

Alternatives

Algorithmic Audio Compression Evaluation

Since ABX testing requires human beings for evaluation of lossy audio codecs, it is time-consuming and costly. Therefore, cheaper approaches have been developed, e.g. PEAQ, which is an implementation of the ODG.

MUSHRA

In MUSHRA, the subject is presented with the reference (labeled as such), a certain number of test samples, a hidden version of the reference and one or more anchors. A 0–100 rating scale makes it possible to rate very small differences, and the hidden version still provides discrimination checks.

Discrimination testing

Alternative general methods are used in discrimination testing, such as paired comparison, duo–trio, and triangle testing. Of these, duo–trio and triangle testing are particularly close to ABX testing. Schematically:

Duo–trio: AXY – one known, two unknown (one equals A, other equals B), test is which unknown is the known: X = A (and Y = B), or Y = A (and X = B).
Triangle: XXY – three unknowns (two are A and one is B or one is A and two are B), test which is the odd one out: Y = 1, Y = 2, or Y = 3.

In this context, ABX testing is also known as "duo–trio" in "balanced reference" mode – both knowns are presented as references, rather than one alone.^[6]

Related Research Articles

In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.

<span class="mw-page-title-main">Lossy compression</span> Data compression approach that reduces data size while discarding or changing some of it

In information technology, lossy compression or irreversible compression is the class of data compression methods that uses inexact approximations and partial data discarding to represent the content. These techniques are used to reduce data size for storing, handling, and transmitting content. The different versions of the photo of the cat on this page show how higher degrees of approximation create coarser images as more details are removed. This is opposed to lossless data compression which does not degrade the data. The amount of data reduction possible using lossy compression is much higher than using lossless techniques.

Windows Media Audio (WMA) is a series of audio codecs and their corresponding audio coding formats developed by Microsoft. It is a proprietary technology that forms part of the Windows Media framework. WMA consists of four distinct codecs. The original WMA codec, known simply as WMA, was conceived as a competitor to the popular MP3 and RealAudio codecs. WMA Pro, a newer and more advanced codec, supports multichannel and high resolution audio. A lossless codec, WMA Lossless, compresses audio data without loss of audio fidelity. WMA Voice, targeted at voice content, applies compression using a range of low bit rates. Microsoft has also developed a digital container format called Advanced Systems Format to store audio encoded by WMA.

In electronics, an analog-to-digital converter is a system that converts an analog signal, such as a sound picked up by a microphone or light entering a digital camera, into a digital signal. An ADC may also provide an isolated measurement such as an electronic device that converts an analog input voltage or current to a digital number representing the magnitude of the voltage or current. Typically the digital output is a two's complement binary number that is proportional to the input, but there are other possibilities.

Sound can be recorded and stored and played using either digital or analog techniques. Both techniques introduce errors and distortions in the sound, and these methods can be systematically compared. Musicians and listeners have argued over the superiority of digital versus analog sound recordings. Arguments for analog systems include the absence of fundamental error mechanisms which are present in digital audio systems, including aliasing and associated anti-aliasing filter implementation, jitter and quantization noise. Advocates of digital point to the high levels of performance possible with digital audio, including excellent linearity in the audible band and low levels of noise and distortion.

Motion JPEG is a video compression format in which each video frame or interlaced field of a digital video sequence is compressed separately as a JPEG image.

Sound quality is typically an assessment of the accuracy, fidelity, or intelligibility of audio output from an electronic device. Quality can be measured objectively, such as when tools are used to gauge the accuracy with which the device reproduces an original sound; or it can be measured subjectively, such as when human listeners respond to the sound or gauge its perceived similarity to another sound.

WavPack is a free and open-source lossless audio compression format and application implementing the format. It is unique in the way that it supports hybrid audio compression alongside normal compression which is similar to how FLAC works. It also supports compressing a wide variety of lossless formats, including various variants of PCM and also DSD as used in SACDs, together with its support for surround audio.

In data compression and psychoacoustics, transparency is the result of lossy data compression accurate enough that the compressed result is perceptually indistinguishable from the uncompressed input, i.e. perceptually lossless.

Sample size determination or estimation is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. In practice, the sample size used in a study is usually determined based on the cost, time, or convenience of collecting the data, and the need for it to offer sufficient statistical power. In complex studies, different sample sizes may be allocated, such as in stratified surveys or experimental designs with multiple treatment groups. In a census, data is sought for an entire population, hence the intended sample size is equal to the population. In experimental design, where a study may be divided into different treatment groups, there may be different sample sizes for each group.

This glossary of statistics and probability is a list of definitions of terms and concepts used in the mathematical sciences of statistics and probability, their sub-disciplines, and related fields. For additional related terms, see Glossary of mathematics and Glossary of experimental design.

A permutation test is an exact statistical hypothesis test making use of the proof by contradiction. A permutation test involves two or more samples. The null hypothesis is that all samples come from the same distribution $. Under the null hypothesis, the distribution of the test statistic is obtained by calculating all possible values of the test statistic under possible rearrangements of the observed data. Permutation tests are, therefore, a form of resampling.$

Α video codec is software or a device that provides encoding and decoding for digital video, and which may or may not include the use of video compression and/or decompression. Most codecs are typically implementations of video coding formats.

The sign test is a statistical method to test for consistent differences between pairs of observations, such as the weight of subjects before and after treatment. Given pairs of observations for each subject, the sign test determines if one member of the pair tends to be greater than the other member of the pair.

A codec listening test is a scientific study designed to compare two or more lossy audio codecs, usually with respect to perceived fidelity or compression efficiency.

Discrimination testing is a technique employed in sensory analysis to determine whether there is a detectable difference among two or more products. The test uses a group of assessors (panellists) with a degree of training appropriate to the complexity of the test to discriminate from one product to another through one of a variety of experimental designs. Though useful, these tests typically do not quantify or describe any differences, requiring a more specifically trained panel under different study design to describe differences and assess significance of the difference.

In engineering, science, and statistics, replication is the process of repeating a study or experiment under the same or similar conditions to support the original claim, which crucial to confirm the accuracy of results as well as for identifying and correcting the flaws in the original experiment. ASTM, in standard E1847, defines replication as "... the repetition of the set of all the treatment combinations to be compared in an experiment. Each of the repetitions is called a replicate."

aptX is a family of proprietary audio codec compression algorithms owned by Qualcomm, with a heavy emphasis on wireless audio applications.

<span class="mw-page-title-main">Sub-band coding</span>

In signal processing, sub-band coding (SBC) is any form of transform coding that breaks a signal into a number of different frequency bands, typically by using a fast Fourier transform, and encodes each one independently. This decomposition is often the first step in data compression for audio and video signals.

Pulse-code modulation (PCM) is a method used to digitally represent sampled analog signals. It is the standard form of digital audio in computers, compact discs, digital telephony and other digital audio applications. In a PCM stream, the amplitude of the analog signal is sampled at uniform intervals, and each sample is quantized to the nearest value within a range of digital steps.

References

↑ Munson, W. A.; Gardner, Mark B. (1950). "Standardizing Auditory Tests". The Journal of the Acoustical Society of America. Acoustical Society of America (ASA). 22 (5): 675. Bibcode:1950ASAJ...22Q.675M. doi: 10.1121/1.1917190 . ISSN 0001-4966.
1 2 3 Clark, David (May 1, 1982). "High-Resolution Subjective Testing Using a Double-Blind Comparator". Journal of the Audio Engineering Society. 30 (5): 330–338. Retrieved 8 October 2016.
1 2 QSC ABX Comparator user manual. (1998) p. 10
↑ David Carlstrom. "Probability of Experimental Result Being the Same as Random Guesses". ABX Web Page. Retrieved 2011-12-14.] at
↑ "Recommendation ITU-R BS.1116-2" (PDF). Retrieved 8 October 2016.
↑ Meilgaard, Morten; Gail Vance Civille; B. Thomas Carr (1999). Sensory evaluation techniques (3 ed.). CRC Press. pp. 68–70. ISBN 0-8493-0276-5.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Munson, W. A.; Gardner, Mark B. (1950). "Standardizing Auditory Tests". The Journal of the Acoustical Society of America. Acoustical Society of America (ASA). 22 (5): 675. Bibcode:1950ASAJ...22Q.675M. doi: 10.1121/1.1917190 . ISSN 0001-4966.

[Clark-2] 1 2 3 Clark, David (May 1, 1982). "High-Resolution Subjective Testing Using a Double-Blind Comparator". Journal of the Audio Engineering Society. 30 (5): 330–338. Retrieved 8 October 2016.

[QSCABX-3] 1 2 QSC ABX Comparator user manual. (1998) p. 10

[ABX_Web_Page-4] David Carlstrom. "Probability of Experimental Result Being the Same as Random Guesses". ABX Web Page. Retrieved 2011-12-14.] at

[Methods_for_the_subjective_assessment_of_small_impairments_in_audio_systems-5] "Recommendation ITU-R BS.1116-2" (PDF). Retrieved 8 October 2016.

[6] Meilgaard, Morten; Gail Vance Civille; B. Thomas Carr (1999). Sensory evaluation techniques (3 ed.). CRC Press. pp. 68–70. ISBN 0-8493-0276-5.

[1]

[2]

[3]

[4]

[5]

[6]