Joint encoding

Last updated October 21, 2024

In audio engineering, joint encoding is the joining of several channels of similar information during encoding in order to obtain higher quality, a smaller file size, or both.

Joint stereo

The term joint stereo has become prominent as the Internet has allowed for the transfer of relatively low bit rate, acceptable-quality audio with modest Internet access speeds. Joint stereo refers to any number of encoding techniques used for this purpose. Two forms are described here, both of which are implemented in various ways with different codecs, such as MP3, AAC and Ogg Vorbis.

Intensity stereo coding

This form of joint stereo uses a technique known as joint frequency encoding, which functions on the principle of sound localization. Human hearing is less acute at perceiving the direction of certain audio frequencies. By exploiting this characteristic, intensity stereo coding can reduce the data rate of an audio stream with little or no perceived change in apparent quality.

More specifically, inter-aural time differences (ITD) dominate human sound localization only at lower frequencies. That leaves inter-aural amplitude differences (IAD) as the dominant location indicator for higher frequencies (the cutoff being ~2 kHz). The idea of intensity stereo coding is to merge the upper spectrum into just one channel (thus reducing overall differences between channels) and to transmit a little side information about how to pan certain frequency regions to recover the IAD cues. ITD is not lost completely in this scheme, however: the shape of the ear is such that ITD can be recovered from IAD if the sound comes from free space, e.g. when played through loudspeakers.^[1]

This type of coding does not perfectly reconstruct the original audio: the loss of information simplifies the stereo image and can produce perceptible compression artifacts. However, for very low bit rates this type of coding usually yields a gain in perceived quality of the audio. It is supported by many audio compression formats (including MP3, AAC, Vorbis and Opus), but not by every encoder.
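The band-merging idea can be illustrated with a minimal sketch. It assumes the signal has already been split into frequency bands, each given as a list of real coefficients. The function names `intensity_encode` and `intensity_decode` and the choice of two per-band gains as side information are hypothetical and are not taken from any particular codec.

```python
import math

def intensity_encode(left_band, right_band):
    """Merge one high-frequency band into a mono band plus two gains.

    Assumption for illustration: the side information is one gain per
    channel, chosen so that each decoded channel keeps its original energy.
    """
    mono = [l + r for l, r in zip(left_band, right_band)]
    e_mono = sum(m * m for m in mono)
    if e_mono == 0.0:
        # Nothing survives the downmix (silence or exact cancellation).
        return mono, (0.0, 0.0)
    gain_left = math.sqrt(sum(l * l for l in left_band) / e_mono)
    gain_right = math.sqrt(sum(r * r for r in right_band) / e_mono)
    return mono, (gain_left, gain_right)

def intensity_decode(mono, gains):
    """Re-pan the mono band: both outputs share one waveform, differing only in level."""
    gain_left, gain_right = gains
    return [gain_left * m for m in mono], [gain_right * m for m in mono]

# A source that is simply panned (right = 0.5 * left) survives the round trip.
mono, gains = intensity_encode([1.0, 2.0, -1.0], [0.5, 1.0, -0.5])
panned_left, panned_right = intensity_decode(mono, gains)

# Two unrelated channels do not: the decoded channels become identical in shape.
mono2, gains2 = intensity_encode([1.0, 0.0], [0.0, 1.0])
merged_left, merged_right = intensity_decode(mono2, gains2)
```

In the second case the per-channel energy is preserved but the waveforms are not, which is the simplification of the stereo image described above.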

M/S stereo coding

M/S stereo coding transforms the left and right channels into a mid channel and a side channel. The mid channel is the sum of the left and right channels, or $M=L+R$. The side channel is the difference of the left and right channels, or $S=L-R$. Unlike intensity stereo coding, M/S coding is a special case of transform coding, and retains the audio perfectly without introducing artifacts. Lossless codecs such as FLAC or Monkey's Audio use M/S stereo coding because of this characteristic.

To reconstruct the original signal, the mid and side channels are added to obtain the left channel, $L=\frac{M+S}{2}$, or subtracted to obtain the right channel, $R=\frac{M-S}{2}$.
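A minimal sketch of the round trip on integer PCM samples follows. The function names `ms_encode` and `ms_decode` are hypothetical. Because $M+S=2L$ and $M-S=2R$ are always even, the halving is exact and the original samples are recovered bit for bit.

```python
def ms_encode(left, right):
    """Return (mid, side) with M = L + R and S = L - R, sample by sample."""
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Invert ms_encode: L = (M + S) / 2 and R = (M - S) / 2."""
    # M + S and M - S are always even, so integer division loses nothing.
    left = [(m + s) // 2 for m, s in zip(mid, side)]
    right = [(m - s) // 2 for m, s in zip(mid, side)]
    return left, right

left = [100, -32768, 32767, 7]
right = [98, -32768, -32768, -7]
mid, side = ms_encode(left, right)
decoded_left, decoded_right = ms_decode(mid, side)
```

Note that the mid and side values need one more bit of range than the inputs, since the sum or difference of two 16-bit samples can exceed the 16-bit range.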

This form of coding is also sometimes known as matrix stereo^[a] and is used in many different forms of audio processing and recording equipment. It is not limited to digital systems and can even be created with passive audio transformers or analog amplifiers. One example of the use of M/S stereo is in FM stereo broadcasting, where $L+R$ modulates the carrier wave and $L-R$ modulates a subcarrier. This enables backwards compatibility with mono equipment, which will only require the mid channel.^[2] Another example of M/S stereo is the stereophonic microgroove record. Lateral motions of a stylus represent the sum of two channels and the vertical motion represents the difference between the channels; two perpendicular coils mechanically decode the channels.^[3]
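The matrix form behind the name can be written out from the definitions $M=L+R$ and $S=L-R$. The decoding matrix is the same matrix scaled by one half:

$$
\begin{pmatrix} M \\ S \end{pmatrix}
=
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\begin{pmatrix} L \\ R \end{pmatrix},
\qquad
\begin{pmatrix} L \\ R \end{pmatrix}
=
\frac{1}{2}
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\begin{pmatrix} M \\ S \end{pmatrix}
$$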

M/S is also a common technique for production of stereo recordings. See Microphone practice § M/S technique.

M/S encoding does not strictly require that the left and right channels use the same weight. In Opus CELT, M/S encoding is combined with an angle parameter, so that different weights can be used to maximize de-correlation.^[4]^: 4.5.1
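The effect of unequal weights can be sketched with a generic angle-parameterised transform. This is not the CELT procedure from the cited paper. The function names, the use of a single angle per block, and the principal-axis rule for choosing that angle are all assumptions made for illustration.

```python
import math

def decorrelating_angle(left, right):
    """Angle of the principal axis of the (left, right) sample cloud.

    Assumption for illustration: the angle is chosen so that the
    resulting mid and side signals have zero cross-correlation.
    """
    e_left = sum(l * l for l in left)
    e_right = sum(r * r for r in right)
    cross = sum(l * r for l, r in zip(left, right))
    return 0.5 * math.atan2(2.0 * cross, e_left - e_right)

def weighted_ms(first, second, theta):
    """Apply the weights (cos, sin) and (sin, -cos) to a pair of channels.

    The matrix is its own inverse, so the same function both encodes
    (left, right) -> (mid, side) and decodes (mid, side) -> (left, right).
    With theta = pi/4 it reduces to plain M/S scaled by 1/sqrt(2).
    """
    c, s = math.cos(theta), math.sin(theta)
    out_a = [c * a + s * b for a, b in zip(first, second)]
    out_b = [s * a - c * b for a, b in zip(first, second)]
    return out_a, out_b

# The right channel is a quieter copy of the left, so equal weights would
# leave energy in the side channel; the fitted angle removes it entirely.
left = [1.0, 2.0, 3.0, 4.0]
right = [0.5, 1.0, 1.5, 2.0]
theta = decorrelating_angle(left, right)
mid, side = weighted_ms(left, right, theta)
restored_left, restored_right = weighted_ms(mid, side, theta)
```

When both channels have equal energy and are positively correlated, the fitted angle is $\pi/4$, which is the equal-weight case.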

A similar form of joining multiple channels is seen in the ambisonics implementation of Opus 1.3. A matrix may be used to mix the spherical harmonic channels together, reducing redundancy.^[5]

Parametric stereo

Parametric stereo is similar to intensity stereo, except that parameters beyond the intensity difference are used. In the MPEG-4 (HE-AAC) version, the intensity difference and time delay difference are used, allowing all bands to be used without hurting localization. HE-AAC also adds "correlation" information, which replicates ambience by synthesizing some difference between channels.^[6]
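Two of these parameters can be illustrated with a minimal sketch that computes them for one frequency band. It uses the textbook definitions of a level difference in decibels and a normalised cross-correlation, not the bitstream syntax of any standard. The function name `stereo_cues` is hypothetical, and the time delay parameter is omitted for brevity.

```python
import math

def stereo_cues(left_band, right_band):
    """Return (mono downmix, level difference in dB, correlation) for one band."""
    e_left = sum(l * l for l in left_band)
    e_right = sum(r * r for r in right_band)
    cross = sum(l * r for l, r in zip(left_band, right_band))
    mono = [(l + r) / 2.0 for l, r in zip(left_band, right_band)]

    eps = 1e-12  # avoids log(0) and division by zero for silent bands
    level_diff_db = 10.0 * math.log10((e_left + eps) / (e_right + eps))
    if e_left > 0.0 and e_right > 0.0:
        correlation = cross / math.sqrt(e_left * e_right)
    else:
        correlation = 0.0
    return mono, level_diff_db, correlation

# A panned source: the left channel is twice as loud, and the channels are fully correlated.
_, panned_db, panned_corr = stereo_cues([2.0, 4.0, -2.0], [1.0, 2.0, -1.0])

# Two unrelated channels of equal level: no level difference, no correlation.
_, wide_db, wide_corr = stereo_cues([1.0, 0.0], [0.0, 1.0])
```

A decoder working from the mono downmix would use the level difference to pan the band and the correlation to decide how much synthetic difference between the channels to add.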

Binaural cue coding (BCC) extends the HE-AAC PS technique to many input channels, all downmixed to one. It uses the same intensity difference (ILD), time delay difference (ITD), and correlation (IC) parameters. MPEG Surround is similar to BCC, but allows downmixing to multiple channels and does not seem to use ITD.^[7]

Joint frequency encoding

Joint frequency encoding is an encoding technique used in audio data compression to reduce the data rate.

The idea is to merge a given frequency range of multiple sound channels together so that the resulting encoding will preserve the sound information of that range not as a bundle of separate channels but as one homogeneous data stream. This will destroy the original channel separation permanently, as the information cannot be accurately reconstructed, but will greatly lessen the amount of required storage space. Only some forms of joint stereo use the joint frequency encoding technique, such as intensity stereo coding.

Implementations

When used within the MP3 compression process, joint stereo normally employs multiple techniques, and can switch between them for each MPEG frame. Typically, a modern encoder's joint stereo mode uses M/S stereo for some frames and L/R stereo for others, whichever method yields the best result. Encoders use different algorithms to determine when to switch and how much space to allocate to each channel; quality can suffer if the switching is too frequent or if the side channel doesn't get enough bits. With some encoding software, it is possible to force the use of M/S stereo for all frames, mimicking the joint stereo mode of some early encoders like Xing. Within the LAME encoder, this is known as forced joint stereo.^[8]
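The per-frame switching can be sketched with a deliberately simple rule. Real encoders such as LAME use more sophisticated decisions. The energy-ratio test, the `threshold` value, and the function names here are hypothetical and serve only to show the shape of the decision.

```python
def choose_stereo_mode(left_frame, right_frame, threshold=0.1):
    """Pick "MS" when the side channel is small relative to the mid channel.

    Assumption for illustration: a frame with little stereo separation
    (low side energy) benefits from M/S, and any other frame stays L/R.
    """
    mid_energy = sum((l + r) ** 2 for l, r in zip(left_frame, right_frame))
    side_energy = sum((l - r) ** 2 for l, r in zip(left_frame, right_frame))
    return "MS" if side_energy <= threshold * mid_energy else "LR"

def plan_frames(left, right, frame_size):
    """Split both channels into frames and choose a mode for each one."""
    return [
        choose_stereo_mode(left[i:i + frame_size], right[i:i + frame_size])
        for i in range(0, len(left), frame_size)
    ]

# First frame: identical channels (no separation). Second frame: opposite polarity.
left = [1, 2, 3, 4, 1, -1, 1, -1]
right = [1, 2, 3, 4, -1, 1, -1, 1]
modes = plan_frames(left, right, frame_size=4)
```

Forcing M/S for all frames corresponds to skipping the test and always returning "MS".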

As with MP3, Ogg Vorbis stereo files can employ either L/R stereo or joint stereo. When using joint stereo, both M/S stereo and intensity stereo methods may be used. As opposed to MP3 where M/S stereo (when used) is applied before quantization, an Ogg Vorbis encoder applies M/S stereo to samples in the frequency domain after quantization, making application of M/S stereo a lossless step. After this step, any frequency area can be converted to intensity stereo by removing the corresponding part of the M/S signal's side channel. Ogg Vorbis' floor function will take care of the required left-right panning.^[citation needed] Opus similarly has support for all three options in the CELT layer; the SILK layer is M/S-only.^[9]

Notes

a. So named because the addition and subtraction can be represented by a matrix.

References

1. F. Baumgarte and C. Faller, “Design and evaluation of binaural cue coding,” in AES 113th Conv., Los Angeles, CA, Oct. 2002.
2. "Stereophonic Broadcasting: Technical Details of Pilot-tone System", Information Sheet 1604(4), BBC Engineering Information Service, June 1970
3. "Stereo disc recording". Archived from the original on 25 September 2006. Retrieved 4 October 2006.
4. Jean-Marc Valin; Gregory Maxwell; Timothy B. Terriberry; Koen Vos (October 17–20, 2013). "High-Quality, Low-Delay Music Coding in the Opus Codec" (PDF). www.xiph.org. New York, NY: Xiph.Org Foundation. p. 2. Archived from the original (PDF) on 14 July 2018. Retrieved 19 August 2014. CELT's look-ahead is 2.5 ms, while SILK's look-ahead is 5 ms, plus 1.5 ms for the resampling (including both encoder and decoder resampling). For this reason, the CELT path in the encoder adds a 4 ms delay. However, an application can restrict the encoder to CELT and omit that delay. This reduces the total look-ahead to 2.5 ms.
5. "Opus 1.3 Released". jmvalin.ca. For all higher-order ambisonics, channel mapping 3 provides a more efficient representation by first transforming the ambisonics signals with a designated mixing matrix before encoding. This 1.3 release provides matrices for first, second, and third order.
6. Purnhagen, Heiko (October 5–8, 2004). "LOW COMPLEXITY PARAMETRIC STEREO CODING IN MPEG-4" (PDF). 7th International Conference on Digital Audio Effects: 163–168.
7. HAN, Chih-Kang. MPEG Surround Codec Acceleration and Implementation on TI DSP Platform (PDF) (MSc).
8. "Detailed command line switches". LAME documentation. Retrieved 2013-12-13. JOINT STEREO [...] means the encoder can use (on a frame by frame basis) either L/R stereo or mid/side stereo. In mid/side stereo, [...] more bits are allocated to the mid channel than the side channel. When there isn't too much stereo separation, this effectively increases the bandwidth, so having higher quality with the same amount of bits. Using mid/side stereo inappropriately can result in audible compression artifacts. Too much switching between mid/side and regular stereo can also sound bad. To determine when to switch to mid/side stereo, LAME uses a much more sophisticated algorithm than the one described in the ISO documentation. FORCED MID/SIDE STEREO forces all frames to be encoded with mid/side stereo. It should only be used if you are sure every frame of the input file has very little stereo separation.
9. RFC 6716, §§ 4.2.1, 4.3

External links

Jürgen Herre, Fraunhofer IIS. From Joint Stereo to Spatial Audio Coding - Recent Progress and Standardization. October 2004, Paper 157, DAFx'04 7th International Conference of Digital Audio Effects.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
